Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Add column with number of duplicated values within groups and number of unique values within groups in pandas

I have a dataframe such as

Groups Names
G1     A
G1     A
G1     B
G1     B
G1     C
G1     C
G1     C
G1     D
G2     A
G2     B
G2     C
G3     A
G3     A
G4     F
G4     F
G4     E

And I would like to count for each Groups the number of duplicated (at least 2 times) of values within the column Names and add this information on a new column called Nb_duplicated. And I would like also to add another column called Number_unique_names which will be the number of unique Names values within each Groups.

I should then get:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Groups Names Nb_duplicated Number_unique_names
G1     A     3             4
G1     A     3             4
G1     B     3             4
G1     B     3             4
G1     C     3             4
G1     C     3             4
G1     C     3             4
G1     D     3             4
G2     A     0             3
G2     B     0             3
G2     C     0             3
G3     A     1             1
G3     A     1             1
G4     F     1             2
G4     F     1             2
G4     E     1             2

>Solution :

You can use compute the number of unique and the number of non-duplicated names (with GroupBy.transform), then subtract the two to get the number of duplicated:

# set up group
g = df.groupby('Groups')
# get unique values
df['unique'] = g['Names'].transform('nunique')
# get non-duplicates
non_dup = g['Names'].transform(lambda x: (~x.duplicated(False)).sum())
# duplicates = unique - non-duplicates
df['duplicated'] = df['unique'] - non_dup

NB. I used an intermediate variable "non_dup" here for clarity but you can use a one-liner

output (with intermediate non_dup for clarity):

   Groups Names  unique  duplicated           non_dup
0      G1     A       4           3                 1
1      G1     A       4           3                 1
2      G1     B       4           3                 1
3      G1     B       4           3                 1
4      G1     C       4           3                 1
5      G1     C       4           3                 1
6      G1     C       4           3                 1
7      G1     D       4           3                 1
8      G2     A       3           0                 3
9      G2     B       3           0                 3
10     G2     C       3           0                 3
11     G3     A       1           1                 0
12     G3     A       1           1                 0
13     G4     F       2           1                 1
14     G4     F       2           1                 1
15     G4     E       2           1                 1
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading