I have a dataframe such as
Groups Names
G1 A
G1 A
G1 B
G1 B
G1 C
G1 C
G1 C
G1 D
G2 A
G2 B
G2 C
G3 A
G3 A
G4 F
G4 F
G4 E
For each group in Groups, I would like to count the number of Names values that are duplicated (i.e. appear at least twice) within the group, and add this information in a new column called Nb_duplicated. I would also like to add another column called Number_unique_names containing the number of unique Names values within each group.
I should then get:
Groups Names Nb_duplicated Number_unique_names
G1 A 3 4
G1 A 3 4
G1 B 3 4
G1 B 3 4
G1 C 3 4
G1 C 3 4
G1 C 3 4
G1 D 3 4
G2 A 0 3
G2 B 0 3
G2 C 0 3
G3 A 1 1
G3 A 1 1
G4 F 1 2
G4 F 1 2
G4 E 1 2
Solution:
You can compute the number of unique names and the number of non-duplicated names (with GroupBy.transform), then subtract the two to get the number of duplicated names:
# set up group
g = df.groupby('Groups')
# number of unique names per group
df['unique'] = g['Names'].transform('nunique')
# number of names that occur exactly once per group
non_dup = g['Names'].transform(lambda x: (~x.duplicated(keep=False)).sum())
# duplicates = unique - non-duplicates
df['duplicated'] = df['unique'] - non_dup
NB: I used an intermediate variable non_dup here for clarity, but you can use a one-liner.
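The one-liner could look like this (a sketch folding the intermediate into the subtraction, shown on a rebuilt copy of the example data):

```python
import pandas as pd

# rebuild the example data from the question
df = pd.DataFrame({
    'Groups': ['G1'] * 8 + ['G2'] * 3 + ['G3'] * 2 + ['G4'] * 3,
    'Names': list('AABBCCCD') + list('ABC') + list('AA') + list('FFE'),
})

g = df.groupby('Groups')
df['unique'] = g['Names'].transform('nunique')
# one-liner: duplicated = unique - names occurring exactly once in the group
df['duplicated'] = df['unique'] - g['Names'].transform(
    lambda x: (~x.duplicated(keep=False)).sum()
)
```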
Output (with the intermediate non_dup shown as a column for clarity):
Groups Names unique duplicated non_dup
0 G1 A 4 3 1
1 G1 A 4 3 1
2 G1 B 4 3 1
3 G1 B 4 3 1
4 G1 C 4 3 1
5 G1 C 4 3 1
6 G1 C 4 3 1
7 G1 D 4 3 1
8 G2 A 3 0 3
9 G2 B 3 0 3
10 G2 C 3 0 3
11 G3 A 1 1 0
12 G3 A 1 1 0
13 G4 F 2 1 1
14 G4 F 2 1 1
15 G4 E 2 1 1
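As a variant (not part of the answer above), you could count the duplicated names directly with value_counts and write the results straight into the column names the question asks for; a sketch on a rebuilt copy of the example data:

```python
import pandas as pd

# rebuild the example data from the question
df = pd.DataFrame({
    'Groups': ['G1'] * 8 + ['G2'] * 3 + ['G3'] * 2 + ['G4'] * 3,
    'Names': list('AABBCCCD') + list('ABC') + list('AA') + list('FFE'),
})

g = df.groupby('Groups')
df['Number_unique_names'] = g['Names'].transform('nunique')
# count names that appear more than once in each group
df['Nb_duplicated'] = g['Names'].transform(lambda x: x.value_counts().gt(1).sum())
```

The scalar returned by each lambda call is broadcast back to every row of its group by transform, which is what produces the per-row columns shown in the expected output.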