Pandas Dataframe – Finding Duplicates of One Column But Different in Another Column

Advertisements

I have a Pandas dataframe, for example, like this:

idx A B
0 a1 b1
1 a2 b1
2 a2 b2
3 a2 b1
4 a3 b3
5 a3 b3
6 a4 b1

I want to find the duplicated values in Column A, but different values in Column B, and select all the indexes.

In above example, the results should be:

idx A B
1 a2 b1
2 a2 b2
3 a2 b1
  • Drop idx 0 and 6, the values in Column A are unique.
  • Drop idx 4 and 5, because the values in Column B are the same.
  • I want to keep both idx 1 and 3 in the results, although they are the same, but they have a different value in idx 2 (not all the same).

How can I achieve this goal?

>Solution :

You can use two groupby.transform for boolean indexing:

g = df.groupby('A')['B']

# is A duplicated and are the duplicates non-unique?
out = df[g.transform('count').gt(1) & g.transform('nunique').gt(1)]

# the non-unique condition is however implying the duplication of A
# we can simplify to:
out = df[df.groupby('A')['B'].transform('nunique').gt(1)]

Or, with isin:

s = df.groupby('A')['B'].nunique()

out = df[df['A'].isin(s[s>1].index)]

Output:

   idx   A   B
1    1  a2  b1
2    2  a2  b2
3    3  a2  b1

Leave a ReplyCancel reply