Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Pandas Dataframe – Finding Duplicates of One Column But Different in Another Column

I have a Pandas dataframe, for example, like this:

idx A B
0 a1 b1
1 a2 b1
2 a2 b2
3 a2 b1
4 a3 b3
5 a3 b3
6 a4 b1

I want to find the duplicated values in Column A, but different values in Column B, and select all the indexes.

In above example, the results should be:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

idx A B
1 a2 b1
2 a2 b2
3 a2 b1
  • Drop idx 0 and 6, the values in Column A are unique.
  • Drop idx 4 and 5, because the values in Column B are the same.
  • I want to keep both idx 1 and 3 in the results, although they are the same, but they have a different value in idx 2 (not all the same).

How can I achieve this goal?

>Solution :

You can use two groupby.transform for boolean indexing:

g = df.groupby('A')['B']

# is A duplicated and are the duplicates non-unique?
out = df[g.transform('count').gt(1) & g.transform('nunique').gt(1)]

# the non-unique condition is however implying the duplication of A
# we can simplify to:
out = df[df.groupby('A')['B'].transform('nunique').gt(1)]

Or, with isin:

s = df.groupby('A')['B'].nunique()

out = df[df['A'].isin(s[s>1].index)]

Output:

   idx   A   B
1    1  a2  b1
2    2  a2  b2
3    3  a2  b1
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading