I have a Pandas dataframe, for example, like this:
| idx | A | B |
|---|---|---|
| 0 | a1 | b1 |
| 1 | a2 | b1 |
| 2 | a2 | b2 |
| 3 | a2 | b1 |
| 4 | a3 | b3 |
| 5 | a3 | b3 |
| 6 | a4 | b1 |
I want to find the duplicated values in Column A, but different values in Column B, and select all the indexes.
In above example, the results should be:
| idx | A | B |
|---|---|---|
| 1 | a2 | b1 |
| 2 | a2 | b2 |
| 3 | a2 | b1 |
- Drop idx 0 and 6, the values in Column A are unique.
- Drop idx 4 and 5, because the values in Column B are the same.
- I want to keep both idx 1 and 3 in the results, although they are the same, but they have a different value in idx 2 (not all the same).
How can I achieve this goal?
>Solution :
You can use two groupby.transform for boolean indexing:
g = df.groupby('A')['B']
# is A duplicated and are the duplicates non-unique?
out = df[g.transform('count').gt(1) & g.transform('nunique').gt(1)]
# the non-unique condition is however implying the duplication of A
# we can simplify to:
out = df[df.groupby('A')['B'].transform('nunique').gt(1)]
Or, with isin:
s = df.groupby('A')['B'].nunique()
out = df[df['A'].isin(s[s>1].index)]
Output:
idx A B
1 1 a2 b1
2 2 a2 b2
3 3 a2 b1