Advertisements
I have a Pandas dataframe, for example, like this:
idx | A | B |
---|---|---|
0 | a1 | b1 |
1 | a2 | b1 |
2 | a2 | b2 |
3 | a2 | b1 |
4 | a3 | b3 |
5 | a3 | b3 |
6 | a4 | b1 |
I want to find the duplicated values in Column A, but different values in Column B, and select all the indexes.
In above example, the results should be:
idx | A | B |
---|---|---|
1 | a2 | b1 |
2 | a2 | b2 |
3 | a2 | b1 |
- Drop idx 0 and 6, the values in Column A are unique.
- Drop idx 4 and 5, because the values in Column B are the same.
- I want to keep both idx 1 and 3 in the results, although they are the same, but they have a different value in idx 2 (not all the same).
How can I achieve this goal?
>Solution :
You can use two groupby.transform
for boolean indexing:
g = df.groupby('A')['B']
# is A duplicated and are the duplicates non-unique?
out = df[g.transform('count').gt(1) & g.transform('nunique').gt(1)]
# the non-unique condition is however implying the duplication of A
# we can simplify to:
out = df[df.groupby('A')['B'].transform('nunique').gt(1)]
Or, with isin
:
s = df.groupby('A')['B'].nunique()
out = df[df['A'].isin(s[s>1].index)]
Output:
idx A B
1 1 a2 b1
2 2 a2 b2
3 3 a2 b1