Advertisements

I have a Pandas dataframe, for example, like this:

idx | A | B |
---|---|---|

0 | a1 | b1 |

1 | a2 | b1 |

2 | a2 | b2 |

3 | a2 | b1 |

4 | a3 | b3 |

5 | a3 | b3 |

6 | a4 | b1 |

I want to find the duplicated values in Column A, but different values in Column B, and select all the indexes.

In above example, the results should be:

idx | A | B |
---|---|---|

1 | a2 | b1 |

2 | a2 | b2 |

3 | a2 | b1 |

- Drop idx 0 and 6, the values in Column A are unique.
- Drop idx 4 and 5, because the values in Column B are the same.
- I want to keep both idx 1 and 3 in the results, although they are the same, but they have a different value in idx 2 (not all the same).

How can I achieve this goal?

### >Solution :

You can use two `groupby.transform`

for boolean indexing:

```
g = df.groupby('A')['B']
# is A duplicated and are the duplicates non-unique?
out = df[g.transform('count').gt(1) & g.transform('nunique').gt(1)]
# the non-unique condition is however implying the duplication of A
# we can simplify to:
out = df[df.groupby('A')['B'].transform('nunique').gt(1)]
```

Or, with `isin`

:

```
s = df.groupby('A')['B'].nunique()
out = df[df['A'].isin(s[s>1].index)]
```

Output:

```
idx A B
1 1 a2 b1
2 2 a2 b2
3 3 a2 b1
```