I have a pandas DataFrame in which I used `groupby.ngroup()` to label groups of related rows (basically duplicated data, but not exact duplicates, because that would have been too easy…).
| DisID | BunchData | GroupID |
|---|---|---|
| 1000 | xyz | 1 |
| 2012 | abc | 2 |
| 2014 | abc | 2 |
| 3000 | def | 3 |
I am trying to figure out how to remove the row with the minimum `DisID` within each `GroupID`, but only when that group contains more than one row. In this case, the output would look like:
| DisID | BunchData | GroupID |
|---|---|---|
| 1000 | xyz | 1 |
| 2014 | abc | 2 |
| 3000 | def | 3 |
Thanks!
>Solution :
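Sort by `DisID` with `sort_values`, then call `drop_duplicates` on `GroupID` with `keep='last'`, so that only the row with the largest `DisID` in each group is kept: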
df = df.sort_values('DisID').drop_duplicates(['GroupID'], keep='last')
Out[170]:
DisID BunchData GroupID
0 1000 xyz 1
2 2014 abc 2
3 3000 def 3
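
One thing worth noting: `drop_duplicates(..., keep='last')` keeps a single row per group, which matches the question only when a group has at most two rows. If a group could hold three or more rows and only the minimum should go, a mask built with `groupby(...).transform` is one possible alternative. The sketch below is not part of the original answer; the helper names `keep_max_per_group` and `drop_min_per_group` and the three-row frame `df3` are made up for illustration.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "DisID": [1000, 2012, 2014, 3000],
    "BunchData": ["xyz", "abc", "abc", "def"],
    "GroupID": [1, 2, 2, 3],
})


def keep_max_per_group(frame):
    """Approach from the answer: keep only the largest DisID per GroupID."""
    return frame.sort_values("DisID").drop_duplicates(["GroupID"], keep="last")


def drop_min_per_group(frame):
    """Hypothetical alternative: drop only the minimum DisID of each group
    that has more than one row. If several rows tie for the minimum,
    all of the tied rows are dropped."""
    grouped = frame.groupby("GroupID")["DisID"]
    group_size = grouped.transform("size")  # rows in each row's group
    group_min = grouped.transform("min")    # smallest DisID in each row's group
    is_min_of_multi = (group_size > 1) & (frame["DisID"] == group_min)
    return frame[~is_min_of_multi]


# On the question's data both approaches agree.
print(keep_max_per_group(df))
print(drop_min_per_group(df))

# With a three-row group they differ: the first keeps only 2014 in group 2,
# the second keeps 2013 and 2014.
df3 = pd.DataFrame({
    "DisID": [1000, 2012, 2013, 2014, 3000],
    "BunchData": ["xyz", "abc", "abc", "abc", "def"],
    "GroupID": [1, 2, 2, 2, 3],
})
print(keep_max_per_group(df3))
print(drop_min_per_group(df3))
```

For the data in the question, either version gives the three rows shown above, so the one-liner is the simpler choice when groups never exceed two rows.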