Removing duplicates based on a condition in pandas

November 19, 2021

When removing duplicates, can I keep the rows that match a condition? Instead of doing:

df.drop_duplicates(subset=['x','y'], keep='first')

do something like:

df.drop_duplicates(subset=['x','y'], keep=df.loc[df[column]=='String'])

Suppose I have a df like:

A  B

1  'Hi'
1  'Bye'

Keep the rows with 'Hi'. I want to do it this way because it would be handier, since I am going to introduce multiple conditions in the process.

Solution:

Use DataFrame.duplicated, invert its mask with ~, and chain it to the condition with & (bitwise AND):

df['mask'] = ~df.duplicated(subset=['A','B']) & (df['B']=='Hi')
print (df)
   A    B   mask
0  1   Hi   True
1  1  Bye  False
2  1   Hi  False
3  1  Bye  False
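
The answer stores the mask as a column but does not show the input frame or the filtering step. Here is a minimal sketch of both, assuming a four-row frame reconstructed from the printed output above (the data and the names `mask` and `result` are assumptions, not part of the original answer):

```python
import pandas as pd

# Assumed sample data, rebuilt from the printed output; the answer does not show it.
df = pd.DataFrame({"A": [1, 1, 1, 1], "B": ["Hi", "Bye", "Hi", "Bye"]})

# True only for the first occurrence of each (A, B) pair that also has B == 'Hi'.
mask = ~df.duplicated(subset=["A", "B"]) & (df["B"] == "Hi")

# Boolean indexing keeps the rows where the mask is True (here, only row 0).
result = df[mask]
print(result)
```

Keeping the mask in a separate variable instead of a `mask` column leaves the original frame unchanged, so the same frame can be filtered again with a different condition.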

Tested with a duplicated index, and it works as expected:

df.index = [0] * 4

df['mask'] = ~df.duplicated(subset=['A','B']) & (df['B']=='Hi')
print (df)
   A    B   mask
0  1   Hi   True
0  1  Bye  False
0  1   Hi  False
0  1  Bye  False
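
Since the question mentions multiple conditions, here is a hedged sketch of how they might be combined. Column `C`, its values, and the threshold are hypothetical, invented only for illustration. It also contrasts the mask approach with filtering first and deduplicating afterwards, which can give a different result when the first occurrence of a duplicate group fails the condition:

```python
import pandas as pd

# Hypothetical data; column C and its values are invented for illustration.
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2],
    "B": ["Hi", "Bye", "Hi", "Hi", "Hi"],
    "C": [10, 20, 30, 5, 40],
})

# Each comparison needs its own parentheses because & binds tighter than == and >.
condition = (df["B"] == "Hi") & (df["C"] > 8)

# Approach from the answer: first occurrence of each (A, B) pair AND the condition.
first_seen = ~df.duplicated(subset=["A", "B"])
result = df[first_seen & condition]

# Alternative: apply the condition first, then drop duplicates among the survivors.
filtered_first = df[condition].drop_duplicates(subset=["A", "B"])

print(result.index.tolist())          # [0]
print(filtered_first.index.tolist())  # [0, 4]
```

In this sketch the group (2, 'Hi') disappears under the mask approach because its first row has C == 5, while filtering first keeps row 4. Which behaviour is wanted depends on the use case.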