Removing duplicates based on matching column values with boolean indexing

March 29, 2022

After merging two DF’s I have the following dataset:

DB_ID	x_val	y_val
x01	405	407
x01	405	405
x02	308	306
x02	308	308
x03	658	658
x03	658	660
x04	None	658
x04	None	660
x05	658	660
x06	660	660

The y table contains multiple values for the left join variable (not included in table), resulting in multiple rows per unique DB_ID (string variable, not in df index).

The issue is that only one row is correct, where x_val and y_val match. I tried removing the duplicates with the following code:
df= df[~df['DB_ID'].duplicated() | combined['x_val'] != combined['y_val']]

This however doesn’t work. I am looking for a solution to achieve the following result:

DB_ID	x_val	y_val
x01	405	405
x02	308	308
x03	658	658
x04	None	658
x05	658	660
x06	660	660

>Solution :

Idea is compare both column for not equal, then sorting and reove duplicates by DB_ID:

df = (df.assign(new = df['x_val'].ne(df['y_val']))
        .sort_values(['DB_ID','new'])
        .drop_duplicates('DB_ID')
        .drop('new', axis=1))
print (df)
  DB_ID x_val y_val
1   x01   405   405
3   x02   308   308
4   x03   658   658
6   x04  None   658
8   x05   658   660
9   x06   660   660

If need equal NaNs or Nones use:

df = (df.assign(new = df['x_val'].fillna('same').ne(df['y_val'].fillna('same')))
        .sort_values(['DB_ID','new'])
        .drop_duplicates('DB_ID')
        .drop('new', axis=1))