After merging two DF’s I have the following dataset:
| DB_ID | x_val | y_val |
|---|---|---|
| x01 | 405 | 407 |
| x01 | 405 | 405 |
| x02 | 308 | 306 |
| x02 | 308 | 308 |
| x03 | 658 | 658 |
| x03 | 658 | 660 |
| x04 | None | 658 |
| x04 | None | 660 |
| x05 | 658 | 660 |
| x06 | 660 | 660 |
The y table contains multiple values for the left join variable (not included in table), resulting in multiple rows per unique DB_ID (string variable, not in df index).
The issue is that only one row is correct, where x_val and y_val match. I tried removing the duplicates with the following code:
df= df[~df['DB_ID'].duplicated() | combined['x_val'] != combined['y_val']]
This however doesn’t work. I am looking for a solution to achieve the following result:
| DB_ID | x_val | y_val |
|---|---|---|
| x01 | 405 | 405 |
| x02 | 308 | 308 |
| x03 | 658 | 658 |
| x04 | None | 658 |
| x05 | 658 | 660 |
| x06 | 660 | 660 |
>Solution :
Idea is compare both column for not equal, then sorting and reove duplicates by DB_ID:
df = (df.assign(new = df['x_val'].ne(df['y_val']))
.sort_values(['DB_ID','new'])
.drop_duplicates('DB_ID')
.drop('new', axis=1))
print (df)
DB_ID x_val y_val
1 x01 405 405
3 x02 308 308
4 x03 658 658
6 x04 None 658
8 x05 658 660
9 x06 660 660
If need equal NaNs or Nones use:
df = (df.assign(new = df['x_val'].fillna('same').ne(df['y_val'].fillna('same')))
.sort_values(['DB_ID','new'])
.drop_duplicates('DB_ID')
.drop('new', axis=1))