Home Removing Pandas duplicates with more complicated preference than first or last

Questions

Removing Pandas duplicates with more complicated preference than first or last

March 20, 2024

Pandas .drop_duplicates lets us specify that we want to keep the first or last (or none) of the duplicates found. I have a more complicated condition. Let’s say I have a set of preferred values for a column. If a duplicate pair is found and one is in the preferred set, I want to keep that one, regardless of whether its first or last. If both are preferred, usual drop_duplicates behavior should apply. If neither is in preferred set, then again, usual drop_duplicates behavior should apply.

I’ve tried playing with masks but can’t seem to get it right. I think it might be the wrong way to go about it. Here’s what I’ve tried.

import pandas as pd


def conditional_remove_duplicates(df, preferred_tags):

    duplicates_mask = df.duplicated(subset=['id', 'val'], keep=False)
    preferred_mask = df['tag'].isin(preferred_tags)
    mask = duplicates_mask | preferred_mask
    df = df[mask].drop_duplicates(subset=['id', 'val'], keep='first')

    return df


data = {'id': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'D', 'D'],
        'val': [10, 10, 11, 10, 20, 20, 30, 40, 40],
        'tag': ['X', 'Z', 'X', 'Y', 'Z', 'X', 'X', 'Z', 'Z']}
preferred_tags = {'X', 'Y'}

df = pd.DataFrame(data)
print(df)
"""
  id  val tag
0  A   10   X
1  A   10   Z
2  A   11   X
3  A   10   Y
4  B   20   Z
5  B   20   X
6  C   30   X
7  D   40   Z
8  D   40   Z
"""
result_df = conditional_remove_duplicates(df, preferred_tags)
print(result_df)
""" Produces:
  id  val tag
0  A   10   X
2  A   11   X
4  B   20   Z
6  C   30   X
7  D   40   Z

Should be:
   id  val tag
0  A   10   X
2  A   11   X
5  B   20   X
6  C   30   X
7  D   40   Z
"""

>Solution :

You should create a column that indicates the priority you want to give to duplicates and sort based on that. They key here is to use mergesort so it’s stable and doesn’t swap order for current ties.

Then you can drop duplicates.

df['rank'] = df['tag'].isin(preferred_tags)
df = df.sort_values(by='rank', ascending=False, kind='mergesort')
df.drop_duplicates(['id', 'val'])

 id  val tag   rank
0  A   10   X   True
2  A   11   X   True
5  B   20   X   True
6  C   30   X   True
7  D   40   Z  False