Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Removing Pandas duplicates with more complicated preference than first or last

Pandas .drop_duplicates lets us specify that we want to keep the first or last (or none) of the duplicates found. I have a more complicated condition. Let’s say I have a set of preferred values for a column. If a duplicate pair is found and one is in the preferred set, I want to keep that one, regardless of whether its first or last. If both are preferred, usual drop_duplicates behavior should apply. If neither is in preferred set, then again, usual drop_duplicates behavior should apply.

I’ve tried playing with masks but can’t seem to get it right. I think it might be the wrong way to go about it. Here’s what I’ve tried.

import pandas as pd


def conditional_remove_duplicates(df, preferred_tags):

    duplicates_mask = df.duplicated(subset=['id', 'val'], keep=False)
    preferred_mask = df['tag'].isin(preferred_tags)
    mask = duplicates_mask | preferred_mask
    df = df[mask].drop_duplicates(subset=['id', 'val'], keep='first')

    return df


data = {'id': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'D', 'D'],
        'val': [10, 10, 11, 10, 20, 20, 30, 40, 40],
        'tag': ['X', 'Z', 'X', 'Y', 'Z', 'X', 'X', 'Z', 'Z']}
preferred_tags = {'X', 'Y'}

df = pd.DataFrame(data)
print(df)
"""
  id  val tag
0  A   10   X
1  A   10   Z
2  A   11   X
3  A   10   Y
4  B   20   Z
5  B   20   X
6  C   30   X
7  D   40   Z
8  D   40   Z
"""
result_df = conditional_remove_duplicates(df, preferred_tags)
print(result_df)
""" Produces:
  id  val tag
0  A   10   X
2  A   11   X
4  B   20   Z
6  C   30   X
7  D   40   Z

Should be:
   id  val tag
0  A   10   X
2  A   11   X
5  B   20   X
6  C   30   X
7  D   40   Z
"""

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

You should create a column that indicates the priority you want to give to duplicates and sort based on that. They key here is to use mergesort so it’s stable and doesn’t swap order for current ties.

Then you can drop duplicates.

df['rank'] = df['tag'].isin(preferred_tags)
df = df.sort_values(by='rank', ascending=False, kind='mergesort')
df.drop_duplicates(['id', 'val'])

 id  val tag   rank
0  A   10   X   True
2  A   11   X   True
5  B   20   X   True
6  C   30   X   True
7  D   40   Z  False
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading