Finding Duplicated Values in Pandas Groupby Object

I have a Pandas DataFrame:

msg_id  identifier
001     Stackoverflow
001     Stackoverflow
002     Stackoverflow
002     Cross-Validated

I want to drop the duplicated identifier values for each unique value of msg_id.

This is my current approach, which is super slow:


import pandas as pd

# Accumulate the duplicated rows from each msg_id group.
acc_df = pd.DataFrame(columns=df.columns)
for _, group in df.groupby("msg_id"):
    # Rows whose identifier repeats an earlier row within this group
    dup = group[group.duplicated("identifier")]
    if len(dup) > 0:
        acc_df = pd.concat([dup, acc_df], axis=0, ignore_index=False)
acc_df

I have a very large dataset with 500 million rows. Even after filtering for msg_id values that have more than one identifier, the result is still very large.

I am looking for any vectorized or faster approach, NOT INCLUDING multiprocessing or threading.

Solution:

You can use Pandas' vectorized operations instead of an explicit loop, which should be considerably faster than your current approach.

import pandas as pd

data = {
    'msg_id': ['001', '001', '002', '002'],
    'identifier': ['Stackoverflow', 'Stackoverflow', 'Stackoverflow', 'Cross-Validated']
}
df = pd.DataFrame(data)

# Sorting is not required for duplicated(), but it keeps the output grouped by msg_id.
df.sort_values(['msg_id', 'identifier'], inplace=True)

# Flag every repeat of a (msg_id, identifier) pair after its first occurrence,
# then keep only the unflagged rows -- a single vectorized pass, no Python loop.
df['is_duplicated'] = df.duplicated(subset=['msg_id', 'identifier'], keep='first')
result = df[~df['is_duplicated']].drop(columns=['is_duplicated'])
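
Note that duplicated(keep='first') flags every repeat of a (msg_id, identifier) pair, so the inverted mask above keeps exactly one row per pair. Your original loop actually collected the duplicated rows rather than dropping them; if that is what you need, the same mask works without the inversion. A minimal sketch, reusing is_duplicated from the snippet above:

# Keep only the rows flagged as repeats -- this matches what the loop
# accumulated into acc_df.
dupes = df[df['is_duplicated']].drop(columns=['is_duplicated'])
print(dupes)  # here: the second ('001', 'Stackoverflow') row

If plain de-duplication is the goal, df.drop_duplicates(subset=['msg_id', 'identifier']) is an equivalent one-liner that avoids the helper column entirely.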