I have a Pandas DataFrame:
| msg_id | identifier |
|---|---|
| 001 | Stackoverflow |
| 001 | Stackoverflow |
| 002 | Stackoverflow |
| 002 | Cross-Validated |
I want to drop duplicated `identifier` values within each unique `msg_id`, so the desired result looks like this:
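| msg_id | identifier |
|---|---|
| 001 | Stackoverflow |
| 002 | Stackoverflow |
| 002 | Cross-Validated |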
This is my current approach, which is super slow:
```python
import pandas as pd

acc_df = pd.DataFrame(columns=df.columns)
for _, group in df.groupby("msg_id"):
    # rows whose identifier already appeared earlier in this group
    dup = group[group.duplicated("identifier")]
    if len(dup) > 0:
        acc_df = pd.concat([dup, acc_df], axis=0, ignore_index=False)
acc_df
```
I have a very large dataset with 500 million rows. Even after filtering for `msg_id` values that have more than one `identifier`, the remaining row count is still very large.
I am looking for any vectorized or otherwise faster approach, NOT INCLUDING multiprocessing and threading.
> Solution:
You can use vectorized operations in Pandas rather than an explicit loop: mark every `(msg_id, identifier)` pair after its first occurrence with `duplicated`, then filter those rows out. This should be much faster than your current approach.
```python
import pandas as pd

data = {
    'msg_id': ['001', '001', '002', '002'],
    'identifier': ['Stackoverflow', 'Stackoverflow', 'Stackoverflow', 'Cross-Validated']
}
df = pd.DataFrame(data)

# optional: duplicated() does not require sorted data,
# but sorting keeps each msg_id's rows together
df.sort_values(['msg_id', 'identifier'], inplace=True)

# mark every (msg_id, identifier) pair after its first occurrence
df['is_duplicated'] = df.duplicated(subset=['msg_id', 'identifier'], keep='first')

# keep only the first occurrences and drop the helper column
result = df[~df['is_duplicated']].drop(columns=['is_duplicated'])
```
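For reference, Pandas' built-in `drop_duplicates` expresses the same operation in a single call. A minimal, self-contained sketch on the sample data, assuming you want to keep the first occurrence of each pair:

```python
import pandas as pd

df = pd.DataFrame({
    'msg_id': ['001', '001', '002', '002'],
    'identifier': ['Stackoverflow', 'Stackoverflow', 'Stackoverflow', 'Cross-Validated'],
})

# keep the first row of every (msg_id, identifier) pair, drop the rest
result = df.drop_duplicates(subset=['msg_id', 'identifier'], keep='first')
print(result)
#   msg_id       identifier
# 0    001    Stackoverflow
# 2    002    Stackoverflow
# 3    002  Cross-Validated
```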