I have a Pandas DataFrame:
| msg_id | identifier |
|---|---|
| 001 | Stackoverflow |
| 001 | Stackoverflow |
| 002 | Stackoverflow |
| 002 | Cross-Validated |
I want to drop duplicated `identifier` values within each unique `msg_id`, so the desired result looks like this:
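| msg_id | identifier |
|---|---|
| 001 | Stackoverflow |
| 002 | Stackoverflow |
| 002 | Cross-Validated |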
This is my current approach, which is super slow:
```python
import pandas as pd

acc_df = pd.DataFrame(columns=df.columns)
for _, group in df.groupby("msg_id"):
    # rows whose identifier already appeared earlier in this group
    dup = group[group.duplicated("identifier")]
    if len(dup) > 0:
        acc_df = pd.concat([dup, acc_df], axis=0, ignore_index=False)
acc_df
```
I have a very large dataset with 500 million rows. Even after filtering for `msg_id` values that have more than one `identifier`, the remaining row count is still very large.
I am looking for any vectorized or otherwise faster approach, NOT INCLUDING multiprocessing and threading.
> Solution:
You can use vectorized operations in Pandas rather than an explicit loop: mark every `(msg_id, identifier)` pair after its first occurrence with `duplicated`, then filter those rows out. This should be much faster than your current approach.
```python
import pandas as pd

data = {
    'msg_id': ['001', '001', '002', '002'],
    'identifier': ['Stackoverflow', 'Stackoverflow', 'Stackoverflow', 'Cross-Validated']
}
df = pd.DataFrame(data)

# optional: duplicated() does not require sorted data,
# but sorting keeps each msg_id's rows together
df.sort_values(['msg_id', 'identifier'], inplace=True)

# mark every (msg_id, identifier) pair after its first occurrence
df['is_duplicated'] = df.duplicated(subset=['msg_id', 'identifier'], keep='first')

# keep only the first occurrences and drop the helper column
result = df[~df['is_duplicated']].drop(columns=['is_duplicated'])
```
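For reference, Pandas' built-in `drop_duplicates` expresses the same operation in a single call. A minimal, self-contained sketch on the sample data, assuming you want to keep the first occurrence of each pair:

```python
import pandas as pd

df = pd.DataFrame({
    'msg_id': ['001', '001', '002', '002'],
    'identifier': ['Stackoverflow', 'Stackoverflow', 'Stackoverflow', 'Cross-Validated'],
})

# keep the first row of every (msg_id, identifier) pair, drop the rest
result = df.drop_duplicates(subset=['msg_id', 'identifier'], keep='first')
print(result)
#   msg_id       identifier
# 0    001    Stackoverflow
# 2    002    Stackoverflow
# 3    002  Cross-Validated
```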