I have 2 data frames with different lengths:
len(df1) 10104
len(df2) 15560
I want to merge these based on a common column (taskID) on both data frames. The Task ID has repeated IDs each ID represent an item belongs to the same task.
example:
df1:
| TaskID | task duration |
|---|---|
| task 45 | 2 mins |
| task 45 | 5 mins |
| task 45 | 7 mins |
| task 67 | 8 mins |
| task 67 | 9 mins |
df2:
| TaskID | gender |
|---|---|
| task 45 | male |
| task 45 | male |
| task 45 | male |
| task 67 | female |
| task 67 | female |
when I use Pandas merge function I get an output with a length of 471437 which is greater than the length of both data frames.
I think this is because of the repeated values of TaskID, and I can’t remove the duplicates as each row represent different item.
I tried:
merged=pd.merge(df1,df2, on=’TaskID’, how=’inner’)
I get an output with a length of 471437
Can you please help with this issue ?
>Solution :
You can use groupby.cumcount to deduplicate the TaskIDs and merge in order:
df1.merge(df2, left_on=['TaskID', df1.groupby('TaskID').cumcount()],
right_on=['TaskID', df2.groupby('TaskID').cumcount()])
Output:
TaskID key_1 task duration gender
0 task 45 0 2 mins male
1 task 45 1 5 mins male
2 task 45 2 7 mins male
3 task 67 0 8 mins female
4 task 67 1 9 mins female
If the gender is unique per TaskID, you can update in place with:
df1['gender'] = df1['TaskID'].map(df2.drop_duplicates('TaskID').set_index('TaskID')['gender'])