Merging two data frames based on a common column with repeated values

I have 2 data frames with different lengths:
len(df1) 10104
len(df2) 15560

I want to merge these based on a common column (taskID) on both data frames. The Task ID has repeated IDs each ID represent an item belongs to the same task.

example:

df1:

TaskID task duration
task 45 2 mins
task 45 5 mins
task 45 7 mins
task 67 8 mins
task 67 9 mins

df2:

TaskID gender
task 45 male
task 45 male
task 45 male
task 67 female
task 67 female

when I use Pandas merge function I get an output with a length of 471437 which is greater than the length of both data frames.

I think this is because of the repeated values of TaskID, and I can’t remove the duplicates as each row represent different item.

I tried:

merged=pd.merge(df1,df2, on=’TaskID’, how=’inner’)

I get an output with a length of 471437

Can you please help with this issue ?

>Solution :

You can use groupby.cumcount to deduplicate the TaskIDs and merge in order:

df1.merge(df2, left_on=['TaskID', df1.groupby('TaskID').cumcount()],
               right_on=['TaskID', df2.groupby('TaskID').cumcount()])

Output:

    TaskID  key_1 task duration  gender
0  task 45      0        2 mins    male
1  task 45      1        5 mins    male
2  task 45      2        7 mins    male
3  task 67      0        8 mins  female
4  task 67      1        9 mins  female

If the gender is unique per TaskID, you can update in place with:

df1['gender'] = df1['TaskID'].map(df2.drop_duplicates('TaskID').set_index('TaskID')['gender'])

Leave a Reply