I have 2 data frames with different lengths:
I want to merge these based on a common column (taskID) on both data frames. The Task ID has repeated IDs each ID represent an item belongs to the same task.
|task 45||2 mins|
|task 45||5 mins|
|task 45||7 mins|
|task 67||8 mins|
|task 67||9 mins|
when I use Pandas merge function I get an output with a length of 471437 which is greater than the length of both data frames.
I think this is because of the repeated values of TaskID, and I can’t remove the duplicates as each row represent different item.
merged=pd.merge(df1,df2, on=’TaskID’, how=’inner’)
I get an output with a length of 471437
Can you please help with this issue ?
You can use
groupby.cumcount to deduplicate the TaskIDs and merge in order:
df1.merge(df2, left_on=['TaskID', df1.groupby('TaskID').cumcount()], right_on=['TaskID', df2.groupby('TaskID').cumcount()])
TaskID key_1 task duration gender 0 task 45 0 2 mins male 1 task 45 1 5 mins male 2 task 45 2 7 mins male 3 task 67 0 8 mins female 4 task 67 1 9 mins female
If the gender is unique per TaskID, you can update in place with:
df1['gender'] = df1['TaskID'].map(df2.drop_duplicates('TaskID').set_index('TaskID')['gender'])