Python: Filling a dataframe sequentially based on matching ids (inefficient code)


The tables and code at the bottom will probably help much more than the description.

I have a solution that works (see the bottom), but I think it's very inefficient.

Problem:

I have two dataframes, df_1 and df_2. Both dataframes have a matching column, match_id.

df_2 has a date column that I am trying to write into df_1.

Every match_id in df_2 exists in df_1.

I need the dates to be written sequentially into df_1 based on the match_id.

Importantly, though, I only want each row in df_2 to be used once in the matching process.

If there are not enough rows for a match_id in df_2 to fill all of its rows in df_1, then the remaining rows in df_1 should be left blank.

It’s much easier if I show it:

df_1:

index  match_id  date
0      45
1      45
2      45
3      45
4      46
5      46
6      47

df_2:

index  match_id  date
0      45        01/01/22
1      45        02/01/22
2      46        02/01/22
3      46        05/01/22

Output (updated df_1):

index  match_id  date
0      45        01/01/22
1      45        02/01/22
2      45
3      45
4      46        02/01/22
5      46        05/01/22
6      47

I have a solution that works, but I am certain there must be a much more time/resource efficient way of doing it (I'm still really new to Python and pretty new to coding), since in practice it will run over much larger datasets:

import pandas as pd

data_1 = [[45, ""], [45, ""], [45, ""], [45, ""], [46, ""], [46, ""], [47, ""]]
df_1 = pd.DataFrame(data_1, columns=['match_id', 'date'])

data_2 = [[45, "01/01/22"], [45, "02/01/22"], [46, "02/01/22"], [46, "05/01/22"]]
df_2 = pd.DataFrame(data_2, columns=['match_id', 'date'])



for i_df_1, r_df_1 in df_1.iterrows():
    for i_df_2, r_df_2 in df_2.iterrows():
        if r_df_1["match_id"] == r_df_2["match_id"]:

            # Add data into the payment transaction dataframe
            df_1.loc[i_df_1, "date"] = r_df_2["date"]

            # Drop the used row from df_2 so it does not get used again
            df_2 = df_2.drop(i_df_2)

            break

Solution:

You can compute an extra key with groupby.cumcount and use it in merge:

df_3 = (df_1
 #.drop(columns='date') # uncomment if df1 already has an empty date column
 .merge(df_2,
        left_on=['match_id', df_1.groupby('match_id').cumcount()],
        right_on=['match_id', df_2.groupby('match_id').cumcount()],
        how='left'
       )
 #.drop(columns='key_1') # uncomment if unwanted
)

output:

   match_id  key_1      date
0        45      0  01/01/22
1        45      1  02/01/22
2        45      2       NaN
3        45      3       NaN
4        46      0  02/01/22
5        46      1  05/01/22
6        47      0       NaN
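
If you want the dates written back into the existing df_1 rather than returned as a new frame, the same cumcount idea can be combined with a left merge and a plain column assignment. Below is a minimal sketch of that variant; the helper column name seq is my own choice and not part of the answer above:

import pandas as pd

data_1 = [[45, ""], [45, ""], [45, ""], [45, ""], [46, ""], [46, ""], [47, ""]]
df_1 = pd.DataFrame(data_1, columns=['match_id', 'date'])

data_2 = [[45, "01/01/22"], [45, "02/01/22"], [46, "02/01/22"], [46, "05/01/22"]]
df_2 = pd.DataFrame(data_2, columns=['match_id', 'date'])

# Number the rows within each match_id group on both sides, so the n-th
# occurrence of an id in df_1 pairs with the n-th occurrence in df_2.
merged = (
    df_1.assign(seq=df_1.groupby('match_id').cumcount())
        .drop(columns='date')
        .merge(
            df_2.assign(seq=df_2.groupby('match_id').cumcount()),
            on=['match_id', 'seq'],
            how='left',
        )
)

# A left merge keeps df_1's row order, so the dates can be assigned straight back.
# Unmatched rows come through as NaN; use .fillna("") if blanks are preferred.
df_1['date'] = merged['date'].values

print(df_1)

This avoids the nested iterrows loop entirely, which should scale far better on large datasets.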
