Python: Filling a dataframe sequentially based on matching ids (inefficient code)

The tables and code at the bottom will probably help much more than the description.

I have a solution that works (see the bottom), but I think it is very inefficient.

Problem:


I have two data frames, df_1 and df_2. These dataframes share a matching column, match_id.

df_2 has a date column that I am trying to write into df_1.

Every match_id in df_2 exists in df_1.

I need the dates to be written sequentially into df_1 based on the match_id.

Importantly, each row in df_2 should only be used once in the matching process.

If there are not enough rows for a match_id in df_2 to fill all the rows with that id in df_1, the remaining rows in df_1 should be left blank.

It's much easier if I show it:

df_1:

index  match_id  date
0      45
1      45
2      45
3      45
4      46
5      46
6      47

df_2:

index  match_id  date
0      45        01/01/22
1      45        02/01/22
2      46        02/01/22
3      46        05/01/22

Output (updated df_1):

index  match_id  date
0      45        01/01/22
1      45        02/01/22
2      45
3      45
4      46        02/01/22
5      46        05/01/22
6      47

I have a solution that works, but I am certain there must be a much more time/resource-efficient way of doing it (I am still really new to Python and pretty new to coding), since in practice it will run over much larger datasets:

import pandas as pd

data_1 = [[45, ""], [45, ""], [45, ""], [45, ""], [46, ""], [46, ""], [47, ""]]
df_1 = pd.DataFrame(data_1, columns=['match_id', 'date'])

data_2 = [[45, "01/01/22"], [45, "02/01/22"], [46, "02/01/22"], [46, "05/01/22"]]
df_2 = pd.DataFrame(data_2, columns=['match_id', 'date'])

for i_df_1, r_df_1 in df_1.iterrows():
    for i_df_2, r_df_2 in df_2.iterrows():
        if r_df_1["match_id"] == r_df_2["match_id"]:
            # Write the matched date into df_1
            df_1.loc[i_df_1, "date"] = r_df_2["date"]
            # Drop the used row from df_2 so it does not get used again
            df_2 = df_2.drop(i_df_2)
            break

Solution:

You can compute an extra key with groupby.cumcount and use it in merge:
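To see what that extra key looks like, cumcount numbers each repeated match_id from 0 within its group. A quick sketch on df_1's ids from the question (the variable names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'match_id': [45, 45, 45, 45, 46, 46, 47]})

# cumcount numbers each repeat of a match_id: 0, 1, 2, ... per group
key = df.groupby('match_id').cumcount()
print(key.tolist())  # [0, 1, 2, 3, 0, 1, 0]
```

Because the same counter is computed on df_2, the nth occurrence of an id in df_1 can only pair with the nth occurrence in df_2, which is exactly the "use each row once" requirement.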

df_3 = (df_1
        #.drop(columns='date')  # uncomment if df_1 already has an empty date column
        .merge(df_2,
               left_on=['match_id', df_1.groupby('match_id').cumcount()],
               right_on=['match_id', df_2.groupby('match_id').cumcount()],
               how='left')
        #.drop(columns='key_1')  # uncomment if the helper key column is unwanted
       )

output:

   match_id  key_1      date
0        45      0  01/01/22
1        45      1  02/01/22
2        45      2       NaN
3        45      3       NaN
4        46      0  02/01/22
5        46      1  05/01/22
6        47      0       NaN
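A self-contained sketch of the same approach, built on the sample data from the question; the final write-back into df_1's date column is my addition, not part of the original answer:

```python
import pandas as pd

# Sample data from the question (df_1 without its empty date column)
df_1 = pd.DataFrame({'match_id': [45, 45, 45, 45, 46, 46, 47]})
df_2 = pd.DataFrame({'match_id': [45, 45, 46, 46],
                     'date': ['01/01/22', '02/01/22', '02/01/22', '05/01/22']})

# Merge on (match_id, per-group occurrence counter) so each df_2 row is used once
merged = df_1.merge(
    df_2,
    left_on=['match_id', df_1.groupby('match_id').cumcount()],
    right_on=['match_id', df_2.groupby('match_id').cumcount()],
    how='left',
)

# Write the matched dates back into df_1; unmatched rows stay NaN
df_1['date'] = merged['date'].values
```

A left merge preserves df_1's row order, so the dates land on the correct rows and rows with no remaining df_2 match (index 2, 3, and 6 here) come back as NaN rather than raising an error.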