Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Copy values to next rows based on multiple conditions without a for loop

I have a dataframe with 3 columns. For the most part, column C is empty. There is only a value for each instance when the value in column A is not equal to the value in column A on the previous row. As example:

A     B             C
123   10-11-2023    100
123   10-11-2023
123   19-11-2023
123   27-11-2023
456   08-12-2023    80
456   10-12-2023
456   14-12-2023
789   28-10-2023    130
789   28-10-2023
789   28-10-2023
789   03-11-2023

Now I want to copy the value in column C for to all rows where the value in column A and B equal the value in column A and B of the row where column C is not empty. Then the result should be:

A     B             C
123   10-11-2023    100
123   10-11-2023    100
123   19-11-2023
123   27-11-2023
456   08-12-2023    80
456   10-12-2023
456   14-12-2023
789   28-10-2023    130
789   28-10-2023    130
789   28-10-2023    130
789   03-11-2023

Currently I have this as a for loop:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

non_empty_C = df0['C'].dropna()

for index, value in non_empty_C.iteritems():
    match_A = df0.loc[index, 'A']
    match_B = df0.loc[index, 'B']
    mask_match = (df0['A'] == match_A) & (df0['B'] == match_B)
    df0.loc[mask_match, 'C'] = df0['C'].ffill()

This works, but as I have a dataframe of up to 1 million rows it’s much too slow. I tried to rewrite this to vectorized operations, but I don’t exactly know how to do this, since I have the complexity of 2 conditions and comparing values to previous rows. Any suggestions?

>Solution :

You can actually achieve this without using any loops. The key is to leverage the groupby and transform methods provided by pandas. These functions make it very easy to apply a function to each group of values, in your case, filling missing 'C' values with the first available value in each ('A', 'B') group.

import pandas as pd

data = {'A': [123, 123, 123, 123, 456, 456, 456, 789, 789, 789, 789],
        'B': ['10-11-2023', '10-11-2023', '19-11-2023', '27-11-2023', '08-12-2023', '10-12-2023', '14-12-2023', '28-10-2023', '28-10-2023', '28-10-2023', '03-11-2023'],
        'C': [100, None, None, None, 80, None, None, 130, None, None, None]}

df = pd.DataFrame(data)

# Vectorized operation to fill missing values in C based on the first value in each (A, B) group
df['C'] = df.groupby(['A', 'B'])['C'].transform('first')

print(df)
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading