I have a dataframe with 3 columns. For the most part, column C is empty. There is only a value for each instance when the value in column A is not equal to the value in column A on the previous row. As example:
A B C
123 10-11-2023 100
123 10-11-2023
123 19-11-2023
123 27-11-2023
456 08-12-2023 80
456 10-12-2023
456 14-12-2023
789 28-10-2023 130
789 28-10-2023
789 28-10-2023
789 03-11-2023
Now I want to copy the value in column C for to all rows where the value in column A and B equal the value in column A and B of the row where column C is not empty. Then the result should be:
A B C
123 10-11-2023 100
123 10-11-2023 100
123 19-11-2023
123 27-11-2023
456 08-12-2023 80
456 10-12-2023
456 14-12-2023
789 28-10-2023 130
789 28-10-2023 130
789 28-10-2023 130
789 03-11-2023
Currently I have this as a for loop:
non_empty_C = df0['C'].dropna()
for index, value in non_empty_C.iteritems():
match_A = df0.loc[index, 'A']
match_B = df0.loc[index, 'B']
mask_match = (df0['A'] == match_A) & (df0['B'] == match_B)
df0.loc[mask_match, 'C'] = df0['C'].ffill()
This works, but as I have a dataframe of up to 1 million rows it’s much too slow. I tried to rewrite this to vectorized operations, but I don’t exactly know how to do this, since I have the complexity of 2 conditions and comparing values to previous rows. Any suggestions?
>Solution :
You can actually achieve this without using any loops. The key is to leverage the groupby and transform methods provided by pandas. These functions make it very easy to apply a function to each group of values, in your case, filling missing 'C' values with the first available value in each ('A', 'B') group.
import pandas as pd
data = {'A': [123, 123, 123, 123, 456, 456, 456, 789, 789, 789, 789],
'B': ['10-11-2023', '10-11-2023', '19-11-2023', '27-11-2023', '08-12-2023', '10-12-2023', '14-12-2023', '28-10-2023', '28-10-2023', '28-10-2023', '03-11-2023'],
'C': [100, None, None, None, 80, None, None, 130, None, None, None]}
df = pd.DataFrame(data)
# Vectorized operation to fill missing values in C based on the first value in each (A, B) group
df['C'] = df.groupby(['A', 'B'])['C'].transform('first')
print(df)