I have a dataframe with over 20 thousand rows and need to create a column based on more than 10 conditions from four other columns.
Instead of writing multiple lines of .loc, I decided to create a function. To enhance the performance and readability of this function, I opted to group the necessary columns into named_tuples using the pandas itertuples method and encapsulate it within the functools cache function, significantly improving the performance of the created function, albeit at the cost of initially grouping the columns into tuples. Here’s a simplified sample of what I did:
import pandas as pd
from functools import cache
transform_counts = 0
@cache
def transform(row):
global transform_counts
transform_counts += 1
if row.A == 1 and row.B == 23:
return 33
else:
return 40
df = pd.DataFrame({'A': [1, 2, 3, 4, 1, 2], 'B': [23, 43, 89, 50, 23, 43]})
df['C'] = list(df[['A', 'B']].itertuples(index=False))
df['C'] = df['C'].map(transform)
In this sample the transform_counts was 4 and the dataframe looked like this:
A B C
0 1 23 33
1 2 43 40
2 3 89 40
3 4 50 40
4 1 23 33
5 2 43 40
Is there any way to further optimize the code, either by reducing the transform_counts or by skipping the need to group the necessary columns into tuples?
>Solution :
Assuming your transform function is too complicated to use the methods given in this question, you could just use apply with a wrapper function to your cached function. This would save the itertuples loop. For example:
import pandas as pd
from functools import cache
transform_counts = 0
@cache
def transform(a, b):
global transform_counts
transform_counts += 1
if a == 1 and b == 23:
return 33
else:
return 40
df = pd.DataFrame({'A': [1, 2, 3, 4, 1, 2], 'B': [23, 43, 89, 50, 23, 43]})
df['C'] = df.apply(lambda row:transform(row.A, row.B), axis=1)
print(df, transform_counts, sep='\n\n')
Output:
A B C
0 1 23 33
1 2 43 40
2 3 89 40
3 4 50 40
4 1 23 33
5 2 43 40
4