Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How can I efficiently optimize the creation of conditional columns based on multiple columns in Pandas DataFrames?

I have a dataframe with over 20 thousand rows and need to create a column based on more than 10 conditions from four other columns.

Instead of writing multiple lines of .loc, I decided to create a function. To enhance the performance and readability of this function, I opted to group the necessary columns into named_tuples using the pandas itertuples method and encapsulate it within the functools cache function, significantly improving the performance of the created function, albeit at the cost of initially grouping the columns into tuples. Here’s a simplified sample of what I did:

import pandas as pd
from functools import cache

transform_counts = 0
@cache
def transform(row):
  global transform_counts
  transform_counts += 1
  if row.A == 1 and row.B == 23:
    return 33
  else:
    return 40

df = pd.DataFrame({'A': [1, 2, 3, 4, 1, 2], 'B': [23, 43, 89, 50, 23, 43]})
df['C'] = list(df[['A', 'B']].itertuples(index=False))
df['C'] = df['C'].map(transform)

In this sample the transform_counts was 4 and the dataframe looked like this:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel


    A   B   C
0   1   23  33
1   2   43  40
2   3   89  40
3   4   50  40
4   1   23  33
5   2   43  40

Is there any way to further optimize the code, either by reducing the transform_counts or by skipping the need to group the necessary columns into tuples?

>Solution :

Assuming your transform function is too complicated to use the methods given in this question, you could just use apply with a wrapper function to your cached function. This would save the itertuples loop. For example:

import pandas as pd
from functools import cache

transform_counts = 0
@cache
def transform(a, b):
  global transform_counts
  transform_counts += 1
  if a == 1 and b == 23:
    return 33
  else:
    return 40

df = pd.DataFrame({'A': [1, 2, 3, 4, 1, 2], 'B': [23, 43, 89, 50, 23, 43]})
df['C'] = df.apply(lambda row:transform(row.A, row.B), axis=1)
print(df, transform_counts, sep='\n\n')

Output:

   A   B   C
0  1  23  33
1  2  43  40
2  3  89  40
3  4  50  40
4  1  23  33
5  2  43  40

4
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading