I am a long time SAS/SQL user and have always defaulted to using SQL for my groupbys
for example to do
select region
,case when age < 5 then 'Low'
when age >= 5 and age <= 10 then 'Middle'
else 'High' as duration
,sum(1) as total
,sum(profit) as profit
,sum(profit)/sum(1) as avg_profit
,max(revenue) as max revenue
from table
where region not in ('A')
group by
region,(case when age < 5 then 'Low'
when age >= 5 and age <= 10 then 'Middle'
else 'High)
I am trying to recreate the above in Pandas but I don’t know how to write it in as little code as possible as the above
Can anyone suggest an efficient way to write this in Pandas that doesn’t involve 5 merges and creating new columns beforehand?
>Solution :
You can use aggregation functions along with groupby and np.where. try this:
import pandas as pd
import numpy as np
# Assuming you have a DataFrame named 'df' with columns 'region', 'age', 'profit', and 'revenue'
condition1 = df['age'] < 5
condition2 = (df['age'] >= 5) & (df['age'] <= 10)
# Create the 'duration' column using np.where
df['duration'] = np.where(condition1, 'Low',
np.where(condition2, 'Middle', 'High'))
# Filter out the 'A' region
df = df[df['region'] != 'A']
# Group by 'region' and 'duration', and apply aggregations
result = df.groupby(['region', 'duration']).agg(
total=('age', 'count'),
profit=('profit', 'sum'),
avg_profit=('profit', 'mean'),
max_revenue=('revenue', 'max')
)
print(result)