Easy way to do group by with multiple output in pandas

May 24, 2024

I am a long time SAS/SQL user and have always defaulted to using SQL for my groupbys

for example to do

select region
,case when age < 5 then 'Low'
when age >= 5 and age <= 10 then 'Middle'
else 'High' as duration
,sum(1) as total
,sum(profit) as profit
,sum(profit)/sum(1)  as avg_profit
,max(revenue) as max revenue
from table
where region not in ('A')
group by 
region,(case when age < 5 then 'Low'
when age >= 5 and age <= 10 then 'Middle'
else 'High)

I am trying to recreate the above in Pandas but I don’t know how to write it in as little code as possible as the above
Can anyone suggest an efficient way to write this in Pandas that doesn’t involve 5 merges and creating new columns beforehand?

>Solution :

You can use aggregation functions along with groupby and np.where. try this:

import pandas as pd
import numpy as np

# Assuming you have a DataFrame named 'df' with columns 'region', 'age', 'profit', and 'revenue'
condition1 = df['age'] < 5
condition2 = (df['age'] >= 5) & (df['age'] <= 10)

# Create the 'duration' column using np.where
df['duration'] = np.where(condition1, 'Low',
                          np.where(condition2, 'Middle', 'High'))

# Filter out the 'A' region
df = df[df['region'] != 'A']

# Group by 'region' and 'duration', and apply aggregations
result = df.groupby(['region', 'duration']).agg(
    total=('age', 'count'),
    profit=('profit', 'sum'),
    avg_profit=('profit', 'mean'),
    max_revenue=('revenue', 'max')
)

print(result)