Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Tricky max min groupby transformation using Pandas

I wish to groupby ID and find the min and max counts for each month.

Data

DATE        ID  name    
4/30/2023   AA  hi  
4/5/2023    AA  hi  
4/1/2023    AA  hi  
4/1/2023    AA  hello   
4/30/2023   AA  hello   
4/5/2023    AA  hello   
4/5/2023    AA  hey 
4/30/2023   AA  hey 
4/5/2023    AA  ok  
4/30/2023   AA  ok  
4/30/2023   AA  ok  
5/1/2023    AA  ok
5/1/2023    AA  hey
5/25/2023   AA  hi
4/1/2023    BB  hey 
4/2/2023    BB  hi  
4/2/2023    BB  hello   
        
        

Desired

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

ID  DATE        stat    count
AA  4/1/2023    min     2
AA  4/30/2023   max     5
AA  5/25/2023   min     1
AA  5/1/2023    max     2
BB  4/1/2023    min     1
BB  4/2/2023    max     2

Doing

result = df.groupby(['ID', 'DATE', 'name']).size().reset_index(name='count')
result['stat'] = result.groupby(['ID', 'DATE'])['count'].transform(lambda x: 'min' if x.idxmin() == x.idxmax() else 'max')

however this is not stating the dates. Any suggestion is appreciated.

>Solution :

This should do the trick (pun intended ;])

import pandas as pd
import numpy as np

data = [["4/30/2023", "AA", "hi"]] # Fill rest of data, this was for testing
df = pd.DataFrame(data, columns=['DATE', 'ID', 'name'])

# Convert 'DATE' column to datetime format and extract month and year
df['DATE'] = pd.to_datetime(df['DATE'])
df['month'] = df['DATE'].dt.month
df['year'] = df['DATE'].dt.year

# Group by 'ID', 'month', and 'year' and calculate the count of names
result = df.groupby(['ID', 'year', 'month', 'DATE'])['name'].size().reset_index(name='count')

# Find the min and max counts for each ID and month combination
result_min = result.groupby(['ID', 'year', 'month'])['count'].min().reset_index(name='min_count')
result_max = result.groupby(['ID', 'year', 'month'])['count'].max().reset_index(name='max_count')

# Merge the min and max counts with the original result DataFrame
result = result.merge(result_min, on=['ID', 'year', 'month']).merge(result_max, on=['ID', 'year', 'month'])

# Create a 'stat' column based on the min and max counts
result['stat'] = np.where(result['count'] == result['min_count'], 'min', 'max')

# Drop unnecessary columns and reset index
result = result.drop(columns=['min_count', 'max_count']).reset_index(drop=True)

# We use .to_string() so we can remove the pandas indexing on the left of the print
print(result[['ID', 'DATE', 'stat', 'count']].to_string(index=False))
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading