I wish to groupby ID and find the min and max counts for each month.
Data
DATE ID name
4/30/2023 AA hi
4/5/2023 AA hi
4/1/2023 AA hi
4/1/2023 AA hello
4/30/2023 AA hello
4/5/2023 AA hello
4/5/2023 AA hey
4/30/2023 AA hey
4/5/2023 AA ok
4/30/2023 AA ok
4/30/2023 AA ok
5/1/2023 AA ok
5/1/2023 AA hey
5/25/2023 AA hi
4/1/2023 BB hey
4/2/2023 BB hi
4/2/2023 BB hello
Desired
ID DATE stat count
AA 4/1/2023 min 2
AA 4/30/2023 max 5
AA 5/25/2023 min 1
AA 5/1/2023 max 2
BB 4/1/2023 min 1
BB 4/2/2023 max 2
Doing
result = df.groupby(['ID', 'DATE', 'name']).size().reset_index(name='count')
result['stat'] = result.groupby(['ID', 'DATE'])['count'].transform(lambda x: 'min' if x.idxmin() == x.idxmax() else 'max')
however this is not stating the dates. Any suggestion is appreciated.
>Solution :
This should do the trick (pun intended ;])
import pandas as pd
import numpy as np
data = [["4/30/2023", "AA", "hi"]] # Fill rest of data, this was for testing
df = pd.DataFrame(data, columns=['DATE', 'ID', 'name'])
# Convert 'DATE' column to datetime format and extract month and year
df['DATE'] = pd.to_datetime(df['DATE'])
df['month'] = df['DATE'].dt.month
df['year'] = df['DATE'].dt.year
# Group by 'ID', 'month', and 'year' and calculate the count of names
result = df.groupby(['ID', 'year', 'month', 'DATE'])['name'].size().reset_index(name='count')
# Find the min and max counts for each ID and month combination
result_min = result.groupby(['ID', 'year', 'month'])['count'].min().reset_index(name='min_count')
result_max = result.groupby(['ID', 'year', 'month'])['count'].max().reset_index(name='max_count')
# Merge the min and max counts with the original result DataFrame
result = result.merge(result_min, on=['ID', 'year', 'month']).merge(result_max, on=['ID', 'year', 'month'])
# Create a 'stat' column based on the min and max counts
result['stat'] = np.where(result['count'] == result['min_count'], 'min', 'max')
# Drop unnecessary columns and reset index
result = result.drop(columns=['min_count', 'max_count']).reset_index(drop=True)
# We use .to_string() so we can remove the pandas indexing on the left of the print
print(result[['ID', 'DATE', 'stat', 'count']].to_string(index=False))