Follow

Follow

Contact

Home Compute avg gap between dates and max date for each group using pandas

Questions

Compute avg gap between dates and max date for each group using pandas

byMR

July 3, 2022

I have a dataframe like as shown below

sub_id,teacher,div,pid,pos_date
 1,ABC,SCIENCE,A1,12/10/2021
 1,ABC,SCIENCE,A1,22/06/2019
 1,ABC,SCIENCE,A1,12/12/2018
 1,ABC,SCIENCE,A1,27/11/2020
 1,DEF,CHEMISTRY,A1,12/10/2021
 1,DEF,CHEMISTRY,A2,11/11/2018
 1,DEF,CHEMISTRY,A2,12/10/2021
 1,ABC,SCIENCE,A2,12/10/2019
 1,ABC,SCIENCE,A2,12/10/2020
 1,ABC,SCIENCE,A3,12/11/2021
 1,ABC,SCIENCE,A3,22/03/2022
 1,ABC,SCIENCE,A4,22/10/2021
 1,ABC,SCIENCE,A4,12/04/2021

 df = pd.read_clipboard()

I would like to do the below

a) Group by sub_id,teacher,div and pid

b) For each group, compute the below

       1)  Max(pos_date)
       2) Average gap between each pos_date
       3) Median gap between each pos_date
       4) No of records (count)

So, I tried the below

df.set_index(['sub_id','teacher','div','pid']).groupby(['sub_id','teacher','div','pid'])['pos_date'].max()
df.set_index(['sub_id','teacher','div','pid']).groupby(['sub_id','teacher','div','pid'])['pos_date'].average()
df.set_index(['sub_id','teacher','div','pid']).groupby(['sub_id','teacher','div','pid'])['pos_date'].median()
df.set_index(['sub_id','teacher','div','pid']).groupby(['sub_id','teacher','div','pid'])['pos_date'].size()

but the above is neither elegant nor efficient.

Is there any better way to write the above code?

I expect my output to be like in below format

>Solution :

Use groupby and agg:

# Ensure pos_date is DatetimeIndex
df['pos_date'] = pd.to_datetime(df['pos_date'], dayfirst=True)

out = (df.groupby(['sub_id','teacher','div','pid'])['pos_date']
         .agg(['max', 'mean', 'median', 'size']))

Output:

>>> out
                                    max                mean              median  size
sub_id teacher div       pid                                                         
1      ABC     SCIENCE   A1  2021-10-12 2020-04-10 18:00:00 2020-03-10 00:00:00     4
                         A2  2020-10-12 2020-04-12 00:00:00 2020-04-12 00:00:00     2
                         A3  2022-03-22 2022-01-16 00:00:00 2022-01-16 00:00:00     2
                         A4  2021-10-22 2021-07-17 12:00:00 2021-07-17 12:00:00     2
       DEF     CHEMISTRY A1  2021-10-12 2021-10-12 00:00:00 2021-10-12 00:00:00     1
                         A2  2021-10-12 2020-04-27 00:00:00 2020-04-27 00:00:00     2

Update

out = (df.groupby(['sub_id','teacher','div','pid'])['pos_date']
         .agg(**{'Max': 'max', 'Mean': lambda x: x.diff().mean().days,
                 'Median': lambda x: x.diff().median(), 'Size': 'size'}))

Output:

>>> out
                                    Max    Mean    Median  Size
sub_id teacher div       pid                                   
1      ABC     SCIENCE   A1  2021-10-12  -107.0 -192 days     4
                         A2  2020-10-12   366.0  366 days     2
                         A3  2022-03-22   130.0  130 days     2
                         A4  2021-10-22  -193.0 -193 days     2
       DEF     CHEMISTRY A1  2021-10-12     NaN       NaT     1
                         A2  2021-10-12  1066.0 1066 days     2

pandas-groupby

byMR

Published July 03, 2022

Add a comment

Leave a ReplyCancel reply

Read more

Questions

The cause of resubmit form in browser

byMR

July 3, 2022

Questions

Input values are coming out as undefined?

byMR

July 3, 2022

Questions

Change text in div while css does animation

byMR

July 3, 2022

Questions

Error when using react context – Cannot destructure property

byMR

July 3, 2022

Questions

How to get an average mean of columns Flask/Pandas

byMR

July 3, 2022

Questions

Pre-commit Docker: .git folder removed after docker-compose build

byMR

July 3, 2022