Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Compute avg gap between dates and max date for each group using pandas

I have a dataframe like as shown below

sub_id,teacher,div,pid,pos_date
 1,ABC,SCIENCE,A1,12/10/2021
 1,ABC,SCIENCE,A1,22/06/2019
 1,ABC,SCIENCE,A1,12/12/2018
 1,ABC,SCIENCE,A1,27/11/2020
 1,DEF,CHEMISTRY,A1,12/10/2021
 1,DEF,CHEMISTRY,A2,11/11/2018
 1,DEF,CHEMISTRY,A2,12/10/2021
 1,ABC,SCIENCE,A2,12/10/2019
 1,ABC,SCIENCE,A2,12/10/2020
 1,ABC,SCIENCE,A3,12/11/2021
 1,ABC,SCIENCE,A3,22/03/2022
 1,ABC,SCIENCE,A4,22/10/2021
 1,ABC,SCIENCE,A4,12/04/2021

 df = pd.read_clipboard()

I would like to do the below

a) Group by sub_id,teacher,div and pid

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

b) For each group, compute the below

       1)  Max(pos_date)
       2) Average gap between each pos_date
       3) Median gap between each pos_date
       4) No of records (count)

So, I tried the below

df.set_index(['sub_id','teacher','div','pid']).groupby(['sub_id','teacher','div','pid'])['pos_date'].max()
df.set_index(['sub_id','teacher','div','pid']).groupby(['sub_id','teacher','div','pid'])['pos_date'].average()
df.set_index(['sub_id','teacher','div','pid']).groupby(['sub_id','teacher','div','pid'])['pos_date'].median()
df.set_index(['sub_id','teacher','div','pid']).groupby(['sub_id','teacher','div','pid'])['pos_date'].size()

but the above is neither elegant nor efficient.

Is there any better way to write the above code?

I expect my output to be like in below format

enter image description here

>Solution :

Use groupby and agg:

# Ensure pos_date is DatetimeIndex
df['pos_date'] = pd.to_datetime(df['pos_date'], dayfirst=True)

out = (df.groupby(['sub_id','teacher','div','pid'])['pos_date']
         .agg(['max', 'mean', 'median', 'size']))

Output:

>>> out
                                    max                mean              median  size
sub_id teacher div       pid                                                         
1      ABC     SCIENCE   A1  2021-10-12 2020-04-10 18:00:00 2020-03-10 00:00:00     4
                         A2  2020-10-12 2020-04-12 00:00:00 2020-04-12 00:00:00     2
                         A3  2022-03-22 2022-01-16 00:00:00 2022-01-16 00:00:00     2
                         A4  2021-10-22 2021-07-17 12:00:00 2021-07-17 12:00:00     2
       DEF     CHEMISTRY A1  2021-10-12 2021-10-12 00:00:00 2021-10-12 00:00:00     1
                         A2  2021-10-12 2020-04-27 00:00:00 2020-04-27 00:00:00     2

Update

out = (df.groupby(['sub_id','teacher','div','pid'])['pos_date']
         .agg(**{'Max': 'max', 'Mean': lambda x: x.diff().mean().days,
                 'Median': lambda x: x.diff().median(), 'Size': 'size'}))

Output:

>>> out
                                    Max    Mean    Median  Size
sub_id teacher div       pid                                   
1      ABC     SCIENCE   A1  2021-10-12  -107.0 -192 days     4
                         A2  2020-10-12   366.0  366 days     2
                         A3  2022-03-22   130.0  130 days     2
                         A4  2021-10-22  -193.0 -193 days     2
       DEF     CHEMISTRY A1  2021-10-12     NaN       NaT     1
                         A2  2021-10-12  1066.0 1066 days     2
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading