Pandas column lost after getting mean value of time duration

I have a DataFrame; you can reproduce it by copying and running the following code:

import pandas as pd
from io import StringIO

df = """
 b_id          duration1                  duration2                          user  cases
 
 366           NaN                        38 days 22:05:06.807430            Test  4
 367           0 days 00:00:05.285239     NaN                                Test  4
 368           NaN                        NaN                                Test  4
 366           NaN                        38 days 22:05:06.807430            Test  4
 
 466           NaN                        38 days 22:05:06.807430            Tom   3
 467           0 days 00:00:05.285239     NaN                                Tom   3
 467           0 days 00:00:05.285239     NaN                                Tom   3


"""
# raw string so that '\s' is not treated as an invalid escape sequence
df = pd.read_csv(StringIO(df.strip()), sep=r'\s\s+', engine='python')
df

Then I use the following code to get the mean of duration1 and duration2 per user:

out = (df
   .set_index('user')
   .filter(like='duration')
   .apply(pd.to_timedelta)
   .groupby(level=0).mean()
   .reset_index()
 )

Output:

   user              duration1               duration2
0  Test 0 days 00:00:05.285239 38 days 22:05:06.807430
1   Tom 0 days 00:00:05.285239 38 days 22:05:06.807430

My question is how to keep the column 'cases' in the output. The correct output should look like this:

   user              duration1               duration2  cases
0  Test 0 days 00:00:05.285239 38 days 22:05:06.807430      4
1   Tom 0 days 00:00:05.285239 38 days 22:05:06.807430      3

Solution:

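The most straightforward approach, in my opinion, is to use both columns as the index/grouper. Since 'cases' is then part of the index rather than a filtered-out column, it survives the filter and comes back with reset_index, and you still benefit from the easy conversion to timedelta.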

out = (df
   .set_index(['user', 'cases'])
   .filter(like='duration')
   .apply(pd.to_timedelta)
   .groupby(level=[0,1]).mean()
   .reset_index()
 )

Output:

   user  cases              duration1               duration2
0  Test      4 0 days 00:00:05.285239 38 days 22:05:06.807430
1   Tom      3 0 days 00:00:05.285239 38 days 22:05:06.807430
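
As an alternative sketch (not part of the original answer), named aggregation lets you group by 'user' only and carry 'cases' along with "first", which also gives the column order asked for in the question. This assumes 'cases' is constant within each user, as it is in the sample data. The DataFrame is rebuilt by hand here instead of being parsed from the text block, and the name duration_cols is only illustrative.

```python
import pandas as pd

# Sample data from the question, rebuilt by hand (None marks the NaN cells)
df = pd.DataFrame({
    "b_id": [366, 367, 368, 366, 466, 467, 467],
    "duration1": [None, "0 days 00:00:05.285239", None, None,
                  None, "0 days 00:00:05.285239", "0 days 00:00:05.285239"],
    "duration2": ["38 days 22:05:06.807430", None, None,
                  "38 days 22:05:06.807430",
                  "38 days 22:05:06.807430", None, None],
    "user": ["Test"] * 4 + ["Tom"] * 3,
    "cases": [4, 4, 4, 4, 3, 3, 3],
})

# Convert the duration strings to timedelta; missing values become NaT
duration_cols = ["duration1", "duration2"]  # illustrative helper name
df[duration_cols] = df[duration_cols].apply(pd.to_timedelta)

# Group by user only; average the durations and keep the first 'cases'
# value per user (assumption: 'cases' does not vary within a user)
out = df.groupby("user", as_index=False).agg(
    duration1=("duration1", "mean"),
    duration2=("duration2", "mean"),
    cases=("cases", "first"),
)
print(out)
```

If 'cases' can differ within a user, grouping by both columns as in the answer above is the safer choice, because "first" would silently pick one value.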