Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Calculating the differences between unique values in the column and averaging by number of unique values with grouping by another column in Python

I have a dataframe and need to do some custom calculations.

Protocol    Visit   Day
111 1   1
111 1   1
111 1   1
111 2   15
111 2   15
111 2   15
111 3   29
111 3   29
222 2   14
222 2   14
222 2   14
222 3   28
222 3   28
222 3   28

Reproducible input:

{'Protocol ': {0: 111, 1: 111, 2: 111, 3: 111, 4: 111, 5: 111, 6: 111, 7: 111, 8: 222, 9: 222, 10: 222, 11: 222, 12: 222, 13: 222, 14: 222}, 'Visit': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 1, 9: 2, 10: 2, 11: 2, 12: 3, 13: 3, 14: 3}, 'Day': {0: 1, 1: 1, 2: 1, 3: 15, 4: 15, 5: 15, 6: 29, 7: 29, 8: 1, 9: 14, 10: 14, 11: 14, 12: 28, 13: 28, 14: 28}}

What do I want:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

  • Calculate the difference between unique values between values in Day column in a certain way.

For instance, for protocol 111 it will be 15-1 = 14 and 29 - 15 = 14. The same logic should be apply for all numbers in the column for a specific protocol, because I will have more for each protocol.

  • Then I want to calculate the average of differences for each protocol. 14 + 14 = 28/3 = 9.3. 3 means 3 unique values for calculating differences – 1, 15, 29.

For protocol 222 it will be 28-14 = 14 divided by 2 (the number of unique values). Result 7.

Expected output

Protocol Average difference
111       9.3 
222       7 

>Solution :

Use DataFrame.drop_duplicates by both columns first and then aggregate lambda function for mean of differncies:

#remove traling whitespaces in columns names
df.columns = df.columns.str.strip()

df1 = (df.drop_duplicates(['Protocol','Day'])
         .groupby('Protocol')['Day']
         .agg(lambda x: x.diff().sum() / len(x))
         .reset_index(name='Average difference'))
print (df1)
   Protocol  Average difference
0       111            9.333333
1       222            7.000000

Or:

df1 = (df.drop_duplicates(['Protocol','Day'])
         .groupby('Protocol')['Day']
         .agg(lambda x: x.diff().fillna(0).mean())
         .reset_index(name='Average difference'))
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading