Home Calculating the differences between unique values in the column and averaging by number of unique values with grouping by another column in Python

Questions

Calculating the differences between unique values in the column and averaging by number of unique values with grouping by another column in Python

byMR

March 14, 2022

I have a dataframe and need to do some custom calculations.

Protocol    Visit   Day
111 1   1
111 1   1
111 1   1
111 2   15
111 2   15
111 2   15
111 3   29
111 3   29
222 2   14
222 2   14
222 2   14
222 3   28
222 3   28
222 3   28

Reproducible input:

{'Protocol ': {0: 111, 1: 111, 2: 111, 3: 111, 4: 111, 5: 111, 6: 111, 7: 111, 8: 222, 9: 222, 10: 222, 11: 222, 12: 222, 13: 222, 14: 222}, 'Visit': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 1, 9: 2, 10: 2, 11: 2, 12: 3, 13: 3, 14: 3}, 'Day': {0: 1, 1: 1, 2: 1, 3: 15, 4: 15, 5: 15, 6: 29, 7: 29, 8: 1, 9: 14, 10: 14, 11: 14, 12: 28, 13: 28, 14: 28}}

What do I want:

Calculate the difference between unique values between values in Day column in a certain way.

For instance, for protocol 111 it will be 15-1 = 14 and 29 - 15 = 14. The same logic should be apply for all numbers in the column for a specific protocol, because I will have more for each protocol.

Then I want to calculate the average of differences for each protocol. 14 + 14 = 28/3 = 9.3. 3 means 3 unique values for calculating differences – 1, 15, 29.

For protocol 222 it will be 28-14 = 14 divided by 2 (the number of unique values). Result 7.

Expected output

Protocol Average difference
111       9.3 
222       7

>Solution :

Use DataFrame.drop_duplicates by both columns first and then aggregate lambda function for mean of differncies:

#remove traling whitespaces in columns names
df.columns = df.columns.str.strip()

df1 = (df.drop_duplicates(['Protocol','Day'])
         .groupby('Protocol')['Day']
         .agg(lambda x: x.diff().sum() / len(x))
         .reset_index(name='Average difference'))
print (df1)
   Protocol  Average difference
0       111            9.333333
1       222            7.000000

Or:

df1 = (df.drop_duplicates(['Protocol','Day'])
         .groupby('Protocol')['Day']
         .agg(lambda x: x.diff().fillna(0).mean())
         .reset_index(name='Average difference'))