Calculate %-deviation with values from a pandas Dataframe

November 25, 2022

I am fairly new to python and I have the following dataframe

    setting_id  subject_id  seconds  result_id  owner_id  average  duration_id
0            7           1        0     1680.5       2.0   24.000          1.0
1            7           1     3600     1690.5       2.0   46.000          2.0
2            7           1    10800     1700.5       2.0  101.000          4.0
3            7           2        0     1682.5       2.0   12.500          1.0
4            7           2     3600     1692.5       2.0   33.500          2.0
5            7           2    10800     1702.5       2.0   86.500          4.0
6            7           3        0     1684.5       2.0    8.500          1.0
7            7           3     3600     1694.5       2.0   15.000          2.0
8            7           3    10800     1704.5       2.0   34.000          4.0

What I need to do is Calculate the deviation (%) from averages with a "seconds"-value not equal to 0 from those averages with a seconds value of zero, where the subject_id and Setting_id are the same

i.e. setting_id ==7 & subject_id ==1 would be:

(result/baseline)*100

——> for 3600 seconds: (46/24)*100 = +192%

——> for 10800 seconds: (101/24)*100 = +421%

…. baseline = average-result with a seconds value of 0

…. result = average-result with a seconds value other than 0

The resulting df should look like this

   setting_id  subject_id  seconds  owner_id  average  deviation  duration_id
0           7           1        0         2       24          0            1
1           7           1     3600         2       46        192            2
2           7           1    10800         2      101        421            4

I want to use these calculations then to plot a regression graph (with seaborn) of deviations from baseline

I have played around with this df for 2 days now and tried different forloops but I just can´t figure out the correct way.

>Solution :

You can use:

# identify rows with 0
m = df['seconds'].eq(0)
# compute the sum of rows with 0
s = (df['average'].where(m)
     .groupby([df['setting_id'], df['subject_id']])
     .sum()
    )

# compute the deviation per group
deviation = (
 df[['setting_id', 'subject_id']]
 .merge(s, left_on=['setting_id', 'subject_id'], right_index=True, how='left')
 ['average']
 .rdiv(df['average']).mul(100)
 .round().astype(int) # optional
 .mask(m, 0)
)

df['deviation'] = deviation
# or
# out = df.assign(deviation=deviation)

Output:

   setting_id  subject_id  seconds  result_id  owner_id  average  duration_id  deviation
0           7           1        0     1680.5       2.0     24.0          1.0          0
1           7           1     3600     1690.5       2.0     46.0          2.0        192
2           7           1    10800     1700.5       2.0    101.0          4.0        421
3           7           2        0     1682.5       2.0     12.5          1.0          0
4           7           2     3600     1692.5       2.0     33.5          2.0        268
5           7           2    10800     1702.5       2.0     86.5          4.0        692
6           7           3        0     1684.5       2.0      8.5          1.0          0
7           7           3     3600     1694.5       2.0     15.0          2.0        176
8           7           3    10800     1704.5       2.0     34.0          4.0        400