Conditional math operations with columns in a pandas dataframe

June 12, 2023

I have a bunch of columns in my dataframe with different values, as seen in this sample:

Especies  Especies_0  Especies_1  Especies_2  Especies_3
2.20        3.44        1.90        1.24        0.00
2.20        3.04        2.55        0.00        0.00
1.88        2.19        0.00        0.00        0.00
2.20        3.44        2.28        2.55        0.00
3.44        2.20        0.00        0.00        0.00
2.20        2.58        0.00        0.00        0.00
1.88        2.19        0.00        0.00        0.00
3.44        1.91        3.04        1.83        3.98
3.44        2.20        0.00        0.00        0.00
2.20        2.55        1.90        0.00        0.00
1.88        2.20        0.00        0.00        0.00

The operation i want to perform is:

avg(abs(max - col) for col in cols)

where max is the maximum value of the columns in each row (for example, for the first row, max would be 3.44 and cols is the rest of the values in the columns), abs is the absolute function and avg means taking the average.

For example, for the first row, the operation would be: ((3.44-2.20)+(3.44-1.90)+(3.44-1.24))/3 = 1.66

and for the 5th row, with values (3.44, 2.20, 0.00, 0.00, 0.00) the result would be: (3.44 -2.20) /1 = 1.24

This is simple enough, but there’s a catch, I don’t want to consider the column of the max value, or any columns with 0.0 in them (take into account that the max value column changes, it’s not always the same as do the number of columns with 0.0 in them).

I have managed to do it with single, scalar values, I even did a function that does that

def ele_diff(esp0, esp1, esp2, esp3, esp4):
    species = sorted([esp0, esp1, esp2, esp3, esp4])
    diff = [species[-1] - spec for spec in species if spec != 0.0 and spec !=species[-1]]
    return (sum(diff)/len(diff))

But I’m not able to apply my function to the dataframe. I’ve tried df.apply() and df.applymap(), but they don’t seem to work with the function I’ve made (applymap considers only 1 input and 1 output, while apply does not feed the function with each row separatedly, so the function returns ValueError because the truth value of a series is ambiguous).

I’ve also tried to do it directly with the dataframe, but as it’s got complex logic, I haven’t been able to come with a solution.

The main problem I’ve faced seems to be with checking that the values I’m going to substract are not 0.0 or the maximum.

>Solution :

First filter only necessary columns by starts substring Especies in DataFrame.filter, set missing values for maximal and 0 values, subtract maximal, convert to absolute values and last get average:

df1 = df.filter(regex='^Especies')

#or if necessary specify columns names in list
#cols = ['Especies', 'Especies_0', 'Especies_1', 'Especies_2', 'Especies_3']
#df1 = df[cols]

max1 = df1.max(axis=1)
mask = df1.ne(max1, axis=0) & df1.ne(0)
df['new'] = df1.where(mask).sub(max1, axis=0).abs().mean(axis=1)
print (df)
    Especies  Especies_0  Especies_1  Especies_2  Especies_3       new
0       2.20        3.44        1.90        1.24        0.00  1.660000
1       2.20        3.04        2.55        0.00        0.00  0.665000
2       1.88        2.19        0.00        0.00        0.00  0.310000
3       2.20        3.44        2.28        2.55        0.00  1.096667
4       3.44        2.20        0.00        0.00        0.00  1.240000
5       2.20        2.58        0.00        0.00        0.00  0.380000
6       1.88        2.19        0.00        0.00        0.00  0.310000
7       3.44        1.91        3.04        1.83        3.98  1.425000
8       3.44        2.20        0.00        0.00        0.00  1.240000
9       2.20        2.55        1.90        0.00        0.00  0.500000
10      1.88        2.20        0.00        0.00        0.00  0.320000

Details:

print (df1.where(mask))
    Especies  Especies_0  Especies_1  Especies_2  Especies_3
0       2.20         NaN        1.90        1.24         NaN
1       2.20         NaN        2.55         NaN         NaN
2       1.88         NaN         NaN         NaN         NaN
3       2.20         NaN        2.28        2.55         NaN
4        NaN        2.20         NaN         NaN         NaN
5       2.20         NaN         NaN         NaN         NaN
6       1.88         NaN         NaN         NaN         NaN
7       3.44        1.91        3.04        1.83         NaN
8        NaN        2.20         NaN         NaN         NaN
9       2.20         NaN        1.90         NaN         NaN
10      1.88         NaN         NaN         NaN         NaN

print (df1.where(mask).sub(max1, axis=0))
    Especies  Especies_0  Especies_1  Especies_2  Especies_3
0      -1.24         NaN       -1.54       -2.20         NaN
1      -0.84         NaN       -0.49         NaN         NaN
2      -0.31         NaN         NaN         NaN         NaN
3      -1.24         NaN       -1.16       -0.89         NaN
4        NaN       -1.24         NaN         NaN         NaN
5      -0.38         NaN         NaN         NaN         NaN
6      -0.31         NaN         NaN         NaN         NaN
7      -0.54       -2.07       -0.94       -2.15         NaN
8        NaN       -1.24         NaN         NaN         NaN
9      -0.35         NaN       -0.65         NaN         NaN
10     -0.32         NaN         NaN         NaN         NaN