I have a bunch of columns in my dataframe with different values, as seen in this sample:
Especies Especies_0 Especies_1 Especies_2 Especies_3
2.20 3.44 1.90 1.24 0.00
2.20 3.04 2.55 0.00 0.00
1.88 2.19 0.00 0.00 0.00
2.20 3.44 2.28 2.55 0.00
3.44 2.20 0.00 0.00 0.00
2.20 2.58 0.00 0.00 0.00
1.88 2.19 0.00 0.00 0.00
3.44 1.91 3.04 1.83 3.98
3.44 2.20 0.00 0.00 0.00
2.20 2.55 1.90 0.00 0.00
1.88 2.20 0.00 0.00 0.00
The operation i want to perform is:
avg(abs(max - col) for col in cols)
where max is the maximum value of the columns in each row (for example, for the first row, max would be 3.44 and cols is the rest of the values in the columns), abs is the absolute function and avg means taking the average.
For example, for the first row, the operation would be: ((3.44-2.20)+(3.44-1.90)+(3.44-1.24))/3 = 1.66
and for the 5th row, with values (3.44, 2.20, 0.00, 0.00, 0.00) the result would be: (3.44 -2.20) /1 = 1.24
This is simple enough, but there’s a catch, I don’t want to consider the column of the max value, or any columns with 0.0 in them (take into account that the max value column changes, it’s not always the same as do the number of columns with 0.0 in them).
I have managed to do it with single, scalar values, I even did a function that does that
def ele_diff(esp0, esp1, esp2, esp3, esp4):
species = sorted([esp0, esp1, esp2, esp3, esp4])
diff = [species[-1] - spec for spec in species if spec != 0.0 and spec !=species[-1]]
return (sum(diff)/len(diff))
But I’m not able to apply my function to the dataframe. I’ve tried df.apply() and df.applymap(), but they don’t seem to work with the function I’ve made (applymap considers only 1 input and 1 output, while apply does not feed the function with each row separatedly, so the function returns ValueError because the truth value of a series is ambiguous).
I’ve also tried to do it directly with the dataframe, but as it’s got complex logic, I haven’t been able to come with a solution.
The main problem I’ve faced seems to be with checking that the values I’m going to substract are not 0.0 or the maximum.
>Solution :
First filter only necessary columns by starts substring Especies in DataFrame.filter, set missing values for maximal and 0 values, subtract maximal, convert to absolute values and last get average:
df1 = df.filter(regex='^Especies')
#or if necessary specify columns names in list
#cols = ['Especies', 'Especies_0', 'Especies_1', 'Especies_2', 'Especies_3']
#df1 = df[cols]
max1 = df1.max(axis=1)
mask = df1.ne(max1, axis=0) & df1.ne(0)
df['new'] = df1.where(mask).sub(max1, axis=0).abs().mean(axis=1)
print (df)
Especies Especies_0 Especies_1 Especies_2 Especies_3 new
0 2.20 3.44 1.90 1.24 0.00 1.660000
1 2.20 3.04 2.55 0.00 0.00 0.665000
2 1.88 2.19 0.00 0.00 0.00 0.310000
3 2.20 3.44 2.28 2.55 0.00 1.096667
4 3.44 2.20 0.00 0.00 0.00 1.240000
5 2.20 2.58 0.00 0.00 0.00 0.380000
6 1.88 2.19 0.00 0.00 0.00 0.310000
7 3.44 1.91 3.04 1.83 3.98 1.425000
8 3.44 2.20 0.00 0.00 0.00 1.240000
9 2.20 2.55 1.90 0.00 0.00 0.500000
10 1.88 2.20 0.00 0.00 0.00 0.320000
Details:
print (df1.where(mask))
Especies Especies_0 Especies_1 Especies_2 Especies_3
0 2.20 NaN 1.90 1.24 NaN
1 2.20 NaN 2.55 NaN NaN
2 1.88 NaN NaN NaN NaN
3 2.20 NaN 2.28 2.55 NaN
4 NaN 2.20 NaN NaN NaN
5 2.20 NaN NaN NaN NaN
6 1.88 NaN NaN NaN NaN
7 3.44 1.91 3.04 1.83 NaN
8 NaN 2.20 NaN NaN NaN
9 2.20 NaN 1.90 NaN NaN
10 1.88 NaN NaN NaN NaN
print (df1.where(mask).sub(max1, axis=0))
Especies Especies_0 Especies_1 Especies_2 Especies_3
0 -1.24 NaN -1.54 -2.20 NaN
1 -0.84 NaN -0.49 NaN NaN
2 -0.31 NaN NaN NaN NaN
3 -1.24 NaN -1.16 -0.89 NaN
4 NaN -1.24 NaN NaN NaN
5 -0.38 NaN NaN NaN NaN
6 -0.31 NaN NaN NaN NaN
7 -0.54 -2.07 -0.94 -2.15 NaN
8 NaN -1.24 NaN NaN NaN
9 -0.35 NaN -0.65 NaN NaN
10 -0.32 NaN NaN NaN NaN