Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Conditional math operations with columns in a pandas dataframe

I have a bunch of columns in my dataframe with different values, as seen in this sample:

Especies  Especies_0  Especies_1  Especies_2  Especies_3
2.20        3.44        1.90        1.24        0.00
2.20        3.04        2.55        0.00        0.00
1.88        2.19        0.00        0.00        0.00
2.20        3.44        2.28        2.55        0.00
3.44        2.20        0.00        0.00        0.00
2.20        2.58        0.00        0.00        0.00
1.88        2.19        0.00        0.00        0.00
3.44        1.91        3.04        1.83        3.98
3.44        2.20        0.00        0.00        0.00
2.20        2.55        1.90        0.00        0.00
1.88        2.20        0.00        0.00        0.00

The operation i want to perform is:

avg(abs(max - col) for col in cols)

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

where max is the maximum value of the columns in each row (for example, for the first row, max would be 3.44 and cols is the rest of the values in the columns), abs is the absolute function and avg means taking the average.

For example, for the first row, the operation would be: ((3.44-2.20)+(3.44-1.90)+(3.44-1.24))/3 = 1.66

and for the 5th row, with values (3.44, 2.20, 0.00, 0.00, 0.00) the result would be: (3.44 -2.20) /1 = 1.24

This is simple enough, but there’s a catch, I don’t want to consider the column of the max value, or any columns with 0.0 in them (take into account that the max value column changes, it’s not always the same as do the number of columns with 0.0 in them).

I have managed to do it with single, scalar values, I even did a function that does that

def ele_diff(esp0, esp1, esp2, esp3, esp4):
    species = sorted([esp0, esp1, esp2, esp3, esp4])
    diff = [species[-1] - spec for spec in species if spec != 0.0 and spec !=species[-1]]
    return (sum(diff)/len(diff))

But I’m not able to apply my function to the dataframe. I’ve tried df.apply() and df.applymap(), but they don’t seem to work with the function I’ve made (applymap considers only 1 input and 1 output, while apply does not feed the function with each row separatedly, so the function returns ValueError because the truth value of a series is ambiguous).

I’ve also tried to do it directly with the dataframe, but as it’s got complex logic, I haven’t been able to come with a solution.

The main problem I’ve faced seems to be with checking that the values I’m going to substract are not 0.0 or the maximum.

>Solution :

First filter only necessary columns by starts substring Especies in DataFrame.filter, set missing values for maximal and 0 values, subtract maximal, convert to absolute values and last get average:

df1 = df.filter(regex='^Especies')

#or if necessary specify columns names in list
#cols = ['Especies', 'Especies_0', 'Especies_1', 'Especies_2', 'Especies_3']
#df1 = df[cols]

max1 = df1.max(axis=1)
mask = df1.ne(max1, axis=0) & df1.ne(0)
df['new'] = df1.where(mask).sub(max1, axis=0).abs().mean(axis=1)
print (df)
    Especies  Especies_0  Especies_1  Especies_2  Especies_3       new
0       2.20        3.44        1.90        1.24        0.00  1.660000
1       2.20        3.04        2.55        0.00        0.00  0.665000
2       1.88        2.19        0.00        0.00        0.00  0.310000
3       2.20        3.44        2.28        2.55        0.00  1.096667
4       3.44        2.20        0.00        0.00        0.00  1.240000
5       2.20        2.58        0.00        0.00        0.00  0.380000
6       1.88        2.19        0.00        0.00        0.00  0.310000
7       3.44        1.91        3.04        1.83        3.98  1.425000
8       3.44        2.20        0.00        0.00        0.00  1.240000
9       2.20        2.55        1.90        0.00        0.00  0.500000
10      1.88        2.20        0.00        0.00        0.00  0.320000

Details:

print (df1.where(mask))
    Especies  Especies_0  Especies_1  Especies_2  Especies_3
0       2.20         NaN        1.90        1.24         NaN
1       2.20         NaN        2.55         NaN         NaN
2       1.88         NaN         NaN         NaN         NaN
3       2.20         NaN        2.28        2.55         NaN
4        NaN        2.20         NaN         NaN         NaN
5       2.20         NaN         NaN         NaN         NaN
6       1.88         NaN         NaN         NaN         NaN
7       3.44        1.91        3.04        1.83         NaN
8        NaN        2.20         NaN         NaN         NaN
9       2.20         NaN        1.90         NaN         NaN
10      1.88         NaN         NaN         NaN         NaN

print (df1.where(mask).sub(max1, axis=0))
    Especies  Especies_0  Especies_1  Especies_2  Especies_3
0      -1.24         NaN       -1.54       -2.20         NaN
1      -0.84         NaN       -0.49         NaN         NaN
2      -0.31         NaN         NaN         NaN         NaN
3      -1.24         NaN       -1.16       -0.89         NaN
4        NaN       -1.24         NaN         NaN         NaN
5      -0.38         NaN         NaN         NaN         NaN
6      -0.31         NaN         NaN         NaN         NaN
7      -0.54       -2.07       -0.94       -2.15         NaN
8        NaN       -1.24         NaN         NaN         NaN
9      -0.35         NaN       -0.65         NaN         NaN
10     -0.32         NaN         NaN         NaN         NaN
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading