Pandas mask with composite expression behaviour

This question was previously asked (and then deleted) by another user. I was looking for a solution so I could post an answer when the question disappeared. Moreover, I can't make sense of pandas' behaviour, so I would appreciate some clarity. The original question stated something along the lines of:

How can I replace every negative value except those in a given list with NaN in a Pandas dataframe?

My setup to reproduce the scenario is the following:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A' : [x for x in range(4)],
    'B' : [x for x in range(-2, 2)]
})

This should technically only be a matter of passing the correct boolean expression to DataFrame.where (or, equivalently, to df[...]). My attempted solution looks like:

df[df >= 0 | df.isin([-2])] 

which produces:

index A B
0 0 NaN
1 1 NaN
2 2 0
3 3 1

which also replaces the number in the list (-2) with NaN!

Moreover, if I mask the dataframe with each of the two conditions separately, I get the correct behaviour:

with df[df >= 0] (identical to the compound result)

index A B
0 0 NaN
1 1 NaN
2 2 0
3 3 1

with df[df.isin([-2])] (only the listed value is kept)

index A B
0 NaN -2.0
1 NaN NaN
2 NaN NaN
3 NaN NaN

So it seems that either:

  1. I am running into some undefined behaviour as a result of performing logic on NaN values, or
  2. I have got something wrong.

Can anyone clarify this situation for me?

Solution

df[(df >= 0) | (df.isin([-2]))] 

Explanation

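In Python, bitwise OR (|) has higher operator precedence than comparison operators like >=, so df >= 0 | df.isin([-2]) is parsed as df >= (0 | df.isin([-2])) rather than as two separate conditions joined by |. See: https://docs.python.org/3/reference/expressions.html#operator-precedence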
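To make the precedence point concrete, here is a minimal sketch that reuses the DataFrame from the question. It compares the unparenthesized attempt with the expression Python actually evaluates, and then builds the intended mask. The variable names (unparenthesized, explicit_parse, intended) are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': list(range(4)),
    'B': list(range(-2, 2)),
})

# The attempt from the question: | is applied before >=.
unparenthesized = df >= 0 | df.isin([-2])

# The same expression with the implicit grouping written out.
explicit_parse = df >= (0 | df.isin([-2]))

# The intended grouping: two boolean masks combined with |.
intended = (df >= 0) | df.isin([-2])

# Both spellings of the attempt produce the same mask.
print(unparenthesized.equals(explicit_parse))

# The intended mask keeps non-negative values and the listed value -2.
print(intended)
```

In the unparenthesized form the values of df end up being compared against a boolean mask instead of against 0, which would explain why -2 is dropped in the question's output.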

When filtering a pandas DataFrame on multiple boolean conditions, you need to enclose each condition in parentheses. More from the boolean indexing section of the pandas user guide:

Another common operation is the use of boolean vectors to filter the
data. The operators are: | for or, & for and, and ~ for not. These
must be grouped by using parentheses, since by default Python will
evaluate an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3).
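One way to sidestep the precedence issue altogether is to give each condition its own name before combining them. The sketch below assumes a hypothetical helper, mask_negatives_except (not part of pandas), that answers the original question using DataFrame.where:

```python
import pandas as pd


def mask_negatives_except(frame, keep_values):
    """Replace negative entries with NaN unless they appear in keep_values.

    Hypothetical helper for illustration; the name is not a pandas API.
    """
    non_negative = frame >= 0
    listed = frame.isin(keep_values)
    # DataFrame.where keeps entries where the condition is True
    # and puts NaN everywhere else.
    return frame.where(non_negative | listed)


df = pd.DataFrame({
    'A': list(range(4)),
    'B': list(range(-2, 2)),
})

result = mask_negatives_except(df, [-2])
print(result)
```

Because each mask is computed on its own line, there is no mixed expression for Python to regroup, and the final | only ever sees two boolean DataFrames.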
