This question was previously asked (and then deleted) by another user. I was working on a solution so I could post an answer when the question disappeared. I also can't make sense of pandas' behaviour, so I would appreciate some clarity. The original question stated something along the lines of:
How can I replace every negative value except those in a given list with NaN in a Pandas dataframe?
My setup to reproduce the scenario is the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': list(range(4)),
    'B': list(range(-2, 2)),
})
This should technically only be a matter of passing the correct boolean expression to DataFrame.where (or, equivalently, to boolean indexing). My attempted solution looks like:
df[df >= 0 | df.isin([-2])]
which produces:
| index | A | B |
|---|---|---|
| 0 | 0 | NaN |
| 1 | 1 | NaN |
| 2 | 2 | 0 |
| 3 | 3 | 1 |
which also replaces the number in the list (-2) with NaN!
Moreover, if I mask the dataframe with each of the two conditions separately, I get the correct behavior:
With df[df >= 0] (identical to the compound result):
| index | A | B |
|---|---|---|
| 0 | 0 | NaN |
| 1 | 1 | NaN |
| 2 | 2 | 0 |
| 3 | 3 | 1 |
With df[df.isin([-2])] (only the -2 from the list is kept):
| index | A | B |
|---|---|---|
| 0 | NaN | -2.0 |
| 1 | NaN | NaN |
| 2 | NaN | NaN |
| 3 | NaN | NaN |
So it seems like I am either:
- running into some undefined behaviour as a result of performing logic on NaN values, or
- getting something wrong myself.
Can anyone clarify this situation for me?
>Solution :
Solution
df[(df >= 0) | (df.isin([-2]))]
Explanation
In Python, the bitwise OR operator | has a higher precedence than comparison operators like >=, so df >= 0 | df.isin([-2]) is evaluated as df >= (0 | df.isin([-2])): https://docs.python.org/3/reference/expressions.html#operator-precedence
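To see the precedence rule without pandas in the way, here is a minimal sketch with plain integers. The variable names are only for illustration:

```python
# | binds tighter than >=, so the unparenthesized form groups differently.
unparenthesized = 1 >= 0 | 2   # parsed as 1 >= (0 | 2), i.e. 1 >= 2
parenthesized = (1 >= 0) | 2   # parsed as True | 2, i.e. 1 | 2

print(unparenthesized)  # False
print(parenthesized)    # 3
```

The same grouping happens with DataFrames; the result just happens to be another boolean mask, so nothing raises an error.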
When filtering a pandas DataFrame on multiple boolean conditions, you need to enclose each condition in parentheses. More from the boolean indexing section of the pandas user guide:
Another common operation is the use of boolean vectors to filter the
data. The operators are: | for or, & for and, and ~ for not. These
must be grouped by using parentheses, since by default Python will
evaluate an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3).
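Putting it together with the DataFrame from the question, the following sketch compares the mask the original expression actually builds with the intended one. The variable names (implicit, explicit_grouping, intended, and so on) are mine and not part of the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": list(range(4)), "B": list(range(-2, 2))})

# The expression from the question, and the grouping Python actually applies.
implicit = df >= 0 | df.isin([-2])
explicit_grouping = df >= (0 | df.isin([-2]))
print(implicit.equals(explicit_grouping))  # both masks are the same

# The intended mask: non-negative values OR values in the allowed list.
intended = (df >= 0) | df.isin([-2])

# Boolean indexing with a DataFrame mask and DataFrame.where give the same result.
result = df[intended]
same_via_where = df.where(intended)
print(result)
```

With the parenthesized mask, the -2 in column B is kept and only the -1 becomes NaN.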