I am confused about different results of boolean indexing when using ~ after != versus when using just ==
I have a pandas df with 4 columns:
dic = {
"a": [1,1,1,0,0,1,1],
"b": [0,0,1,1,0,0,0],
"c": [1,0,1,0,0,1,0],
"d": [0,0,1,0,0,1,0],
}
df = pd.DataFrame(data=dic)
print(df)
a b c d
0 1 0 1 0
1 1 0 0 0
2 1 1 1 1
3 0 1 0 0
4 0 0 0 0
5 1 0 1 1
6 1 0 0 0
I want to subset the whole df dataframe:
I want to remove all rows which have all elements zero, but just on the columns b c d, and not on a.
If I use ~ (not) operator after == I get the desired result:
names = ["b","c","d"]
df_A = df.loc[~(df[names] == 0.0).all(axis=1)]
print(df_A)
a b c d
0 1 0 1 0
2 1 1 1 1
3 0 1 0 0
5 1 0 1 1
But when I use just == I get different result:
names = ["b","c","d"]
df_B = df.loc[(df[names] != 0.0).all(axis=1)]
print(df_B)
a b c d
2 1 1 1 1
Do you have any idea why is this the case? Should these two not be the same?
Thank you.
>Solution :
You’re not correctly following De Morgan’s law.
- not (A or B) = (not A) and (not B)
- not (A and B) = (not A) or (not B)
If you use the opposite condition as input, you have to replace all (AND) by any (OR):
df_B = df.loc[(df[names] != 0.0).any(axis=1)]
In English this would be:
- I want to REMOVE (
~) rows for which ALL values are 0 (== 0) - I want to KEEP rows for which ANY value is NOT 0 (
!= 0)
In short the rules for inverting input conditions are:
- swapping strict/equal operators (e.g.
>=becomes<;>becomes<=) - AND become OR and OR becomes AND
- adding/removing NOT (
~)
Output:
a b c d
0 1 0 1 0
2 1 1 1 1
3 0 1 0 0
5 1 0 1 1