I’m working with a pandas DataFrame and I noticed a difference in behavior when using the in operator.
Here’s an example to illustrate this:
import pandas as pd
df = pd.DataFrame({'a': [4, 5, 6], 'b': [7, 8, 9]})
print(1 in df)
print(type(df))
print(1 in df["a"])
print(type(df["a"]))
Output:
False
<class 'pandas.core.frame.DataFrame'>
True
<class 'pandas.core.series.Series'>
The obvious difference is of course that one object is a DataFrame and the other a Series; nonetheless, I was not expecting 1 to be found in the index of the Series and the expression to evaluate to True, especially since the same check is False for the DataFrame.
Is there an explanation why I should have been expecting this?
>Solution :
In both cases the in operator calls __contains__ to test membership. Both pd.DataFrame and pd.Series are subclasses of NDFrame, which has this method defined as follows:
def __contains__(self, key) -> bool:
    """True if the key is in the info axis"""
    return key in self._info_axis
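You can verify the shared base class yourself; a minimal check, assuming a recent pandas version where NDFrame lives in pandas.core.generic:

```python
import pandas as pd
from pandas.core.generic import NDFrame

# Both containers inherit membership testing from the same base class
print(issubclass(pd.DataFrame, NDFrame))  # True
print(issubclass(pd.Series, NDFrame))     # True
```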
So, under the hood the following happens:
print(df._info_axis)
# Index(['a', 'b'], dtype='object')
print(df.__contains__(1))
# False
# approx.: 1 in ['a', 'b']
print(df['a']._info_axis)
# RangeIndex(start=0, stop=3, step=1)
print(df['a'].__contains__(1))
# True
# approx.: 1 in [0, 1, 2]
I.e., the difference comes down to what each object uses as its ‘info axis’: a DataFrame uses df.columns, while a Series such as df['a'] naturally must use its index (series.index).
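As a practical consequence, if you want to test for a value rather than an index label, a sketch of the usual alternatives (using the same df as above):

```python
import pandas as pd

df = pd.DataFrame({'a': [4, 5, 6], 'b': [7, 8, 9]})

# 'in' checks the info axis (the index labels), not the values:
print(4 in df['a'])             # False: 4 is not an index label
# To test membership among the values instead:
print(4 in df['a'].values)      # True
print((df['a'] == 4).any())     # True
print(df['a'].isin([4]).any())  # True
```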