I’m working with a pandas DataFrame and I noticed a difference in behavior when using the in operator.
Here’s an example to illustrate this:
import pandas as pd
df = pd.DataFrame({'a': [4, 5, 6], 'b': [7, 8, 9]})
print(1 in df)
print(type(df))
print(1 in df["a"])
print(type(df["a"]))
Output:
False
<class 'pandas.core.frame.DataFrame'>
True
<class 'pandas.core.series.Series'>
The obvious difference is of course that one object is a DataFrame and the other a Series; nonetheless, I was not expecting 1 to be found in the index of the Series and the expression to evaluate to True, especially since the same check is False for the DataFrame.
Is there an explanation why I should have been expecting this?
>Solution :
In both cases the in operator calls __contains__ to test membership. Both pd.DataFrame and pd.Series are subclasses of NDFrame, which has this method defined as follows:
def __contains__(self, key) -> bool:
    """True if the key is in the info axis"""
    return key in self._info_axis
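You can verify the shared base class yourself; a minimal check, assuming a recent pandas version where NDFrame lives in pandas.core.generic:

```python
import pandas as pd
from pandas.core.generic import NDFrame

# Both containers inherit membership testing from the same base class
print(issubclass(pd.DataFrame, NDFrame))  # True
print(issubclass(pd.Series, NDFrame))     # True
```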
So, under the hood the following happens:
print(df._info_axis)
# Index(['a', 'b'], dtype='object')
print(df.__contains__(1))
# False
# approx.: 1 in ['a', 'b']
print(df['a']._info_axis)
# RangeIndex(start=0, stop=3, step=1)
print(df['a'].__contains__(1))
# True
# approx.: 1 in [0, 1, 2]
I.e., the difference comes down to what each object uses as its ‘info axis’: a DataFrame uses df.columns, while a Series such as df['a'] naturally must use its index (series.index).
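As a practical consequence, if you want to test for a value rather than an index label, a sketch of the usual alternatives (using the same df as above):

```python
import pandas as pd

df = pd.DataFrame({'a': [4, 5, 6], 'b': [7, 8, 9]})

# 'in' checks the info axis (the index labels), not the values:
print(4 in df['a'])             # False: 4 is not an index label
# To test membership among the values instead:
print(4 in df['a'].values)      # True
print((df['a'] == 4).any())     # True
print(df['a'].isin([4]).any())  # True
```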