Why is no warning thrown for indexing a Series of values with a bool Series that's too long?

I have the following code:

import pandas as pd

series_source = pd.Series([1, 2, 3, 4], dtype=int)
normal_index = pd.Series([True, False, True, True], dtype=bool)
big_index = pd.Series([True, False, True, True, False, True], dtype=bool)

# Both indexes give back the values [1, 3, 4] (labels 0, 2, 3)
# no warnings are raised!
assert (series_source[normal_index] == series_source[big_index]).all() 

df_source = pd.DataFrame(
    [
        [1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]
    ]
)

# no warning - works as expected: grabs rows 0, 2, and 3
df_normal_result = df_source[normal_index]

# UserWarning: Boolean Series key will be reindexed to match DataFrame index.
# (but still runs)
df_big_result = df_source[big_index]

# passes - they are equivalent
assert df_normal_result.equals(df_big_result)
print("Complete")

Why is it that indexing the series_source with the big_index doesn’t raise a warning, even though the big index has more values than the source? What is pandas doing under the hood in order to do the Series indexing?

(Contrast this to indexing the df_source, where an explicit warning is raised that big_index needs to be re-indexed in order for the operation to work.)

The indexing docs claim that:

Using a boolean vector to index a Series works exactly as in a NumPy ndarray

However, if I do

import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([True, False, True, True, False])
c = np.array([True, False, True, True, False, True, True])

# returns an ndarray of [1, 3, 4] as expected
print(a[b])

# raises IndexError: boolean index did not match indexed array along axis 0;
# size of axis is 5 but size of corresponding boolean axis is 7
print(a[c])

So this behavior does not seem to match NumPy, as the docs claim. What’s going on?

(My versions are pandas==2.2.2 and numpy==2.0.0.)

Solution:

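Because the boolean Series used as the indexer is first aligned to the index of the DataFrame/Series being indexed. Labels in the indexer that do not exist in the target are dropped, which is why the extra values in big_index are silently ignored.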

In short, pandas is doing:

tmp = big_index.reindex(df_source.index)
df_big_result = df_source[tmp]

Example for a Series:

pd.Series([0,1,2])[pd.Series([True, True, False], index=[1,2,0])]

#  1    1
#  2    2
#  dtype: int64
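
To make the alignment concrete for the data in the question, here is a minimal sketch that performs the reindexing by hand. It assumes the pandas 2.x behavior described above; `aligned` is just an illustrative name.

```python
import pandas as pd

series_source = pd.Series([1, 2, 3, 4], dtype=int)
big_index = pd.Series([True, False, True, True, False, True], dtype=bool)

# Align the mask to the source's labels by hand. Labels 4 and 5 do not
# exist in series_source, so their values are dropped.
aligned = big_index.reindex(series_source.index)

print(aligned.tolist())                    # [True, False, True, True]
print(series_source[aligned].tolist())     # [1, 3, 4]
print(series_source[big_index].tolist())   # same values as the line above
```

The over-long mask and the hand-aligned mask select the same rows, because only the labels shared with series_source matter.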

You can actually observe this yourself if you change the indices of the indexing Series:

big_index2 = pd.Series([False, False, True, True, True, True],
                       index=[4,5,0,1,2,3], dtype=bool)
df_source[big_index2]

Output:

    0   1   2   3
0   1   2   3   4
1   5   6   7   8
2   9  10  11  12
3  13  14  15  16

We have 4 rows in the output, despite the first two values being False. After reindexing to the labels of df_source (0, 1, 2, 3), the boolean values are [True, True, True, True].

You should get a warning in this case:

UserWarning: Boolean Series key will be reindexed to match DataFrame index.

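Note that if alignment cannot be done (the boolean Series is missing labels of the indexed object, or the key is an unlabeled list of the wrong length), then an error will be raised, as in NumPy: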

pd.Series([0,1,2])[pd.Series([True, True, False], index=[1,2,3])]
# IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

pd.Series([0,1,2])[[True, False, False, True]]
# IndexError: Boolean index has wrong length: 4 instead of 3
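
If NumPy-style strictness is what you want, one option is to strip the labels from the mask before indexing, for example with `Series.to_numpy()`. This is a sketch of a workaround, not something pandas does for you: with no index on the mask there is nothing to align on, so the length check from the second example applies.

```python
import pandas as pd

series_source = pd.Series([1, 2, 3, 4], dtype=int)
normal_index = pd.Series([True, False, True, True], dtype=bool)
big_index = pd.Series([True, False, True, True, False, True], dtype=bool)

# A plain ndarray mask carries no labels, so it is applied by position.
strict_result = series_source[normal_index.to_numpy()]
print(strict_result.tolist())  # [1, 3, 4]

# The over-long mask is now rejected instead of being silently aligned.
try:
    series_source[big_index.to_numpy()]
    raised = False
except IndexError as exc:
    raised = True
    print(f"IndexError: {exc}")
```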

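Why does the warning appear for DataFrame[Series] but not for Series[Series]?

Because the DataFrame[Series] code path contains an explicit check that emits the warning, while the Series[Series] path aligns the key without warning: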

# internal for DataFrame.__getitem__
def __getitem__(self, key):
    # ...
    if isinstance(key, Series) and not key.index.equals(self.index):
        warnings.warn(
            "Boolean Series key will be reindexed to match DataFrame index.",
            UserWarning,
            stacklevel=find_stack_level(),
        )

# internal for Series.__getitem__
if com.is_bool_indexer(key):
    key = check_bool_indexer(self.index, key)
    key = np.asarray(key, dtype=bool)
    return self._get_rows_with_mask(key)
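
The difference between the two code paths can be observed from the outside by recording warnings. This is a small sketch based on the behavior reported in the question for pandas 2.2.2; other versions may differ. `collect_user_warnings` is a hypothetical helper written for this example, not a pandas function.

```python
import warnings

import pandas as pd

df_source = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]])
series_source = pd.Series([1, 2, 3, 4])
big_index = pd.Series([True, False, True, True, False, True])


def collect_user_warnings(action):
    """Run `action` and return the messages of any UserWarning it emits."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        action()
    return [str(w.message) for w in caught if issubclass(w.category, UserWarning)]


df_messages = collect_user_warnings(lambda: df_source[big_index])
series_messages = collect_user_warnings(lambda: series_source[big_index])

print(df_messages)      # expected to contain the "will be reindexed" message
print(series_messages)  # expected to be empty
```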
