Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

When applying `Series.clip` to ordered categorical data, I get the wrong result

Let’s say we have a series of ordered categorical data:

s = pd.Series(['zero','one','two','three','four','five','six'], dtype='category')
s = s.cat.reorder_categories(['zero','one','two','three','four','five','six'], ordered=True)
print(s)

Here’s an output of the code above:

0     zero
1      one
2      two
3    three
4     four
5     five
6      six
dtype: category
Categories (7, object): ['zero' < 'one' < 'two' < 'three' < 'four' < 'five' < 'six']

If I apply clip with only one of lower= or upper= parameter, I get the expected output. But if both of them are passed then the output is somewhat strange:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df = pd.concat([s, s.clip(lower='two'), s.clip(upper='four'), s.clip(lower='two', upper='four')], axis=1)
print(df)

Output, note the last column:

       0      1      2     3
0   zero    two   zero  four
1    one    two    one  four
2    two    two    two  four
3  three  three  three   two
4   four   four   four   two
5   five   five   four   two
6    six    six   four   two

I expected to see in the last column a series ['two','two','two','three','four','four',four'], not the one in the output above.

Also I noticed that replacing of the values for lower=..., upper=... doesn’t change the result. See:

pd.concat([s.clip(lower='two',upper='four'), s.clip(lower='four',upper='two')], axis=1)

with output:

      0     1
0  four  four
1  four  four
2  four  four
3   two   two
4   two   two
5   two   two
6   two   two

Why clipping of the ordered categorical data behaves like that?


Pandas 1.5.1
Numpy 1.23.4
Python 3.10.7

>Solution :

This looks like a bug. In the source there is a check that the upper/lower bounds are in the correct order. However I think this is mistakenly using the string values of the categories ('four' < 'two'):

   8095         # GH 2747 (arguments were reversed)
   8096         if (
   8097             lower is not None
   8098             and upper is not None
   8099             and is_scalar(lower)
   8100             and is_scalar(upper)
   8101         ):
   8102             lower, upper = min(lower, upper), max(lower, upper)  # <- HERE

Indeed using zfour in place of four:

s = pd.Series(['zero','one','two','three','zfour','five','six'], dtype='category')
s = s.cat.reorder_categories(['zero','one','two','three','zfour','five','six'], ordered=True)
s.clip(lower='two', upper='zfour')

output:

0      two
1      two
2      two
3    three
4    zfour
5    zfour
6    zfour
dtype: category
Categories (7, object): ['zero' < 'one' < 'two' < 'three' < 'zfour' < 'five' < 'six']
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading