Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Getting strange output when using group by apply with np.select function

I am working with a Timeseries data wherein I am trying to perform outlier detection using IQR method.

Sample Data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'datecol' : pd.date_range('2024-1-1', '2024-12-31'),
                   'val' : np.random.random.randin(low = 100, high = 5000, size = 8366})

my function:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

def is_outlier(x):
    iqr = x.quantile(.75) - x.quantile(.25)
    outlier = (x <= x.quantile(.25) - 1.5*iqr) | (x >= x.quantile(.75) + 1.5*iqr)
    return np.select([outlier], [1], 0)

df.groupby(df['datecol'].dt.weekday)['val'].apply(is_outlier)

to which the output is something like below:

0    [1,1,0,0,....
1    [1,0,0,0,....
2    [1,1,0,0,....
3    [1,0,1,0,....
4    [1,1,0,0,....
5    [1,1,0,0,....
6    [1,0,0,1,....

I am expecting a single series as output which I can add back to the original dataframe as a flag column.

Can someone please help me with this

>Solution :

You should use groupby.transform, not apply:

df['flag'] = df.groupby(df['datecol'].dt.weekday)['val'].transform(is_outlier)

Alternatively, explicitly return a Series and use group_keys=False:

def is_outlier(x):
    iqr = x.quantile(.75) - x.quantile(.25)
    outlier = (x <= x.quantile(.25) - 1.5*iqr) | (x >= x.quantile(.75) + 1.5*iqr)
    return pd.Series(np.where(outlier, 1, 0), index=x.index)

df['flag'] = (df.groupby(df['datecol'].dt.weekday, group_keys=False)
              ['val'].apply(is_outlier)
              )

Note that with a single condition, np.where should be preferred to np.select.

You could also use a vectorial approach with groupby.quantile:

wd = df['datecol'].dt.weekday
g = df.groupby(wd)['val']
q25 = g.quantile(.25)
q75 = g.quantile(.75)
iqr = wd.map(q75-q25)
df['flag'] = 1 - df['val'].between(wd.map(q25) - 1.5*iqr, wd.map(q75) + 1.5*iqr)

Output:

       datecol   val  flag
0   2024-01-01  3193     0
1   2024-01-02  1044     0
2   2024-01-03  2963     0
3   2024-01-04  4448     0
4   2024-01-05  1286     0
..         ...   ...   ...
361 2024-12-27  1531     0
362 2024-12-28  4565     0
363 2024-12-29  3396     0
364 2024-12-30  1870     0
365 2024-12-31  3818     0
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading