Why do these different outlier methods fail to detect outliers?

I am trying to find the outliers by group in my dataframe. I have two grouping columns, Group1 and Group2, and I am trying to find the best way to implement an outlier detection method.

import numpy as np
import pandas as pd

data = {
    'Group1': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'B',
               'B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A',
               'B', 'B', 'B', 'A', 'A', 'A', 'A', 'A', 'B', 'B'],
    'Group2': ['C', 'C', 'C', 'C', 'D', 'D', 'C', 'D', 'C', 'C',
               'D', 'C', 'C', 'D', 'D', 'C', 'C', 'D', 'D', 'D',
               'D', 'C', 'D', 'C', 'C', 'D', 'C', 'D', 'C', 'C'],
    'Age': [20, 21, 19, 24, 11, 15, 18, 1, 17, 23,
            35, 2000, 22, 24, 24, 18, 17, 19, 21, 22,
            20, 25, 18, 24, 17, 19, 16, 18, 25, 23],
}
df = pd.DataFrame(data)

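# Method 1: per-group mean and sample std broadcast back to every row
groups = df.groupby(['Group1', 'Group2'])
means = groups.Age.transform('mean')
stds = groups.Age.transform('std')

# Flag rows outside mean +/- 3 standard deviations of their group
df['Flag'] = ~df.Age.between(means - stds * 3, means + stds * 3)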

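# Method 2: the same rule as a function applied to each group
def flag_outlier(x):
    lower_limit = np.mean(x) - np.std(x) * 3
    upper_limit = np.mean(x) + np.std(x) * 3
    return (x > upper_limit) | (x < lower_limit)

# group_keys=False keeps the original row index so the result aligns with df
df['Flag2'] = df.groupby(['Group1', 'Group2'], group_keys=False)['Age'].apply(flag_outlier)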

# Method 3: the same rule written as a transform
df["Flag3"] = df.groupby(['Group1', 'Group2'])['Age'].transform(lambda x: (x - x.mean()).abs() > 3 * x.std())

However, all three methods fail to detect obvious outliers. For example, when Age is 2000, none of them treats it as an outlier. Is there a reason for this? Or is it possible that my code for all three outlier detection methods is incorrect?

I have a strong feeling I’ve made a foolish mistake somewhere but I’m not sure where, so any help would be appreciated, thanks!

Solution:

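Within its group (Group1 = 'A', Group2 = 'C', 9 rows), that age of 2000 just isn't more than 3 standard deviations away from the group mean. The group mean is 239.666667 and the group standard deviation is 660.129722, so the upper limit is about 2220.06, and 2000 sits only about 2.67 standard deviations above the mean.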
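To check that arithmetic, here is a minimal sketch that rebuilds only the group containing 2000 (the nine ages in the question's data where Group1 is 'A' and Group2 is 'C'). It uses the standard library's statistics module instead of pandas so it runs on its own; statistics.stdev divides by n - 1, the same convention as the pandas .std() default. The variable names are illustrative.

```python
import statistics

# Ages of the nine rows in the question's data where Group1 == 'A' and Group2 == 'C'
ages = [20, 21, 19, 18, 2000, 22, 24, 17, 16]

mean = statistics.mean(ages)    # group mean
std = statistics.stdev(ages)    # sample standard deviation (divides by n - 1)
upper_limit = mean + 3 * std    # upper bound of the 3-standard-deviation rule
z = (2000 - mean) / std         # distance of 2000 from the mean, in standard deviations

print(f"mean={mean:.6f}, std={std:.6f}")
print(f"upper limit={upper_limit:.2f}, z-score of 2000={z:.3f}")
print("flagged:", 2000 > upper_limit)  # False: 2000 is below the upper limit
```

The single extreme value pulls the mean up and widens the standard deviation enough that it ends up inside its own 3-standard-deviation band.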

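It might look like an obvious outlier to you, but the value itself inflates both the group mean and the group standard deviation, and with only 9 values in the group you don't have enough data to label it an outlier by that standard.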
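The "not enough data" point can be made precise. For a sample of n values, no value can lie more than (n - 1) / sqrt(n) sample standard deviations from the sample mean (a known result, Samuelson's inequality). The sketch below, using a hypothetical helper named max_z_score, suggests that a 3-standard-deviation rule cannot flag anything until a group has at least 11 rows. The four groups in the question's data have 9, 7, 8 and 6 rows.

```python
import math

def max_z_score(n: int) -> float:
    """Largest possible |x - mean| / sample_std for any value in a sample of size n."""
    return (n - 1) / math.sqrt(n)

# Upper bound on the z-score for a few group sizes
for n in (6, 7, 8, 9, 10, 11, 30):
    print(n, round(max_z_score(n), 3))

# Smallest group size for which a value can exceed 3 sample standard deviations
min_rows = next(n for n in range(2, 1000) if max_z_score(n) > 3)
print("minimum rows per group:", min_rows)
```

With groups this small, either a lower threshold or a statistic computed over more rows would be needed before any value could be flagged by this kind of rule.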
