Why do these different outlier methods fail to detect outliers?

March 7, 2023

I am trying to find the outliers by group in my dataframe. I have two grouping columns, Group1 and Group2, and I am trying to find the best way to implement an outlier detection method.

import numpy as np
import pandas as pd

data = {
    'Group1': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'A', 'B', 'B'],
    'Group2': ['C', 'C', 'C', 'C', 'D', 'D', 'C', 'D', 'C', 'C', 'D', 'C', 'C', 'D', 'D', 'C', 'C', 'D', 'D', 'D', 'D', 'C', 'D', 'C', 'C', 'D', 'C', 'D', 'C', 'C'],
    'Age': [20, 21, 19, 24, 11, 15, 18, 1, 17, 23, 35, 2000, 22, 24, 24, 18, 17, 19, 21, 22, 20, 25, 18, 24, 17, 19, 16, 18, 25, 23],
}
df = pd.DataFrame(data)

groups = df.groupby(['Group1', 'Group2'])
means = groups.Age.transform('mean')
stds = groups.Age.transform('std')

df['Flag'] = ~df.Age.between(means-stds*3, means+stds*3)

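def flag_outlier(x):
    # Note: np.std defaults to ddof=0 (population std), while pandas .std() uses ddof=1
    lower_limit = np.mean(x) - np.std(x) * 3
    upper_limit = np.mean(x) + np.std(x) * 3
    return (x > upper_limit) | (x < lower_limit)

# group_keys=False keeps the original row index so the result aligns with df
df['Flag2'] = df.groupby(['Group1', 'Group2'], group_keys=False)['Age'].apply(flag_outlier)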

df["Flag3"] = df.groupby(['Group1', 'Group2'])['Age'].transform(lambda x: (x - x.mean()).abs() > 3*x.std())

However, all 3 methods fail to detect obvious outliers – for example, when Age is 2000, none of these methods treat it as an outlier. Is there a reason for this? Or is it possible that my code for all three outlier detection models is incorrect?

I have a strong feeling I’ve made a foolish mistake somewhere but I’m not sure where, so any help would be appreciated, thanks!

Solution:

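Within its group (Group1 = 'A', Group2 = 'C'), that age of 2000 just isn’t more than 3 standard deviations away from the group mean. The group has 9 rows, with a mean of 239.666667 and a sample standard deviation of 660.129722, so 2000 sits only about 2.67 standard deviations above the mean.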
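To check those numbers yourself, here is a minimal sketch. It assumes only the nine Age values from the question's data that fall in the ('A', 'C') group, copied out by hand, rather than re-running the full groupby.

```python
import pandas as pd

# Ages of the 9 rows where Group1 == 'A' and Group2 == 'C' in the question's data
ages = pd.Series([20, 21, 19, 18, 2000, 22, 24, 17, 16])

mean = ages.mean()
std = ages.std()            # pandas uses the sample std (ddof=1) by default
z = (2000 - mean) / std     # distance of 2000 from the mean, in std units

print(f"mean={mean:.6f}  std={std:.6f}  z={z:.2f}")
print("flagged by a 3-std rule:", z > 3)
```

The last line prints False, which matches what all three flags in the question report for that row.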

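It might look like an obvious outlier to you, but you don’t have enough data to label it an outlier by that standard. The extreme value itself inflates both the mean and the standard deviation, and in a group of n values no point can lie more than (n - 1) / sqrt(n) sample standard deviations from the mean. For n = 9 that ceiling is about 2.67, so a 3-standard-deviation rule can never flag anything in this group, no matter how large the value is.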
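The small-sample limit can be illustrated with a short sketch. The helper name max_sample_z and the made-up arrays below are hypothetical, introduced only for this illustration; the bound (n - 1) / sqrt(n) applies to the sample standard deviation (ddof=1), which is what pandas uses by default.

```python
import numpy as np

def max_sample_z(n):
    """Largest possible |x - mean| / std (ddof=1) for any point in a sample of size n."""
    return (n - 1) / np.sqrt(n)

# Extreme case: eight identical values plus one arbitrarily large value.
# However large the last value gets, its z-score stays at the ceiling for n = 9.
for big in (2000, 2_000_000):
    x = np.array([20.0] * 8 + [big])
    z = (x[-1] - x.mean()) / x.std(ddof=1)
    print(big, round(z, 4), round(max_sample_z(9), 4))

# Smallest group size for which a 3-std rule could flag anything at all
smallest_n = next(n for n in range(2, 100) if max_sample_z(n) > 3)
print(smallest_n)  # 11
```

So with this rule a group needs at least 11 rows before any single value can be flagged, and the ('A', 'C') group has only 9.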