The solutions I found online only show removing outliers from the entire dataframe, not just a specific column. So I’m having trouble figuring out how to perform outlier removal on a single column.
I tried writing a function for this; the code is shown below.
def find_outlier(df, column):
    # Find first and third quartile
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    # Find interquartile range
    IQR = q3 - q1
    # Find lower and upper bound
    lower_bound = q1 - 1.5 * IQR
    upper_bound = q3 + 1.5 * IQR
    # Remove outliers
    df[column] = df[column][df[column] > lower_bound]
    df[column] = df[column][df[column] < upper_bound]
    return df
But when I ran the code, I got the error "Columns must be same length as key".
This is how I called the function:
    df['no_of_trainings'] = find_outlier(df, 'no_of_trainings')
Any help is appreciated.
>Solution :
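The error comes from the call rather than from the function body: find_outlier returns the whole DataFrame, and assigning a whole DataFrame to the single column df['no_of_trainings'] is what raises "Columns must be same length as key". Assign the result to df instead, i.e. df = find_outlier(df, 'no_of_trainings'). Inside the function, the comparison result is a boolean Series aligned on the index, so you can use it directly to filter the rows of the DataFrame: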
    df = df[df[column] > lower_bound]
    df = df[df[column] < upper_bound]
    return df
More concisely:
    ...
    return df[(df[column] > lower_bound) & (df[column] < upper_bound)]
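
Putting the pieces together, here is a minimal end-to-end sketch of the approach above. The function name remove_outliers, the k parameter, and the sample data are made up for illustration and are not from the question; only the 'no_of_trainings' column name is taken from it.

```python
import pandas as pd


def remove_outliers(df, column, k=1.5):
    """Return a copy of df without the rows whose `column` value
    falls outside the IQR fences (hypothetical helper name)."""
    # First and third quartile of the chosen column only
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    # Boolean Series aligned on df's index: True for rows to keep
    mask = (df[column] > lower_bound) & (df[column] < upper_bound)
    # .copy() so later edits to the result do not touch the original
    return df[mask].copy()


# Made-up sample data; the last row is an obvious outlier
df = pd.DataFrame({
    "employee_id": [1, 2, 3, 4, 5, 6],
    "no_of_trainings": [1, 2, 1, 2, 1, 10],
})

# Assign the result to a DataFrame, not to a single column
cleaned = remove_outliers(df, "no_of_trainings")
print(cleaned)
```

Note that the quartiles are computed from one column, but the mask drops entire rows, so the other columns stay consistent with the filtered one. The original df is left unchanged here; reassign to df if you want to overwrite it.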