Elimination of outliers with z-score method in Python

I am cleaning a dataset using the z-score method with a threshold of 3.
Below is the code I am using. As you can see, I first calculate the mean and the standard deviation. The code then loops over every value, computes its z-score and, if it is greater than or equal to 3, treats the value as an outlier and adds it to the list "outlier". Finally, the values in the outlier list are removed from the dataset.

# Standard deviation of MonthlyIncome
MonthlyIncome_std = df['MonthlyIncome'].std()
MonthlyIncome_std

# Mean of MonthlyIncome
MonthlyIncome_mean = df['MonthlyIncome'].mean()
MonthlyIncome_mean

threshold = 3
outlier = []
for i in df['MonthlyIncome']:
    z = (i - MonthlyIncome_mean) / MonthlyIncome_std
    if z >= threshold:
        outlier.append(i)

# Drop the rows whose value was flagged as an outlier
df = df[~df.MonthlyIncome.isin(outlier)]

The above code works fine, but I have to rewrite it for every numerical column.
I tried to create a function that does the same thing and can be reused for every numerical column. Here is the function, followed by the error I get when I call it:

def z_score_elimination(df):
    for col in df.columns:
        if df[col].dtypes == 'float64' or df[col].dtypes == 'int64':
            threshold = 3
            outlier = []
            col_mean = col.mean()
            col_std = col.std()
            z = (i-col_mean)/col_std
            if z >= threshold: 
                outlier.append(i) 
                df = df[~df.col.isin(outlier)]

AttributeError                            Traceback (most recent call last)
<ipython-input-62-4f8b1224061e> in <module>
----> 1 z_score_elimination(df)

<ipython-input-61-dc3c84b60dd1> in z_score_elimination(df)
      4             threshold = 3
      5             outlier = []
----> 6             col_mean = col.mean()
      7             col_std = col.std()
      8             z = (i-col_mean)/col_std

AttributeError: 'str' object has no attribute 'mean'

How can I fix the code?

Solution:

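When you iterate over df.columns, col is the column name (a string), not the column itself, which is why it has no mean attribute. Index the DataFrame with it to get the Series: col_mean = df[col].mean() and col_std = df[col].std()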
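Beyond that error, the function as posted has a few more problems that would surface next: i is never defined because the inner loop over the values is missing, df.col looks up a column literally named "col" rather than the one stored in the variable, and the function never returns the filtered DataFrame. Below is a minimal sketch of one way to put it all together. It is my own rewrite, not the only possible fix; the threshold parameter and the keep mask are names I introduced, and I am assuming that selecting columns with select_dtypes(include="number") is an acceptable replacement for the float64/int64 check.

```python
import pandas as pd


def z_score_elimination(df, threshold=3):
    """Return a copy of df without rows that hold a high outlier.

    A row is dropped if, in any numeric column, its z-score is >= threshold
    (one-sided, as in the question's code).
    """
    # Boolean mask: True means "keep this row" (hypothetical helper name).
    keep = pd.Series(True, index=df.index)

    # Assumption: every numeric dtype counts, not only float64 and int64.
    for col in df.select_dtypes(include="number").columns:
        col_mean = df[col].mean()
        col_std = df[col].std()

        # A constant or single-value column has no usable z-score.
        if pd.isna(col_std) or col_std == 0:
            continue

        # Vectorized z-score for the whole column, no Python loop needed.
        z = (df[col] - col_mean) / col_std

        # NaN z-scores compare as False, so missing values are kept.
        keep &= ~(z >= threshold)

    return df[keep]
```

You would call it as df = z_score_elimination(df). Two things to be aware of: every mean and standard deviation is computed on the original data, whereas filtering column by column would recompute them on an already reduced DataFrame, and only values far above the mean are removed. If you also want to drop values far below the mean, compare z.abs() with the threshold instead.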
