Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Removing outliers in a df containing mixed dtype

I am working on a pandas DataFrame containing numerical columns as well as string columns (dtype is object), and would like to remove the rows containing outliers with respect to the distributions within a column. In other words, detect the outliers in each column and drop the corresponding rows.

I have found two solutions to this, but neither takes into account that my df does not contain only numbers, hence they both result in errors (when encountering strings, I assume).

Way 1:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

from scipy import stats
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

returns TypeError: unsupported operand type(s) for /: 'str' and 'int'.
This is why I guess the error arises from the df having mixed dtypes.

Way 2:

for col in df.columns:
    lower = df[col].quantile(0.05)
    upper = df[col].quantile(0.95)
    df = df[col].clip(lower=lower, upper=upper)

returns KeyError with this traceback:

File omissis, in Class.remove_outliers(self, df)
    423 def remove_outliers(self, df):
    424     for col in df.columns:
--> 425         lower = df[col].quantile(0.05)
    426         upper = df[col].quantile(0.95)
    427         df = df[col].clip(lower=lower, upper=upper)

File omissis, in Series.__getitem__(self, key)
    955     return self._values[key]
    957 elif key_is_scalar:
--> 958     return self._get_value(key)
    960 if is_hashable(key):
    961     # Otherwise index.get_value will raise InvalidIndexError
    962     try:
    963         # For labels that don't resolve as scalars like tuples and frozensets

File omissis, in Series._get_value(self, label, takeable)
   1066     return self._values[label]
   1068 # Similar to Index.get_value, but we do not fall back to positional
-> 1069 loc = self.index.get_loc(label)
   1070 return self.index._get_values_for_loc(self, loc, label)

File omissis, in RangeIndex.get_loc(self, key, method, tolerance)
    387             raise KeyError(key) from err
    388     self._check_indexing_error(key)
--> 389     raise KeyError(key)
    390 return super().get_loc(key, method=method, tolerance=tolerance)

KeyError: 'colname'

How would you solve this?

EDIT: the idea is to skip the non numeric columns, to ignore them.

>Solution :

I would break the problem into stages:

Firstly, identify (numeric) columns you want to do the outlier removal.
Reference

newdf = df.select_dtypes(include=np.number)

Now perform whatever filtering/outlier removal you want on the rows of newdf. Afterwards, newdf should contain only rows you wish to retain.

Then keep only the rows of df those index are in newdf. Reference

df = df[df.index.isin(newdf.index)]
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading