I am working on a pandas DataFrame containing numerical columns as well as string columns (dtype is object), and would like to remove the rows containing outliers with respect to the distributions within a column. In other words, detect the outliers in each column and drop the corresponding rows.
I have found two solutions to this, but neither takes into account that my df does not contain only numbers, hence they both result in errors (when encountering strings, I assume).
from scipy import stats
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
returns TypeError: unsupported operand type(s) for /: 'str' and 'int'.
This is why I guess the error arises from the df having mixed dtypes.
for col in df.columns:
lower = df[col].quantile(0.05)
upper = df[col].quantile(0.95)
df = df[col].clip(lower=lower, upper=upper)
returns KeyError with this traceback:
File omissis, in Class.remove_outliers(self, df)
423 def remove_outliers(self, df):
424 for col in df.columns:
--> 425 lower = df[col].quantile(0.05)
426 upper = df[col].quantile(0.95)
427 df = df[col].clip(lower=lower, upper=upper)
File omissis, in Series.__getitem__(self, key)
955 return self._values[key]
957 elif key_is_scalar:
--> 958 return self._get_value(key)
960 if is_hashable(key):
961 # Otherwise index.get_value will raise InvalidIndexError
962 try:
963 # For labels that don't resolve as scalars like tuples and frozensets
File omissis, in Series._get_value(self, label, takeable)
1066 return self._values[label]
1068 # Similar to Index.get_value, but we do not fall back to positional
-> 1069 loc = self.index.get_loc(label)
1070 return self.index._get_values_for_loc(self, loc, label)
File omissis, in RangeIndex.get_loc(self, key, method, tolerance)
387 raise KeyError(key) from err
388 self._check_indexing_error(key)
--> 389 raise KeyError(key)
390 return super().get_loc(key, method=method, tolerance=tolerance)
KeyError: 'colname'
How would you solve this?
EDIT: the idea is to skip the non numeric columns, to ignore them.
>Solution :
I would break the problem into stages:
Firstly, identify (numeric) columns you want to do the outlier removal.
Reference
newdf = df.select_dtypes(include=np.number)
Now perform whatever filtering/outlier removal you want on the rows of newdf. Afterwards, newdf should contain only rows you wish to retain.
Then keep only the rows of df those index are in newdf. Reference
df = df[df.index.isin(newdf.index)]