pandas categorical remove categories from multiple columns

I have many categorical columns like:

import pandas as pd

df = pd.DataFrame(
    data={
        "id": [1, 2, 3, 4],
        "category1": [" ",
                      "data",
                      "more data",
                      "         "],
        "category2": ["   ", "more data", " ", "and more"],
    }
)
df["category1"] = df["category1"].astype("category")
df["category2"] = df["category2"].astype("category")

I want to remove any levels of the categorical columns that consist only of whitespace, while ensuring the columns remain categorical (in other words, I can’t use .str). I have tried:

cat_cols = df.select_dtypes("category").columns
for c in cat_cols:
    levels = [level for level in df[c].cat.categories.values.tolist()
              if level.isspace()]
    df[c] = df[c].cat.remove_categories(levels)

This works, so I tried to make it faster and neater with a list comprehension:

df[cat_cols] = [df[c].cat.remove_categories(
                [level for level in df[c].cat.categories.values.tolist()
                if level.isspace()])
                for c in cat_cols]

This raises "ValueError: Columns must be same length as key".

Note: I don’t want to use the inplace parameter in the list comprehension, because it is going to be deprecated for pd.Categorical.

I feel like I might be missing something basic here, but how do I do this with a list comprehension and without inplace?

Solution:
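
A likely reason for the error, shown as a minimal sketch: a plain list of Series is read as a list of rows, not a list of columns. With two categorical columns and four rows, pandas ends up with four "columns" of values for two column keys. The names `cleaned`, `as_rows` and `error_message` are illustrative only, and the exact internal code path may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "category1": pd.Categorical([" ", "data", "more data", "   "]),
    "category2": pd.Categorical(["   ", "more data", " ", "and more"]),
})
cat_cols = df.select_dtypes("category").columns

# A plain list of Series: two items, each of length four.
cleaned = [
    df[c].cat.remove_categories(
        [level for level in df[c].cat.categories if level.isspace()]
    )
    for c in cat_cols
]

# The DataFrame constructor treats each list item as a ROW,
# so this is 2 rows x 4 columns rather than 4 rows x 2 columns.
as_rows = pd.DataFrame(cleaned)
print(as_rows.shape)

# Assigning the list to two column keys therefore fails.
error_message = None
try:
    df[cat_cols] = cleaned
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Wrapping the Series in a dict or passing them to concat with axis=1 tells pandas that each Series is a column, which is what the approaches below do.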

You can use a dictionary comprehension with the DataFrame constructor:

df[cat_cols] = pd.DataFrame({c: df[c].cat.remove_categories(
                [level for level in df[c].cat.categories.values.tolist()
                if level.isspace()])
                for c in cat_cols})

print(df)
   id  category1  category2
0   1        NaN        NaN
1   2       data  more data
2   3  more data        NaN
3   4        NaN   and more

Or use pd.concat with axis=1:

df[cat_cols] = pd.concat([df[c].cat.remove_categories(
                [level for level in df[c].cat.categories.values.tolist()
                if level.isspace()])
                for c in cat_cols], axis=1)

print(df)
   id  category1  category2
0   1        NaN        NaN
1   2       data  more data
2   3  more data        NaN
3   4        NaN   and more
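
As a further option, here is a sketch that avoids building an intermediate DataFrame: DataFrame.assign accepts the same kind of dict comprehension as keyword arguments. The helper `drop_blank_levels` is a hypothetical name introduced here, not part of pandas, and the sketch assumes the column names are valid Python keyword names. It also checks that the columns are still categorical afterwards.

```python
import pandas as pd


def drop_blank_levels(s: pd.Series) -> pd.Series:
    """Hypothetical helper: drop categories made up only of whitespace."""
    blank = [
        level for level in s.cat.categories
        if isinstance(level, str) and level.isspace()
    ]
    # Values that belonged to a removed category become NaN.
    return s.cat.remove_categories(blank)


df = pd.DataFrame(
    data={
        "id": [1, 2, 3, 4],
        "category1": [" ", "data", "more data", "         "],
        "category2": ["   ", "more data", " ", "and more"],
    }
).astype({"category1": "category", "category2": "category"})

cat_cols = df.select_dtypes("category").columns

# assign returns a new DataFrame; each keyword replaces one column.
df = df.assign(**{c: drop_blank_levels(df[c]) for c in cat_cols})

print(df.dtypes)
print(df["category1"].cat.categories.tolist())
print(df["category2"].cat.categories.tolist())
```

Because each column is replaced by a categorical Series, the dtype stays category and no inplace argument is needed.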