How to apply a one-hot encoder over vectorized dataframe columns?

Suppose that we have this data frame:

ID  CATEGORIES
0   ['A']
1   ['A', 'C']
2   ['B', 'C']

I want to one-hot encode the CATEGORIES column. The result I want is:

ID A B C
0 1 0 0
1 1 0 1
2 0 1 1

I know this can easily be coded by hand. I just want to know whether it is already implemented in some package, since coding it in pure Python will probably result in a fairly slow function.

(I had to put the tables in code blocks because Stack Overflow would not let me post them as tables.)

Solution:

You can use str.join combined with str.get_dummies:

out = df[['ID']].join(df['CATEGORIES'].str.join('|').str.get_dummies())

Output:

   ID  A  B  C
0   0  1  0  0
1   1  1  0  1
2   2  0  1  1
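
To see why this works, it may help to look at the intermediate values. The sketch below assumes the same df as in this answer and breaks the one-liner into steps; the names joined and dummies are only illustrative and do not appear in the original answer.

```python
import pandas as pd

# Same input as in the answer
df = pd.DataFrame({'ID': [0, 1, 2],
                   'CATEGORIES': [['A'], ['A', 'C'], ['B', 'C']]})

# Step 1: join each list into one '|'-delimited string, e.g. ['A', 'C'] -> 'A|C'
joined = df['CATEGORIES'].str.join('|')

# Step 2: split each string on '|' (the default separator of str.get_dummies)
# and build one 0/1 indicator column per distinct label
dummies = joined.str.get_dummies()

# Step 3: attach the indicator columns to the ID column (joined on the index)
out = df[['ID']].join(dummies)
print(out)
```

One caveat: if a category label could itself contain '|', it would be split into two labels. In that case, a different separator can be passed to both calls, for example str.join(';') together with str.get_dummies(sep=';').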

Input used:

import pandas as pd
df = pd.DataFrame({'ID': [0, 1, 2],
                   'CATEGORIES': [['A'], ['A', 'C'], ['B', 'C']]})

There are many other alternatives, using pivot, crosstab, etc.

One example:

df2 = df.explode('CATEGORIES')

out = pd.crosstab(df2['ID'], df2['CATEGORIES']).reset_index()
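
One difference worth knowing: crosstab counts occurrences, so if a list repeats a label, the cell holds a count greater than 1 instead of a 0/1 flag. The sketch below uses a hypothetical input (not from the question) in which row 1 repeats 'A', and caps the counts at 1 with clip; this is one possible way to get indicators back, not the only one.

```python
import pandas as pd

# Hypothetical input: like the question's data, but row 1 repeats 'A'
df = pd.DataFrame({'ID': [0, 1, 2],
                   'CATEGORIES': [['A'], ['A', 'A', 'C'], ['B', 'C']]})

# One row per (ID, label) pair
df2 = df.explode('CATEGORIES')

# crosstab counts how often each label occurs for each ID
counts = pd.crosstab(df2['ID'], df2['CATEGORIES'])

# Cap the counts at 1 to turn them into 0/1 indicators, then restore ID as a column
out = counts.clip(upper=1).reset_index()
print(out)
```

The str.get_dummies approach shown earlier does not need this step, since it only records whether a label is present.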