Assume, there is one DataFrame such as following
import pandas as pd
import numpy as np
df = pd.DataFrame({'id':range(1,4),
'items':[['A', 'B'], ['A', 'B', 'C'], ['A', 'C']]})
df
id items
1 [A, B]
2 [A, B, C]
3 [A, C]
Is there an efficient way to convert above DataFrame into the following (one-hot encoded columns)? Many Thanks in advance!
id items A B C
1 [A, B] 1 1 0
2 [A, B, C] 1 1 1
3 [A, C] 1 0 1
>Solution :
Another possible solution, whose steps are:
-
First, the
explodefunction is used to transform each item of a list-like to a row, replicating the index values. -
Then, the
pivot_tablefunction is applied to reshape the data based on the unique values in theitemscolumn, aggregating the count of eachidfor everyitem. Thefill_value=0ensures that any missing combinations are filled with zeros. -
The
rename_axismethod is used to remove the axis name for the columns. -
Finally,
reset_indexis called to reset the index of the dataframe, turning the index into a column. -
The original dataframe
dfis then merged with this transformed dataframe using themergefunction.
df.merge(
df.explode('items')
.pivot_table(index='id', columns='items', values='id', aggfunc=len,
fill_value=0)
.rename_axis(None, axis=1).reset_index())
Output:
id items A B C
0 1 [A, B] 1 1 0
1 2 [A, B, C] 1 1 1
2 3 [A, C] 1 0 1