Split strings into different columns not working correctly

May 10, 2022

I am working with a large dataset with a column for reviews which is comprised of a series of strings for example: "A,B,C" , "A,B*,B" etc..

for example,

import pandas as pd

df=pd.DataFrame({'cat1':[1,2,3,4,5],
                 'review':['A,B,C', 'A,B*,B,C', 'A,C', 'A,B,C,D', 'A,B,C,A,B']})
df2 = df["review"].str.split(",",expand = True)
df.join(df2)

I want to split that column up into separate columns for each letter, then add those columns into the original data frame. I used df2 = df["review"].str.split(",",expand = True) and df.join(df2) to do that.

However, when i use df["A"].unique() there are entries that should not be in the column. I only want ‘A’ to appear there, but there is also B and C. Also, B and B* are not splitting into two columns.

My dataset is quite large so I don’t know how to properly illustrate this problem, I have tried to provide a small scale example, however, everything seems to be working correctly in this example;

I have tried to look through the original column with df['review'].unique() and all entries were entered correctly (no missing commas or anything like that), so I was wondering if there is something wrong with my approach that would influence it to not work correctly across all datasets. Or is there something wrong with my dataset.

Does anyone have any suggestions as to how I should troubleshoot?

>Solution :

when i use df["A"].unique() there are entries that should not be in the column. I only want ‘A’ to appear there

IIUC, you wanted to create dummy variables instead?

df2 = df.join(df['review'].str.get_dummies(sep=',').pipe(lambda x: x*[*x]).replace('',float('nan')))

Output:

   cat1     review  A    B   B*  C    D
0     1      A,B,C  A    B  NaN  C  NaN
1     2   A,B*,B,C  A    B   B*  C  NaN
2     3        A,C  A  NaN  NaN  C  NaN
3     4    A,B,C,D  A    B  NaN  C    D
4     5  A,B,C,A,B  A    B  NaN  C  NaN