Split strings into different columns not working correctly

I am working with a large dataset that has a review column made up of comma-separated strings such as "A,B,C" or "A,B*,B".

For example:

import pandas as pd

df = pd.DataFrame({'cat1': [1, 2, 3, 4, 5],
                   'review': ['A,B,C', 'A,B*,B,C', 'A,C', 'A,B,C,D', 'A,B,C,A,B']})
df2 = df["review"].str.split(",", expand=True)
df.join(df2)

I want to split that column into separate columns, one per letter, and then add those columns to the original DataFrame. To do that I used df2 = df["review"].str.split(",", expand=True) followed by df.join(df2).
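To make the behaviour of this approach concrete, here is a minimal sketch on the sample data above. It only inspects what `str.split(..., expand=True)` returns; the names `parts`, `joined`, and `second_tokens` are introduced here for illustration and are not from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    "cat1": [1, 2, 3, 4, 5],
    "review": ["A,B,C", "A,B*,B,C", "A,C", "A,B,C,D", "A,B,C,A,B"],
})

# expand=True yields one column per token position, labelled 0, 1, 2, ...
parts = df["review"].str.split(",", expand=True)
joined = df.join(parts)

# Column 1 holds whatever token came second in each row,
# so it can contain several different letters.
second_tokens = sorted(parts[1].unique())

print(list(parts.columns))
print(second_tokens)
```

In this sketch the new columns are keyed by position rather than by letter, which is one plausible reason a column can end up holding more than one distinct value.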

However, when I call df["A"].unique(), the column contains entries that should not be there. I only want 'A' to appear in it, but B and C show up as well. Also, B and B* are not being split into two separate columns.

My dataset is quite large, so I am not sure how to illustrate the problem properly. I have tried to provide a small-scale example, but everything seems to work correctly in it.

I have looked through the original column with df['review'].unique(), and all entries were entered correctly (no missing commas or anything like that). So I am wondering whether something in my approach prevents it from working correctly across all datasets, or whether something is wrong with my dataset.
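One check that may help with a visual inspection like this is to look at the individual tokens rather than the whole strings, since stray whitespace is easy to miss. The sketch below assumes hidden whitespace is a possible cause, which the post does not confirm. The sample `reviews` Series is hypothetical and deliberately contains padded tokens.

```python
import pandas as pd

# Hypothetical sample: two tokens carry a stray space.
reviews = pd.Series(["A,B,C", "A, B*,B", "A,C "], name="review")

# One row per token, keeping the original row index.
tokens = reviews.str.split(",").explode()

# Tokens that change when stripped carry leading or trailing whitespace.
padded = tokens[tokens != tokens.str.strip()]

# Distinct tokens exactly as pandas sees them.
distinct = sorted(tokens.unique())

print(padded.to_dict())
print(distinct)
```

If `padded` is empty on the real data, whitespace is probably not the issue and the distinct token list is the next thing worth reading through.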

Does anyone have any suggestions as to how I should troubleshoot?

Solution:

when I use df["A"].unique() there are entries that should not be in the column. I only want 'A' to appear there

If I understand correctly, you want dummy-style columns, one per distinct letter, rather than one column per position?

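# 0/1 indicators per distinct token; multiplying by the column names turns 1 into the label and 0 into '', which becomes NaN
df2 = df.join(df['review'].str.get_dummies(sep=',').pipe(lambda x: x * [*x]).replace('', float('nan')))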

Output:

   cat1     review  A    B   B*  C    D
0     1      A,B,C  A    B  NaN  C  NaN
1     2   A,B*,B,C  A    B   B*  C  NaN
2     3        A,C  A  NaN  NaN  C  NaN
3     4    A,B,C,D  A    B  NaN  C    D
4     5  A,B,C,A,B  A    B  NaN  C  NaN
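The one-liner above relies on multiplying integers by strings. As a more explicit sketch of the same idea, each indicator column can be mapped to its own label. The names `dummies`, `labels`, and `result` are illustrative and not from the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cat1": [1, 2, 3, 4, 5],
    "review": ["A,B,C", "A,B*,B,C", "A,C", "A,B,C,D", "A,B,C,A,B"],
})

# One 0/1 indicator column per distinct token in the review strings.
dummies = df["review"].str.get_dummies(sep=",")

# Replace each 1 with the column's own label and each 0 with NaN.
labels = pd.DataFrame(
    {col: dummies[col].map({1: col, 0: np.nan}) for col in dummies.columns}
)

result = df.join(labels)
print(result)
```

With this layout every column is keyed by a letter, so a lookup such as `result["A"]` can only hold that letter or a missing value.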