Duplicate substring removal from list

I have a dataframe with a product_type column that has duplicate substrings within strings:


tote bag,bag


I’m using this line to create a new column, "unique_type", with the duplicate substrings removed:

df_1['unique_type'] = [set(sub.split(',')) for sub in df_1["product_type"]]

This is what the new dataframe looks like

current output

product_type         unique_type
bag,bag              {'bag'}
tote bag, bag        {'tote bag', 'bag'}
handbag, handbag     {'handbag'}

The problem is that the values in the new column unique_type are sets, so they are displayed with curly brackets and quotation marks. I would like a column of plain strings without curly brackets and quotation marks, like so:

desired output

product_type         unique_type
bag,bag              bag
tote bag, bag        tote bag, bag
handbag, handbag     handbag

>Solution :

Join the items of each set back into a single string with ', '.join:

df_1['unique_type'] = [', '.join(set(sub.split(','))) for sub in df_1["product_type"]]

Or, if you need to keep the original order of the values, use the dict.fromkeys trick:

df_1['unique_type1'] = [', '.join(dict.fromkeys(sub.split(',')))
                        for sub in df_1["product_type"]]

print(df_1)
      product_type    unique_type   unique_type1
0          bag,bag            bag            bag
1     tote bag,bag  bag, tote bag  tote bag, bag
3  handbag,handbag        handbag        handbag
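
Note that the sample rows in the question contain a space after the comma ("tote bag, bag"), while the output above uses values without that space. With a plain split(','), "bag" and " bag" would count as different items. The following is a minimal sketch of one way to handle that, assuming surrounding whitespace should be ignored; the helper name dedupe_substrings and its sep parameter are made up for illustration and are not part of pandas:

```python
def dedupe_substrings(value, sep=","):
    """Remove duplicate separated items from a string, keeping first-seen order."""
    # Strip whitespace so that "bag" and " bag" are treated as the same item.
    parts = [part.strip() for part in value.split(sep)]
    # dict.fromkeys keeps one key per distinct item, in insertion order
    # (guaranteed for dicts since Python 3.7). Empty items are skipped.
    unique_parts = dict.fromkeys(part for part in parts if part)
    return ", ".join(unique_parts)


# Sample values taken from the question, including the space after the comma.
rows = ["bag,bag", "tote bag, bag", "handbag, handbag"]
results = [dedupe_substrings(row) for row in rows]
print(results)  # ['bag', 'tote bag, bag', 'handbag']
```

To apply it to the dataframe, something like df_1['unique_type'] = df_1['product_type'].map(dedupe_substrings) should work, assuming every value in product_type is a string.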
