How to reduce dataframe column strings into core variations?

I’ve got a dataframe column which represents the order in which fruit was bought at a supermarket. The dataframe looks something like this:

import pandas as pd

mydict = {
    'customer': ['Jack', 'Danny', 'Alex'],
    'fruit_bought': ['apple#orange#apple', 'orange#apple', 'apple#banana#banana'],
}

df = pd.DataFrame(mydict)

customer | fruit_bought
-----------------------------
Jack     | apple#orange#apple
Danny    | orange#apple
Alex     | apple#banana#banana

What I’d like to do is reduce the strings into the combination of unique fruit that the customer bought, which would look like this:

customer | fruit_bought
---------------------
Jack     | apple#orange
Danny    | apple#orange
Alex     | apple#banana

I’m sure I can put together a long-winded apply function to help with this, but I’m looking at 200,000 rows of data so I’d rather avoid using apply here in favour of a vectorized approach. Can anyone please help me with this?

Solution:

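You can use map. Split each string on '#', map each resulting list to a set to drop the duplicates, sort it, and join it back together. The sort matters: a bare set has no guaranteed iteration order, so without it 'orange#apple' and 'apple#orange' are not certain to reduce to the same string.

>>> df = pd.DataFrame(mydict)
>>> df
  customer         fruit_bought
0     Jack   apple#orange#apple
1    Danny         orange#apple
2     Alex  apple#banana#banana
>>> df['Unique'] = df.fruit_bought.str.split('#').map(lambda fruits: sorted(set(fruits))).str.join('#')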
>>> df
  customer         fruit_bought        Unique
0     Jack   apple#orange#apple  apple#orange
1    Danny         orange#apple  apple#orange
2     Alex  apple#banana#banana  apple#banana
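
Neither map nor the .str accessor is truly vectorized for object columns, since both still loop over the rows in Python. A plain list comprehension is therefore a reasonable alternative to try. The sketch below is only an illustration: the helper name unique_fruit is hypothetical, and it assumes the column has no missing values (a NaN would fail on .split). Sorting the set gives every customer the same canonical order.

```python
import pandas as pd

df = pd.DataFrame({
    'customer': ['Jack', 'Danny', 'Alex'],
    'fruit_bought': ['apple#orange#apple', 'orange#apple', 'apple#banana#banana'],
})

def unique_fruit(value, sep='#'):
    """Hypothetical helper: reduce 'a#b#a' to its sorted unique parts, 'a#b'."""
    # set() drops duplicates; sorted() makes the order deterministic
    return sep.join(sorted(set(value.split(sep))))

# One pass over the column in plain Python; assumes no missing values
df['Unique'] = [unique_fruit(value) for value in df['fruit_bought']]

print(df['Unique'].tolist())
# ['apple#orange', 'apple#orange', 'apple#banana']
```

Which variant is faster on 200,000 rows depends on the data, so it is worth timing both on a sample before committing to one.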
