How to one hot encode the products of an unorganised market basket dataframe

The dataframe I am talking about is this:

[image: screenshot of the dataframe]

I am interested in only a subset of the products, and I want to transform the data so that, instead of the "item" columns, I have one column per product of interest, with a value of 0 or 1 indicating whether that product is in the basket. What I have done so far is this:

data_list = []
products = ['citrus fruit', 'tropical fruit', 'whole milk', 'other vegetables', 'rolls/buns', 'chocolate', 'bottled water', 'yogurt',
'sausage', 'root vegetables', 'pastry', 'soda', 'cream']
for i in range(raw_data.shape[0]):
    entry = []
    # discretize basket_value
    if(raw_data.values[i, 1] < 5):
        entry.append('low_value_basket')
    elif(raw_data.values[i, 1] < 10):
        entry.append('medium_value_basket')
    else:
        entry.append('high_value_basket')
    # add recency_days
    entry.append(raw_data.values[i, 2])
    # add columns for the products specified with values of 1 (indicating the product is in the basket) and 0 (indicating it is not)
    flag = False # flag used to determine if any of the specified products is in the basket
    for j in range(len(products)):
        value = 1 if products[j] in raw_data.values[i, 3:] else 0
        entry.append(value)
        if(value == 1) :
            flag = True
    if(flag):
        data_list.append(entry)

However, this takes several minutes to run (raw_data has around 7,500 entries), and I was wondering whether there is a faster way of doing it.

>Solution :

Use:

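import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy baskets; rows shorter than four items are padded with None
df = pd.DataFrame([['ab', 'bv', 'cc'], ['cc'], ['dv', 'ab', 'ac', 'ff']], columns=['i1', 'i2', 'i3', 'i4'])
# Join each row into one space-separated string (missing cells become 'None')
corpus = df.apply(lambda x: ' '.join(x.to_numpy().astype(str)), axis=1).values

# min_df=1 keeps every item (recent scikit-learn versions reject an integer min_df of 0)
vectorizer = TfidfVectorizer(min_df=1, use_idf=False)
X = vectorizer.fit_transform(corpus)
# Any non-zero weight means the item is in the basket
temp = X.toarray() > 0
temp.astype(int)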

Output:

[image: the resulting 0/1 array, one row per basket and one column per item]

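This is a bag-of-items representation. Printing the feature names makes it clearer (get_feature_names() was removed in recent scikit-learn versions, so get_feature_names_out() is used here):

list(vectorizer.get_feature_names_out())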

Output:

['ab', 'ac', 'bv', 'cc', 'dv', 'ff', 'none']
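
A note on the 'none' entry, and a caveat for real product names. As far as I know, TfidfVectorizer lowercases its input by default and tokenizes with the pattern r"(?u)\b\w\w+\b", i.e. runs of two or more word characters. That would explain why the padded None cells show up as 'none', and it also suggests that names containing spaces or slashes, such as 'citrus fruit' or 'rolls/buns' from the question, would be split into separate tokens. The sketch below applies that pattern by hand with the standard re module, so it does not need scikit-learn; the sample row is made up for illustration.

```python
import re

# Assumed default token pattern of TfidfVectorizer: runs of 2+ word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# A made-up basket row, joined with spaces the same way the corpus is built.
row = "citrus fruit rolls/buns None"

# Lowercase first (the vectorizer's default), then tokenize.
tokens = re.findall(DEFAULT_TOKEN_PATTERN, row.lower())
print(tokens)  # ['citrus', 'fruit', 'rolls', 'buns', 'none']
```

If product names need to stay whole, one option is to pass your own tokenizer or token_pattern to the vectorizer; another is to encode the item columns directly with pandas.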

We can see that the 'ab' item is present in the first basket but not in the second, and so on. Based on the data provided, here is the answer rewritten for your file:

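import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv('GroceriesInitial.csv')
# Keep only the item columns
df = df.loc[:, [x for x in df.columns if 'Item' in x]]
# Join each basket into one space-separated string (empty cells become 'nan')
corpus = df.apply(lambda x: ' '.join(x.to_numpy().astype(str)), axis=1).values

# min_df=1 keeps every item (recent scikit-learn versions reject an integer min_df of 0)
vectorizer = TfidfVectorizer(min_df=1, use_idf=False)
X = vectorizer.fit_transform(corpus)
# Any non-zero weight means the item is in the basket
temp = X.toarray() > 0
temp.astype(int)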

Output:

[image: the resulting 0/1 array for the grocery data]

and corresponding items:

[image: the list of feature names for the grocery data]
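
Since the question asks for only a subset of products, plus the discretized basket value and recency_days, a pandas-only alternative may be closer to the goal. This is a minimal sketch, not the method used in the answer above: the column names basket_value, recency_days and Item_1 to Item_3, as well as the toy rows, are assumptions standing in for the real file, and the thresholds are copied from the loop in the question.

```python
import pandas as pd

# Toy stand-in for the real file; every column name here is an assumption.
raw_data = pd.DataFrame({
    "basket_id": [1, 2, 3, 4],
    "basket_value": [3.2, 7.5, 12.0, 4.0],
    "recency_days": [10, 3, 25, 8],
    "Item_1": ["citrus fruit", "whole milk", "soda", "beef"],
    "Item_2": ["whole milk", None, "rolls/buns", "ham"],
    "Item_3": [None, None, "yogurt", None],
})
products = ["citrus fruit", "whole milk", "rolls/buns", "soda", "yogurt"]

item_cols = [c for c in raw_data.columns if "Item" in c]

# For each product: compare all item columns at once, then reduce per row.
onehot = pd.DataFrame(
    {p: raw_data[item_cols].eq(p).any(axis=1).astype(int) for p in products}
)

# Same thresholds as the loop: < 5 is low, < 10 is medium, otherwise high.
value_bin = pd.cut(
    raw_data["basket_value"],
    bins=[-float("inf"), 5, 10, float("inf")],
    right=False,
    labels=["low_value_basket", "medium_value_basket", "high_value_basket"],
)

result = pd.concat([value_bin, raw_data["recency_days"], onehot], axis=1)

# Keep only baskets that contain at least one product of interest.
result = result[onehot.any(axis=1)]
print(result)
```

Because the comparison is done on whole cell values, multi-word names such as 'citrus fruit' stay intact, and there is no Python-level loop over the rows.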
