Get a particular text from a column in a dataframe

August 30, 2022

I have a dataset where I have a dataframe[‘Title’] column with car brand and accessories information.

I want to 2 new columns dataframe[‘Brand’] and dataframe[‘model’], where I want to get the brand name of the vehicle and the model of the vehicle. Like Mahindra brand and XUV300 as per the first record. and want brand –> universal and model –> NaN if the record is like the second entry –> Blaupunkt Colombo 130 BT.

What I tried:-

brand = []
for i in vehicle_make:
    for j in range(len(df2['Title'])):
        text = df2['Title'][j].lstrip().lower()
        print(text)
        if i in text:
            df2['brand'][j] = i
            print("yes")
        else:
            df2['brand'][j] = 'Unversal'
            print('No')

where vehicle_make contains brand names.

['ford',
 'honda',
 'hyundai',
 'Kia',
 'mahindra',
 'maruti',
 'mg',
 'nissan',
 'renault',
 'skoda',
 'tata',
 'toyota',
 'volkswagen']

which I scrapped from the same website.

The above code is working but it’s not picking all the values.

mahindra xuv 300 led door foot step sill plate mirror finish black glossy
No
blaupunkt colombo 130 bt
No
nissan terrano mud flap /mud guard
No
mg hector plus 4d boss leatherite car floor mat black( without grass mat)
No
ford endeavour body protection waterproof car cover (grey)
yes
starid wiper blade framless for volkswagen polo (size 24' and 16'' ) black
No
mahindra tuv300 rain door visor without chrome line (2015-2017)
No

This is the output I am getting. What is wrong here?

>Solution :

This answer assumes that all text following a brand name would correspond to the model. We can form a regex alternation of brand names, and then use str.extract with this alternation.

brands = ['ford', 'honda', 'hyundai', 'Kia', 'mahindra', 'maruti', 'mg', 'nissan', 'renault', 'skoda', 'tata', 'toyota', 'volkswagen']
regex = r'\b(' + r'|'.join(brands) + r') (.*)'
df2[["Title", "Brand"]] = df2["Title"].str.extract(regex, flags=re.I)