Keyword categorization from strings in a new column in pandas

June 19, 2022

This is not the best approach but this what I did so far:

I have this example df:

df = pd.DataFrame({
                  'City': ['I lived Los Angeles', 'I visited London and Toronto','the best one is Toronto', 'business hub is in New York',' Mexico city is stunning']
                   })

df

gives:

    City
0   I lived Los Angeles
1   I visited London and Toronto
2   the best one is Toronto
3   business hub is in New York
4   Mexico city is stunning

I am trying to match (case insensitive) city names from a nested dic and create a new column with the country name with int values for statistical purposes.

So, here is my nested dic as a reference for countries and cities:

country = { 'US': ['New York','Los Angeles','San Diego'],
         'CA': ['Montreal','Toronto','Manitoba'],
         'UK': ['London','Liverpool','Manchester']      
       }

and I created a function that should look for the city from the df and match it with the dic, then create a column with the country name:

def get_country(x):
    count = 0
    for k,v in country.items():
        for y in v:
            if y.lower() in x:
                df[k] = count + 1
            else:
                return None

then applied it to df:

df.City.apply(lambda x: get_country(x.lower()))

I got the following output:

    City                              US
0   I lived Los Angeles               1
1   I visited London and Toronto      1
2   the best one is Toronto           1
3   business hub is in New York       1
4   Mexico city is stunning           1

Expected output:

    City                              US    CA   UK
0   I lived Los Angeles               1      0    0
1   I visited London and Toronto      0      1    1
2   the best one is Toronto           0      1    0
3   business hub is in New York       1      0    0
4   Mexico city is stunning           0      0    0

>Solution :

You can achieve this result using a lambda function to check if any city for each country is contained in the string, after first lower-casing the city names in country:

cl = { k : list(map(str.lower, v)) for k, v in country.items() }
for ctry, cities in cl.items():
    df[ctry] = df['City'].apply(lambda s:any(c in s.lower() for c in cities)).astype(int)

Output:

                           City  US  CA  UK
0           I lived Los Angeles   1   0   0
1  I visited London and Toronto   0   1   1
2       the best one is Toronto   0   1   0
3   business hub is in New York   1   0   0
4       Mexico city is stunning   0   0   0