Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Keyword categorization from strings in a new column in pandas

This is not the best approach but this what I did so far:

I have this example df:

df = pd.DataFrame({
                  'City': ['I lived Los Angeles', 'I visited London and Toronto','the best one is Toronto', 'business hub is in New York',' Mexico city is stunning']
                   })

df

gives:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

    City
0   I lived Los Angeles
1   I visited London and Toronto
2   the best one is Toronto
3   business hub is in New York
4   Mexico city is stunning

I am trying to match (case insensitive) city names from a nested dic and create a new column with the country name with int values for statistical purposes.

So, here is my nested dic as a reference for countries and cities:

country = { 'US': ['New York','Los Angeles','San Diego'],
         'CA': ['Montreal','Toronto','Manitoba'],
         'UK': ['London','Liverpool','Manchester']      
       }

and I created a function that should look for the city from the df and match it with the dic, then create a column with the country name:

def get_country(x):
    count = 0
    for k,v in country.items():
        for y in v:
            if y.lower() in x:
                df[k] = count + 1
            else:
                return None
            

then applied it to df:

df.City.apply(lambda x: get_country(x.lower()))

I got the following output:

    City                              US
0   I lived Los Angeles               1
1   I visited London and Toronto      1
2   the best one is Toronto           1
3   business hub is in New York       1
4   Mexico city is stunning           1

Expected output:

    City                              US    CA   UK
0   I lived Los Angeles               1      0    0
1   I visited London and Toronto      0      1    1
2   the best one is Toronto           0      1    0
3   business hub is in New York       1      0    0
4   Mexico city is stunning           0      0    0

>Solution :

You can achieve this result using a lambda function to check if any city for each country is contained in the string, after first lower-casing the city names in country:

cl = { k : list(map(str.lower, v)) for k, v in country.items() }
for ctry, cities in cl.items():
    df[ctry] = df['City'].apply(lambda s:any(c in s.lower() for c in cities)).astype(int)

Output:

                           City  US  CA  UK
0           I lived Los Angeles   1   0   0
1  I visited London and Toronto   0   1   1
2       the best one is Toronto   0   1   0
3   business hub is in New York   1   0   0
4       Mexico city is stunning   0   0   0
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading