This is not the best approach but this what I did so far:
I have this example df:
df = pd.DataFrame({
'City': ['I lived Los Angeles', 'I visited London and Toronto','the best one is Toronto', 'business hub is in New York',' Mexico city is stunning']
})
df
gives:
City
0 I lived Los Angeles
1 I visited London and Toronto
2 the best one is Toronto
3 business hub is in New York
4 Mexico city is stunning
I am trying to match (case insensitive) city names from a nested dic and create a new column with the country name with int values for statistical purposes.
So, here is my nested dic as a reference for countries and cities:
country = { 'US': ['New York','Los Angeles','San Diego'],
'CA': ['Montreal','Toronto','Manitoba'],
'UK': ['London','Liverpool','Manchester']
}
and I created a function that should look for the city from the df and match it with the dic, then create a column with the country name:
def get_country(x):
count = 0
for k,v in country.items():
for y in v:
if y.lower() in x:
df[k] = count + 1
else:
return None
then applied it to df:
df.City.apply(lambda x: get_country(x.lower()))
I got the following output:
City US
0 I lived Los Angeles 1
1 I visited London and Toronto 1
2 the best one is Toronto 1
3 business hub is in New York 1
4 Mexico city is stunning 1
Expected output:
City US CA UK
0 I lived Los Angeles 1 0 0
1 I visited London and Toronto 0 1 1
2 the best one is Toronto 0 1 0
3 business hub is in New York 1 0 0
4 Mexico city is stunning 0 0 0
>Solution :
You can achieve this result using a lambda function to check if any city for each country is contained in the string, after first lower-casing the city names in country:
cl = { k : list(map(str.lower, v)) for k, v in country.items() }
for ctry, cities in cl.items():
df[ctry] = df['City'].apply(lambda s:any(c in s.lower() for c in cities)).astype(int)
Output:
City US CA UK
0 I lived Los Angeles 1 0 0
1 I visited London and Toronto 0 1 1
2 the best one is Toronto 0 1 0
3 business hub is in New York 1 0 0
4 Mexico city is stunning 0 0 0