Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

filtering a pandas dataframe string column using the `str.contains` method

I have dataframe looking like this, where long_category reflects the category of businesses in the rows:

df = pd.DataFrame({
 'long_category': {0: 'Doctors, Traditional Chinese Medicine, Naturopathic/Holistic, Acupuncture, Health & Medical, Nutritionists',
  1: 'Shipping Centers, Local Services, Notaries, Mailbox Centers, Printing Services',
  2: 'Department Stores, Shopping, Fashion, Home & Garden, Electronics, Furniture Stores',
  3: 'Restaurants, Food, Bubble Tea, Coffee & Tea, Bakeries',
  4: 'Brewpubs, Breweries, Food',
  5: 'Burgers, Fast Food, Sandwiches, Food, Ice Cream & Frozen Yogurt, Restaurants',
  6: 'Sporting Goods, Fashion, Shoe Stores, Shopping, Sports Wear, Accessories',
  7: 'Synagogues, Religious Organizations',
  8: 'Pubs, Restaurants, Italian, Bars, American (Traditional), Nightlife, Greek',
  9: 'Ice Cream & Frozen Yogurt, Fast Food, Burgers, Restaurants, Food'}})

df:

enter image description here

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

My goal is to shorten those categories into the below known_categories based on whether or not a long category contains a keyword in the known_categories:

  known_categories = ['restaurant', 'beauty & spas', 'hotels', 'health & medical', 'shopping', 'coffee & tea','automotive', 'pets|veterinian', 'services', 'stores',  'grocery', 'ice cream']

So, I do:

df['short_category'] = np.nan
for cat in known_categories:
    excluded_cats = [x for x in known_categories if x!= cat]
    df['short_category'] [ ~(df.long_category.str.contains('|'.join(excluded_cats), regex = True, case = False, na = False)) & (df.long_category.str.contains(cat, regex = True, case = False, na = False))] = cat

It is important that the short categories are mutually exclusive. For example, row indexed 3 should be put in either "restaurant" or "coffee & tea" short category, hence the ~(df.long_category.str.contains('|'.join(excluded_cats), regex = True, case = False, na = False)) condition above.

But this is not working as can be seen below. For example, both of the last two rows have "restaurant" in their long category but only the first one’s short category has captured that. I was expecting the last one to be either ‘restaurant’ or ‘ice cream’ because it has those keywords from short_categories. So, where have I gone wrong? As a side note, I would like to be able to impact the frequency of the short categories by moving the category further to the front or back of known_categories if possible. For example, with the current short_categories, I’d like the last row to have ‘restaurant’ as short category. But if I move ‘ice cream’ before ‘restaurant’ in short_categories, I would like the last row to show ‘ice cream’ and not ‘restaurant’

enter image description here

>Solution :

Go through your categories list backwards (assuming more specific categories come later) and only set it as the short category if no previous category has been assigned:

df['short_category'] = ''

for cat in known_categories[::-1]:
    contains_cat = df['long_category'].str.contains(cat, case = False, regex = True)
    no_category = df['short_category'] == ''

    df.loc[contains_cat & no_category, 'short_category'] = cat
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading