Unable to Filter based on Substring in Pandas

August 9, 2023

There is a dataset in this form:

company_url         Name                  Revenue
mackter.com         Mack Sander           NaN
nientact.com        Neient Dan            321
ventienty.com       Richard               NaN

My task is to remove every row where the string 'tac', 'bux', or 'mvy' appears in either the 'company_url' or 'Name' column. For example, 'tac' appears in nientact.com, so that row should be deleted. The same applies to any row where one of these three strings occurs in company_url or Name. I first tried this on the company_url column alone with the code below, but it raises an error.

lists=['tac', 'bux', 'mvy']
for i in lists:
    df = df[~df['company_url'].str.contains(i)]

but it raises:
TypeError: unhashable type: 'list'

>Solution :

You can craft a regex to use with str.contains, then aggregate with any, invert with ~, and perform boolean indexing:

import re

lists = ['tac', 'bux', 'mvy']
pattern = '|'.join(map(re.escape, lists))
# 'tac|bux|mvy'

# True for rows where either column matches the pattern; ~ keeps the others
out = df[
    ~df[['company_url', 'Name']]
    .apply(lambda s: s.str.contains(pattern, case=False))
    .any(axis=1)
]

Output:

     company_url         Name  Revenue
0    mackter.com  Mack Sander      NaN
2  ventienty.com      Richard      NaN
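
For a self-contained check, the sketch below rebuilds the sample data from the question and applies the same mask. The helper name drop_rows_matching and the na=False argument are additions that are not in the answer above; na=False treats missing values as non-matches, which may matter if either text column contains NaN.

```python
import re

import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'company_url': ['mackter.com', 'nientact.com', 'ventienty.com'],
    'Name': ['Mack Sander', 'Neient Dan', 'Richard'],
    'Revenue': [np.nan, 321, np.nan],
})


def drop_rows_matching(frame, columns, substrings):
    """Hypothetical helper: drop rows where any column contains any substring."""
    # Escape each substring so regex metacharacters are matched literally.
    pattern = '|'.join(map(re.escape, substrings))
    mask = (
        frame[columns]
        # na=False (an assumption added here) counts missing values as non-matches.
        .apply(lambda s: s.str.contains(pattern, case=False, na=False))
        .any(axis=1)
    )
    return frame[~mask]


out = drop_rows_matching(df, ['company_url', 'Name'], ['tac', 'bux', 'mvy'])
print(out)
```

Only the nientact.com row contains one of the substrings, so rows 0 and 2 remain, matching the output shown above.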

For reference, here is a fixed version of your loop. It is less efficient, because it re-filters the DataFrame once per substring:

lists=['tac', 'bux', 'mvy']
for i in lists:
    df = df[~df[['company_url', 'Name']]
               .apply(lambda s: s.str.contains(i))
               .any(axis=1)]
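
If you prefer to keep a loop, one possible variation is to accumulate a single boolean mask and filter only once at the end. This is just a sketch: the sample frame is rebuilt from the question, the names mask, columns and substrings are introduced here, and regex=False (literal substring matching) and na=False (missing values count as non-matches) are choices that are not part of the answer above.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'company_url': ['mackter.com', 'nientact.com', 'ventienty.com'],
    'Name': ['Mack Sander', 'Neient Dan', 'Richard'],
    'Revenue': [np.nan, 321, np.nan],
})

substrings = ['tac', 'bux', 'mvy']
columns = ['company_url', 'Name']

# Start with "no row matches", then OR in each column/substring check.
mask = pd.Series(False, index=df.index)
for col in columns:
    for sub in substrings:
        # Case-sensitive literal match, like the loop above.
        mask |= df[col].str.contains(sub, regex=False, na=False)

# Filter once, keeping rows that matched nothing.
filtered = df[~mask]
print(filtered)
```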