Vectorized str.replace for multiple characters in pandas

I have a dataframe:

 {'country': {0: 'Afghanistan?*', 1: 'Albania?*'},
 'region': {0: 'Asia', 1: 'Europe'},
 'subregion': {0: 'Southern Asia', 1: 'Southern Europe'},
 'rate_per_1000': {0: 6.7, 1: 2.1},
 'count': {0: '2,474', 1: '61'},
 'year': {0: 2018, 1: 2020},
 'source': {0: 'NSO', 1: 'NSO'}}

          country  region        subregion  rate_per_1000  count  year source
0   Afghanistan?*    Asia    Southern Asia            6.7  2,474  2018    NSO
1       Albania?*  Europe  Southern Europe            2.1     61  2020    NSO

There are multiple bad characters here that I want to get rid of. I wrote a short function to pass to .apply(), but it loops over a hard-coded list of bad characters. That has a code smell to me; I suspect this operation could be vectorized somehow. This is what I've tried:

bad_chars = ['?', '*', ',']

def string_cleaner(col):
    if col.dtype == 'object':
        for char in bad_chars:
            col = col.str.replace(f'{char}', '')
        return col

homicide_by_country = homicide_by_country.apply(string_cleaner)
homicide_by_country
        country  region        subregion rate_per_1000 count  year source
0   Afghanistan    Asia    Southern Asia          None  2474  None    NSO
1       Albania  Europe  Southern Europe          None    61  None    NSO

My desired outcome is a more pythonic/pandonic technique for accomplishing the same outcome.


edit: You may notice that, for some reason, my rate_per_1000 and year columns go blank. I haven't troubleshot that problem yet, but if you spot something obvious I'm all ears.

Solution:

It seems like you need df.replace with regex=True:

import re

df.replace('|'.join(map(re.escape, bad_chars)), '', regex=True)

Notice that this keeps the original dtypes of your columns, so there is no need to worry about numeric columns.
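For instance, on (a subset of the columns of) the dataframe from the question, the one-liner cleans the object columns and leaves the float column's dtype untouched; a minimal sketch:

```python
import re

import pandas as pd

bad_chars = ['?', '*', ',']

df = pd.DataFrame({
    'country': ['Afghanistan?*', 'Albania?*'],
    'rate_per_1000': [6.7, 2.1],
    'count': ['2,474', '61'],
})

# One alternation pattern that matches any bad character, each escaped
# so that ? and * are taken literally rather than as regex operators.
pattern = '|'.join(map(re.escape, bad_chars))
cleaned = df.replace(pattern, '', regex=True)

print(cleaned)
print(cleaned.dtypes)  # rate_per_1000 is still float64
```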

Also, note that the pattern needs special treatment, because ?, * etc. are special characters in regular expressions, so you need to escape these characters.


In more detail, re.escape adds a \ in front of every character in bad_chars that is special in a regex. This is necessary to make sure each special character is matched literally, as is.
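A quick check of what re.escape does to each of the three characters (on Python 3.7+, only characters that are special in a regex get a backslash):

```python
import re

# Escape each bad character individually.
escaped = [re.escape(ch) for ch in ['?', '*', ',']]
print(escaped)  # ? and * get a backslash; the comma is left alone
```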

Take this example from the Python docs:

The first metacharacter for repeating things that we’ll look at is *. 
* doesn’t match the literal character '*'; 
instead, it specifies that the previous character can be 
matched zero or more times, instead of exactly once.

This means that if you just use df.replace("*", "", regex=True), the engine interprets * as a metacharacter, not as an asterisk, and fails because there is no preceding character for * to repeat. However, if you do df.replace(r"\*", "", regex=True), then * is interpreted as a literal asterisk. "Escaping" means this action of adding a \ to indicate you mean the literal character.
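You can see the difference with the re module directly: a bare * is not even a valid pattern, while the escaped form matches a literal asterisk; a minimal illustration:

```python
import re

# A bare * is invalid: there is nothing for it to repeat.
try:
    re.compile('*')
    compiled_ok = True
except re.error as exc:
    compiled_ok = False
    print('invalid pattern:', exc)

# Escaped, * matches a literal asterisk.
result = re.sub(r'\*', '', 'Albania?*')
print(result)  # Albania?
```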

Now, | is also a metacharacter: it means alternation, i.e. "or". Here, we do NOT want to escape the pipe, because we want it interpreted with its regex meaning.

The expression below

'|'.join(map(re.escape, bad_chars))

will yield something like

\?|\*|,

which means we want to replace all \? and \* and , with empty strings.
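As for the edit in the question about rate_per_1000 (and year) going blank: string_cleaner only returns a value for object columns; for every other dtype it falls off the end and implicitly returns None, which .apply() then uses as the entire column. A minimal sketch of a fix is to return the column unchanged in that case (regex=False is passed so ? and * need no escaping here):

```python
import pandas as pd

bad_chars = ['?', '*', ',']

def string_cleaner(col):
    if col.dtype == 'object':
        for char in bad_chars:
            # regex=False treats each bad character as a literal string.
            col = col.str.replace(char, '', regex=False)
    # Return the column in every case; the original function implicitly
    # returned None for non-object columns, blanking them out.
    return col

df = pd.DataFrame({'country': ['Afghanistan?*'], 'rate_per_1000': [6.7]})
cleaned_df = df.apply(string_cleaner)
print(cleaned_df)
```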
