Vectorized str.replace for multiple characters in pandas

I have a dataframe:

 {'country': {0: 'Afghanistan?*', 1: 'Albania?*'},
 'region': {0: 'Asia', 1: 'Europe'},
 'subregion': {0: 'Southern Asia', 1: 'Southern Europe'},
 'rate_per_1000': {0: 6.7, 1: 2.1},
 'count': {0: '2,474', 1: '61'},
 'year': {0: 2018, 1: 2020},
 'source': {0: 'NSO', 1: 'NSO'}}

          country  region        subregion  rate_per_1000  count  year source
0   Afghanistan?*    Asia    Southern Asia            6.7  2,474  2018    NSO
1       Albania?*  Europe  Southern Europe            2.1     61  2020    NSO

There are multiple bad characters here that I want to get rid of. I wrote a short function to pass to .apply(), but it loops over a hard-coded list of bad characters. That has a code smell to me; I suspect this operation could be vectorized somehow. This is what I've tried:

bad_chars = ['?', '*', ',']

def string_cleaner(col):
    if col.dtype == 'object':
        for char in bad_chars:
            col = col.str.replace(f'{char}', '')
        return col

homicide_by_country = homicide_by_country.apply(string_cleaner)
homicide_by_country
        country  region        subregion rate_per_1000 count  year source
0   Afghanistan    Asia    Southern Asia          None  2474  None    NSO
1       Albania  Europe  Southern Europe          None    61  None    NSO

My desired outcome is a more pythonic/pandonic technique for accomplishing the same outcome.


edit: You may notice that, for some reason, my rate_per_1000 and year columns go blank. I haven't troubleshot that problem yet, but if you spot something obvious I'm all ears.

Solution:

It seems like you need df.replace with regex=True:

import re

df.replace('|'.join(map(re.escape, bad_chars)), '', regex=True)

Notice that this keeps the original dtypes of your columns, so there is no need to worry about numeric columns.
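For instance, on (a subset of the columns of) the dataframe from the question, the one-liner cleans the object columns and leaves the float column's dtype untouched; a minimal sketch:

```python
import re

import pandas as pd

bad_chars = ['?', '*', ',']

df = pd.DataFrame({
    'country': ['Afghanistan?*', 'Albania?*'],
    'rate_per_1000': [6.7, 2.1],
    'count': ['2,474', '61'],
})

# One alternation pattern that matches any bad character, each escaped
# so that ? and * are taken literally rather than as regex operators.
pattern = '|'.join(map(re.escape, bad_chars))
cleaned = df.replace(pattern, '', regex=True)

print(cleaned)
print(cleaned.dtypes)  # rate_per_1000 is still float64
```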

Also, note that the pattern needs special treatment, because ?, * etc. are special characters in regular expressions, so you need to escape these characters.


In more detail, re.escape adds a \ in front of every character in bad_chars that is special in a regex. This is necessary to make sure each special character is matched literally, as is.
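A quick check of what re.escape does to each of the three characters (on Python 3.7+, only characters that are special in a regex get a backslash):

```python
import re

# Escape each bad character individually.
escaped = [re.escape(ch) for ch in ['?', '*', ',']]
print(escaped)  # ? and * get a backslash; the comma is left alone
```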

Take this example from the Python docs:

The first metacharacter for repeating things that we’ll look at is *. 
* doesn’t match the literal character '*'; 
instead, it specifies that the previous character can be 
matched zero or more times, instead of exactly once.

This means that if you just use df.replace("*", "", regex=True), the engine interprets * as a metacharacter, not as an asterisk, and fails because there is no preceding character for * to repeat. However, if you do df.replace(r"\*", "", regex=True), then * is interpreted as a literal asterisk. "Escaping" means this action of adding a \ to indicate you mean the literal character.
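You can see the difference with the re module directly: a bare * is not even a valid pattern, while the escaped form matches a literal asterisk; a minimal illustration:

```python
import re

# A bare * is invalid: there is nothing for it to repeat.
try:
    re.compile('*')
    compiled_ok = True
except re.error as exc:
    compiled_ok = False
    print('invalid pattern:', exc)

# Escaped, * matches a literal asterisk.
result = re.sub(r'\*', '', 'Albania?*')
print(result)  # Albania?
```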

Now, | is also a metacharacter: it means alternation, i.e. "or". Here, we do NOT want to escape the pipe, because we want it interpreted with its regex meaning.

The expression below

'|'.join(map(re.escape, bad_chars))

will yield something like

\?|\*|,

which means we want to replace all \? and \* and , with empty strings.
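As for the edit in the question about rate_per_1000 (and year) going blank: string_cleaner only returns a value for object columns; for every other dtype it falls off the end and implicitly returns None, which .apply() then uses as the entire column. A minimal sketch of a fix is to return the column unchanged in that case (regex=False is passed so ? and * need no escaping here):

```python
import pandas as pd

bad_chars = ['?', '*', ',']

def string_cleaner(col):
    if col.dtype == 'object':
        for char in bad_chars:
            # regex=False treats each bad character as a literal string.
            col = col.str.replace(char, '', regex=False)
    # Return the column in every case; the original function implicitly
    # returned None for non-object columns, blanking them out.
    return col

df = pd.DataFrame({'country': ['Afghanistan?*'], 'rate_per_1000': [6.7]})
cleaned_df = df.apply(string_cleaner)
print(cleaned_df)
```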
