I am trying to remove every "M. " that appears at the beginning of a value in a column. This is supposed to be easy. Here is my code:
df['name'] = df['name'].str.replace('M. ', "", regex=True)
Here is a sample of my data:
name
M. ABAD John
M. BOULMÉ Jean
Mme BONO-VANDORME Anne
This is what I am obtaining:
name
ABAD John
BOULJean
Mme BONO-VANDORAnne
I find this result very weird. It seems that Python is confusing "E" with ".". Why is this happening? How should I correct the code?
>Solution :
The pandas str.replace() method differs from Python's built-in str.replace() in that it can treat its first argument as a regular expression, which is what regex=True asks for in your code. In a regular expression, the dot . matches any single character, so the string ME matches the regular expression M. just as well as the literal M. does. That is why the "ME " in BOULMÉ and VANDORME is removed too.
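To see the effect without pandas, here is a small sketch using Python's built-in re module on the sample names from the question. The variable names are only for illustration; the point is the difference between an unescaped and an escaped dot.

```python
import re

names = ["M. ABAD John", "M. BOULMÉ Jean", "Mme BONO-VANDORME Anne"]

# Unescaped dot: the pattern means "M", then ANY single character, then a space.
# So "MÉ " and "ME " match as well as "M. ".
unescaped = [re.sub(r"M. ", "", n) for n in names]

# Escaped dot: the pattern matches only a literal "M. ".
escaped = [re.sub(r"M\. ", "", n) for n in names]

print(unescaped)
print(escaped)
```

The first list reproduces the mangled names from the question, while the second leaves BOULMÉ and VANDORME intact.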
Therefore the solution in your case is to stop treating the first argument as a regular expression. With regex=False, str.replace performs a plain string substitution:
df['name'] = df['name'].str.replace('M. ', '', regex=False)
Note that since pandas 2.0, regex=False is the default, so you can omit this optional argument altogether. Beware, though, that in earlier versions the default was the opposite (regex=True).
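One more thing to keep in mind: a literal replacement removes "M. " wherever it occurs in the string, not only at the beginning. If you really want to touch only the start of each value, one option is to keep regex=True, escape the dot and anchor the pattern with ^. A minimal sketch, assuming a DataFrame built from the sample data in the question:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["M. ABAD John", "M. BOULMÉ Jean", "Mme BONO-VANDORME Anne"]}
)

# Literal replacement: removes every "M. " found anywhere in the string.
literal = df["name"].str.replace("M. ", "", regex=False)

# Anchored regex: "^" pins the match to the start of the string,
# and "\." matches a literal dot instead of any character.
anchored = df["name"].str.replace(r"^M\. ", "", regex=True)

print(literal.tolist())
print(anchored.tolist())
```

On this sample both variants give the same result; they would differ only for a name that contains "M. " somewhere after the first character. If your pandas version provides it (1.4 or later, as far as I know), df["name"].str.removeprefix("M. ") is another way to express the same intent.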