re.sub() exception for email addresses – python

I have a question. With the following re.findall() call I am able to extract all email addresses from a *.txt file (file holds the text read from that file).

emails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", file)

Now I'd like to remove all punctuation marks from this *.txt file, because it also contains ordinary text.

I removed the punctuation marks with

output = re.sub(r'[^\w\s]', '', file)

but this call also removes the punctuation marks inside the email addresses in the text. How do I write an exception for the email addresses in this re.sub()?

Thank you.

>Solution :

You can use

re.sub(r"([a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+)|[^\w\s]", r"\1", file)

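Here, the email pattern is captured into Group 1, and the \1 backreference in the replacement pattern restores the email text in the resulting string. When the second alternative, [^\w\s], matches a punctuation char instead, Group 1 does not participate in the match, so \1 expands to an empty string and the char is removed (unmatched groups are replaced with an empty string since Python 3.5).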
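To make the role of the capturing group concrete, here is a minimal sketch that lists what each match contributes. The sample string is made up for illustration; only the pattern comes from the answer.

```python
import re

# Pattern from the answer: email in group 1, or a single punctuation char.
pattern = re.compile(r"([a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+)|[^\w\s]")

sample = "Hi, mail john.doe@example.com!"  # hypothetical sample text

# For every match, record the full match and the content of group 1.
# Group 1 is None when the punctuation alternative matched.
matches = [(m.group(0), m.group(1)) for m in pattern.finditer(sample)]

# Replacing each match with \1 keeps emails and drops everything else matched.
cleaned = pattern.sub(r"\1", sample)

print(matches)
print(cleaned)
```

The comma and the exclamation mark are matched with an empty Group 1 and disappear, while the email is matched as a whole and written back unchanged. Note that the character classes only list lowercase letters, so passing flags=re.IGNORECASE may be needed if addresses can contain uppercase characters.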

Note that [^\w\s] matches any char other than word and whitespace chars. Since an underscore is a word char, it is not matched and stays in the text. If you want to remove underscores too, add _ as another alternative:

re.sub(r"([a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+)|[^\w\s]|_", r"\1", file)
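The difference between the two variants can be sketched as follows. The helper name strip_punctuation, its remove_underscores parameter, and the sample text are assumptions introduced for illustration; they are not part of the original answer.

```python
import re

# Email pattern from the answer, kept in one place so both variants share it.
EMAIL = r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+"

def strip_punctuation(text, remove_underscores=False):
    """Remove punctuation outside email addresses (hypothetical helper)."""
    # The email alternative comes first, so it wins wherever an email starts.
    tail = r"[^\w\s]|_" if remove_underscores else r"[^\w\s]"
    return re.sub(rf"({EMAIL})|{tail}", r"\1", text)

sample = "snake_case, see jane_d@mail.org."  # made-up sample text

kept = strip_punctuation(sample)
dropped = strip_punctuation(sample, remove_underscores=True)

print(kept)     # underscores outside emails survive
print(dropped)  # underscores outside emails are removed
```

In both cases the underscore inside jane_d@mail.org is preserved, because the whole address is consumed by the first alternative before the underscore alternative gets a chance to match.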