Regex – Match a string up to a digit or a specific string

I am working in Python and have a list of country names that I would like to clean. Most names are already written the way I want them. However, some have a one- or two-digit number attached, and others have text in parentheses appended. Here's a sample of that list:

Argentina
Australia1
Bolivia (Plurinational State of)
China, Hong Kong Special Administrative Region
Côte d'Ivoire
Curaçao
Guinea-Bissau
Indonesia8

The part that I want to capture would look like this:

Argentina
Australia
Bolivia
China, Hong Kong Special Administrative Region
Côte d'Ivoire
Curaçao
Guinea-Bissau
Indonesia

The best solution that I was able to come up with is ^[a-zA-Z\s,ô'ç-]+. However, for country names that are followed by text in parentheses, the match keeps a trailing white space.
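To make the problem concrete, here is a minimal sketch that runs the pattern above against three of the sample names. The variable names `attempt`, `samples`, and `results` are only illustrative.

```python
import re

# The character-class attempt quoted in the question.
attempt = re.compile(r"^[a-zA-Z\s,ô'ç-]+")

samples = ["Australia1", "Bolivia (Plurinational State of)", "Côte d'Ivoire"]

# The class contains \s, so the space before "(" is consumed as well.
results = [attempt.match(s).group() for s in samples]
print(results)  # ['Australia', 'Bolivia ', "Côte d'Ivoire"]
```

The digit case works, but 'Bolivia ' keeps the space that precedes the opening parenthesis.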

This means I would like to match the entire country name unless it is followed by a digit or by a white space and an opening parenthesis. In that case, the match should stop before the digit or before the white space.

I know that I could probably solve this in two steps, but I am also reasonably sure that it should be possible to define a pattern that does it in one step. Since I am in the process of getting familiar with regex anyway, I thought this would be a nice thing to know.
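For comparison, the two-step approach mentioned above might look like the following sketch. The function name `clean_two_step` is hypothetical, and the sketch assumes the unwanted suffix is always either a one- or two-digit number or a single parenthesised group at the end of the string.

```python
import re

def clean_two_step(name: str) -> str:
    """Strip a trailing number, then a trailing parenthesised suffix."""
    # Step 1: remove a one- or two-digit number at the end of the name.
    name = re.sub(r"\d{1,2}$", "", name)
    # Step 2: remove "(...)" at the end, together with the white space before it.
    name = re.sub(r"\s*\(.*\)$", "", name)
    return name

print(clean_two_step("Australia1"))                        # Australia
print(clean_two_step("Bolivia (Plurinational State of)"))  # Bolivia
```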

Solution:

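You can test the regex here: https://regex101.com/r/dupn18/1

This should do the trick. The first alternative matches up to a position that is followed by a digit or by a space and an opening parenthesis; the second alternative, a bare .+, takes the whole line when no such position exists: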

In [1]: import re

In [2]: pattern = re.compile(r'(.+(?=\d| \()|.+)')

In [3]: data = """Argentina
   ...: Australia1
   ...: Bolivia (Plurinational State of)
   ...: China, Hong Kong Special Administrative Region
   ...: Côte d'Ivoire
   ...: Curaçao
   ...: Guinea-Bissau
   ...: Indonesia8""".splitlines()

In [4]: [pattern.search(country).group() for country in data]
Out[4]:
['Argentina',
 'Australia',
 'Bolivia',
 'China, Hong Kong Special Administrative Region',
 "Côte d'Ivoire",
 'Curaçao',
 'Guinea-Bissau',
 'Indonesia']
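One caveat, offered as a sketch and not as part of the original answer: because .+ is greedy, the lookahead succeeds at the last qualifying position, so a two-digit suffix would lose only its final digit. A lazy quantifier stops at the first qualifying position instead. The names `greedy` and `lazy` and the input "Indonesia12" are assumptions for illustration; the sample list contains only single-digit suffixes.

```python
import re

# Pattern from the answer above (greedy .+ before the lookahead).
greedy = re.compile(r'(.+(?=\d| \()|.+)')

# Hypothetical lazy variant: stop at the first digit or opening parenthesis
# (including any white space in front of it), or at the end of the string.
lazy = re.compile(r'.+?(?=\s*\d|\s*\(|$)')

for name in ["Indonesia8", "Indonesia12", "Bolivia (Plurinational State of)"]:
    print(repr(greedy.search(name).group()), repr(lazy.search(name).group()))
# 'Indonesia' 'Indonesia'
# 'Indonesia1' 'Indonesia'
# 'Bolivia' 'Bolivia'
```

On the eight sample names both patterns give the same result; they differ only when more than one digit is attached.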