What is the best way to match optional whole words with a Python regex?

I use regular expressions frequently, but usually in the same few ways. I sometimes run into a scenario where I’d like to capture strings that contain optional whole words. I’ve come up with the method below, but I suspect there’s a better way. An example is a string like this:

For the purposes of this order, the sum of $5,476,958.00 is the estimated total costs of the initial unit well covered hereby as dry hole and for the purposes of this order, the sum of $12,948,821.00 is the estimated total costs of such initial unit well as a producing well

My goal is to capture both portions of the string beginning with the dollar sign $ and ending with either word dry or prod. In the example the whole word is producing, but sometimes it’s a variation of the word such as production, so prod is fine. The captured results should be:

['$5,476,958.00 is the estimated total costs of the initial unit well covered hereby as dry', '$12,948,821.00 is the estimated total costs of such initial unit well as a prod']

which I get with this not so elegant expression:
[val[0] for val in re.findall('(\$[0-9,\.]+[a-z ,]+total cost.*?(dry|prod)+)', line, flags=re.IGNORECASE)]

Is there a better, more correct, way to accomplish it than this?

Solution:

We can use re.findall here:

import re

inp = "For the purposes of this order, the sum of $5,476,958.00 is the estimated total costs of the initial unit well covered hereby as dry hole and for the purposes of this order, the sum of $12,948,821.00 is the estimated total costs of such initial unit well as a producing well"
matches = re.findall(r'\$\d{1,3}(?:,\d{3})*(?:\.\d+)?.*?\b(?:dry|prod)', inp)
print(matches)

This prints:

['$5,476,958.00 is the estimated total costs of the initial unit well covered hereby as dry',
 '$12,948,821.00 is the estimated total costs of such initial unit well as a prod']
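The question also mentions variations such as production and uses re.IGNORECASE. As a quick check, here is a minimal sketch that runs the same pattern against those cases. The sample sentence is invented for illustration and is not taken from the original order.

```python
import re

# Same pattern as in the answer above
PATTERN = r'\$\d{1,3}(?:,\d{3})*(?:\.\d+)?.*?\b(?:dry|prod)'

# Hypothetical sample text (an assumption, not from the original order)
sample = ("the sum of $1,000.50 is the total cost as a Dry hole "
          "and $2,500,000 is the total cost for Production")

# IGNORECASE lets dry/prod match regardless of capitalization
matches = re.findall(PATTERN, sample, flags=re.IGNORECASE)
print(matches)
# ['$1,000.50 is the total cost as a Dry', '$2,500,000 is the total cost for Prod']
```

Because the pattern contains no capturing groups, re.findall returns the full matches directly, so the val[0] indexing from the question is not needed.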

Here is an explanation of the regex pattern being used:

  • \$ match currency symbol $
  • \d{1,3} match 1 to 3 digits
  • (?:,\d{3})* followed by optional thousands terms
  • (?:\.\d+)? followed by optional decimal component
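  • .*? lazily match any characters (except newlines) up to the nearest occurrence of what follows
  • \b(?:dry|prod) match dry or prod at the start of a word, so producing and production both match up to prod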
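The breakdown above can also be written into the pattern itself. The following is a sketch, not part of the original answer: it uses re.VERBOSE to comment each piece, and the group names amount and kind are illustrative assumptions added to pull out the dollar figure and the matched keyword separately.

```python
import re

# The answer's pattern, annotated with re.VERBOSE.
# The named groups "amount" and "kind" are illustrative additions.
COST_RE = re.compile(r"""
    (?P<amount>\$\d{1,3}(?:,\d{3})*(?:\.\d+)?)  # dollar amount with optional thousands and decimals
    .*?                                          # shortest run of text up to the keyword
    \b(?P<kind>dry|prod)                         # keyword at the start of a word
""", re.VERBOSE | re.IGNORECASE)

inp = ("For the purposes of this order, the sum of $5,476,958.00 is the estimated "
       "total costs of the initial unit well covered hereby as dry hole and for the "
       "purposes of this order, the sum of $12,948,821.00 is the estimated total "
       "costs of such initial unit well as a producing well")

# finditer yields match objects, so both the full match and the groups are available
pairs = [(m.group("amount"), m.group("kind").lower()) for m in COST_RE.finditer(inp)]
print(pairs)
# [('$5,476,958.00', 'dry'), ('$12,948,821.00', 'prod')]
```

If only the full matched strings are needed, m.group(0) returns the same text that re.findall produces in the answer.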