What is the best way to match optional whole words with Python regex?

February 25, 2023

I use regular expressions frequently, but usually in the same few ways. I sometimes run into a scenario where I’d like to capture strings that contain optional whole words. I’ve come up with the method below, but I suspect there’s a better way; I’m just not sure what it is. An example is a string like this:

For the purposes of this order, the sum of $5,476,958.00 is the estimated total costs of the initial unit well covered hereby as dry hole and for the purposes of this order, the sum of $12,948,821.00 is the estimated total costs of such initial unit well as a producing well

My goal is to capture both portions of the string, each beginning with the dollar sign $ and ending with either the word dry or prod. In the example the whole word is producing, but sometimes it’s a variation such as production, so stopping at prod is fine. The captured results should be:

['$5,476,958.00 is the estimated total costs of the initial unit well covered hereby as dry', '$12,948,821.00 is the estimated total costs of such initial unit well as a prod']

which I get with this not-so-elegant expression:
[val[0] for val in re.findall(r'(\$[0-9,\.]+[a-z ,]+total cost.*?(dry|prod)+)', line, flags=re.IGNORECASE)]

Is there a better, more correct way to accomplish this?

>Solution :

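We can still use re.findall here, but with a stricter pattern for the dollar amount and a non-capturing group for the final word, so each match comes back as a plain string: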

import re

inp = "For the purposes of this order, the sum of $5,476,958.00 is the estimated total costs of the initial unit well covered hereby as dry hole and for the purposes of this order, the sum of $12,948,821.00 is the estimated total costs of such initial unit well as a producing well"
matches = re.findall(r'\$\d{1,3}(?:,\d{3})*(?:\.\d+)?.*?\b(?:dry|prod)', inp)
print(matches)

This prints:

['$5,476,958.00 is the estimated total costs of the initial unit well covered hereby as dry',
 '$12,948,821.00 is the estimated total costs of such initial unit well as a prod']
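No list comprehension is needed because re.findall returns tuples only when the pattern contains more than one capturing group; with non-capturing groups (?:...) it returns the whole match as a string. The sketch below illustrates the difference on a shortened, made-up sample string (the names sample, with_groups, and without_groups are introduced here for illustration only). It also passes re.IGNORECASE, as the question did, in case the real data contains "Dry" or "Producing":

```python
import re

# Shortened, hypothetical sample text (not the original order text).
sample = ("the sum of $1,000.00 is the cost as dry hole and "
          "the sum of $2,500.50 is the cost as a Producing well")

# Two capturing groups: findall returns one tuple per match,
# which is why the question needed val[0].
with_groups = re.findall(r'(\$[0-9,.]+.*?(dry|prod))', sample,
                         flags=re.IGNORECASE)

# Non-capturing group: findall returns each whole match as a string.
without_groups = re.findall(r'\$[0-9,.]+.*?\b(?:dry|prod)', sample,
                            flags=re.IGNORECASE)

print(with_groups)
print(without_groups)
```

The first call yields a list of (whole match, final word) tuples, while the second yields a flat list of strings.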

Here is an explanation of the regex pattern being used:
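One way to read the pattern piece by piece is to write it out with re.VERBOSE, which ignores whitespace inside the pattern and allows inline comments. This is a sketch of the same pattern as above, not a different technique; the name PATTERN is introduced here for illustration:

```python
import re

PATTERN = re.compile(r"""
    \$                 # a literal dollar sign
    \d{1,3}            # one to three leading digits
    (?:,\d{3})*        # zero or more groups of a comma plus three digits
    (?:\.\d+)?         # an optional decimal part
    .*?                # as few characters as possible (lazy match)
    \b(?:dry|prod)     # 'dry' or 'prod' at the start of a word
""", re.VERBOSE)

inp = ("For the purposes of this order, the sum of $5,476,958.00 is the "
       "estimated total costs of the initial unit well covered hereby as "
       "dry hole and for the purposes of this order, the sum of "
       "$12,948,821.00 is the estimated total costs of such initial unit "
       "well as a producing well")

matches = PATTERN.findall(inp)
print(matches)
```

The word boundary \b sits only before the alternation, so prod matches the beginning of producing or production without requiring the word to end there. Because . does not match newlines by default, text that spans multiple lines would also need re.DOTALL.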