I use regular expressions frequently, but usually in the same few ways. I sometimes run into a scenario where I’d like to capture strings that contain optional whole words. I’ve come up with the method below, but I suspect there’s a better way; I’m just not sure what it is. An example is a string like this:
For the purposes of this order, the sum of $5,476,958.00 is the estimated total costs of the initial unit well covered hereby as dry hole and for the purposes of this order, the sum of $12,948,821.00 is the estimated total costs of such initial unit well as a producing well
My goal is to capture both portions of the string beginning with the dollar sign $ and ending with either word dry or prod. In the example the whole word is producing, but sometimes it’s a variation of the word such as production, so prod is fine. The captured results should be:
['$5,476,958.00 is the estimated total costs of the initial unit well covered hereby as dry', '$12,948,821.00 is the estimated total costs of such initial unit well as a prod']
which I get with this not so elegant expression:
[val[0] for val in re.findall(r'(\$[0-9,\.]+[a-z ,]+total cost.*?(dry|prod)+)', line, flags=re.IGNORECASE)]
Is there a better, more correct, way to accomplish it than this?
>Solution :
We can use re.findall here:
import re
inp = "For the purposes of this order, the sum of $5,476,958.00 is the estimated total costs of the initial unit well covered hereby as dry hole and for the purposes of this order, the sum of $12,948,821.00 is the estimated total costs of such initial unit well as a producing well"
matches = re.findall(r'\$\d{1,3}(?:,\d{3})*(?:\.\d+)?.*?\b(?:dry|prod)', inp)
print(matches)
This prints:
['$5,476,958.00 is the estimated total costs of the initial unit well covered hereby as dry',
'$12,948,821.00 is the estimated total costs of such initial unit well as a prod']
Here is an explanation of the regex pattern being used:
\$ : match the currency symbol $
\d{1,3} : match 1 to 3 digits
(?:,\d{3})* : followed by optional thousands terms
(?:\.\d+)? : followed by an optional decimal component
.*? : match all content until reaching the nearest
\b(?:dry|prod) : match dry or prod at the start of a word, so producing and production also match
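As a hedged sketch, the same pattern can be written with re.VERBOSE so that each piece carries its own comment, and with re.IGNORECASE added because the question's original attempt matched case-insensitively. The names COST_PHRASE and extract_cost_phrases, and the shortened sample string, are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as in the answer, spelled out with re.VERBOSE so each piece
# is commented. re.IGNORECASE is an assumption carried over from the question.
COST_PHRASE = re.compile(
    r"""
    \$\d{1,3}        # dollar sign, then 1 to 3 digits
    (?:,\d{3})*      # optional thousands groups such as ,476
    (?:\.\d+)?       # optional decimal component such as .00
    .*?              # lazily consume text up to the nearest keyword
    \b(?:dry|prod)   # "dry" or "prod" at the start of a word
    """,
    re.VERBOSE | re.IGNORECASE,
)


def extract_cost_phrases(text):
    """Return every '$amount ... dry/prod' span found in text (hypothetical helper)."""
    return COST_PHRASE.findall(text)


# Shortened, illustrative input with capitalized keywords.
sample = ("the sum of $5,476,958.00 is the estimated total costs as Dry hole and "
          "the sum of $12,948,821.00 is the estimated total costs as a Producing well")
phrases = extract_cost_phrases(sample)
print(phrases)
```

Because the pattern contains only non-capturing groups, findall returns the full matched text directly, so there is no need to index into tuples as in the question's list comprehension. Note that . does not match newlines by default; if a phrase can span several lines, adding re.DOTALL would be one option.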