Extracting the first numerical value occurring after some token in text in Python

I have sentences of the following form. I want to extract all numeric values occurring after any given token. For example, I want to extract all numeric values after the phrase "tangible net worth".

Example sentences:

  1. "A company must maintain a minimum tangible net worth of $100000000 and leverage ratio of 0.5"
  2. "Minimum required tangible net worth the firm needs to maintain is $50000000".

From both of these sentences, I want to extract "$100000000" and "$50000000" and create a dictionary like this:

{
    "tangible net worth": "$100000000"
}

I am unsure how to use Python's re module to achieve this. One also needs to be careful here: a significant portion of the sentences contain multiple numeric values, so I want to extract only the value occurring immediately after the match. I have tried the following expressions, but none of them gives the desired result:

re.search(r'net worth.*(\d+)', sent)
re.search(r'(net worth)(.*)(\d+)', sent)
re.search(r'(net worth)(.*)(\d?)', sent)
re.findall(r'tangible net worth (.*)?(\d* )', sent)
re.findall(r'tangible net worth (.*)?( \d* )', sent)
re.findall(r'tangible net worth (.*)?(\d)', sent)

A little help with the regular expression will be highly appreciated. Thanks.

Solution:

You could use this regex:

tangible net worth\D*(\d+)

This skips any non-digit characters after "tangible net worth" and then captures the first run of digits that follows it.
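To see why the attempts in the question fail where this pattern works, here is a small comparison on the first example sentence. It is only an illustrative sketch; the variable names `greedy` and `first` are made up for this example.

```python
import re

sent = "A company must maintain a minimum tangible net worth of $100000000 and leverage ratio of 0.5"

# Greedy .* runs to the end of the string and backtracks only as far as
# needed for \d+ to match, so it captures the LAST digit in the sentence.
greedy = re.search(r'net worth.*(\d+)', sent)
print(greedy.group(1))   # 5

# \D* cannot cross a digit, so the capture starts at the first digit
# that appears after the phrase.
first = re.search(r'tangible net worth\D*(\d+)', sent)
print(first.group(1))    # 100000000
```

Because `\D*` stops at the first digit it meets, later numbers in the same sentence (such as the leverage ratio) are never considered.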

You can then place the result into a dict. I would recommend storing a number rather than a string, since you can always format it on output (adding $, thousands separators, etc.).

import re

strs = [
    "A company must maintain a minimum tangible net worth of $100000000 and leverage ratio of 0.5",
    "Minimum required tangible net worth the firm needs to maintain is $50000000"
]

result = []
for sent in strs:
    m = re.findall(r'tangible net worth\D*(\d+)', sent)
    if m:
        result += [{ 'tangible net worth' : int(m[0]) }]

print(result)

Output:

[
 {'tangible net worth': 100000000},
 {'tangible net worth': 50000000}
]
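
If the token varies, the same idea can be wrapped in a small helper. The sketch below is one possible generalization, not part of the original answer: the function name `first_number_after` is hypothetical, `re.escape` is used so the token is matched literally, and the pattern is extended with an optional decimal part on the assumption that values like 0.5 should be captured whole.

```python
import re

def first_number_after(token, text):
    """Return the first number after `token` in `text` as a string, or None.

    Hypothetical helper: the token is escaped so it is matched literally,
    and an optional fractional part is allowed after the digits.
    """
    pattern = re.escape(token) + r'\D*(\d+(?:\.\d+)?)'
    m = re.search(pattern, text)
    return m.group(1) if m else None

sent = "A company must maintain a minimum tangible net worth of $100000000 and leverage ratio of 0.5"

worth = int(first_number_after("tangible net worth", sent))
ratio = float(first_number_after("leverage ratio", sent))

# Store numbers, format only when displaying.
print({"tangible net worth": f"${worth:,}", "leverage ratio": ratio})
# {'tangible net worth': '$100,000,000', 'leverage ratio': 0.5}
```

The helper returns None when the token is missing or no number follows it, so callers should check the result before converting it.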