RegEx to Search in Multiline Text

Assume I have a multiline text file like this (note label1, label2, label3):

label1 Lorem ipsum dolor sit amet, consectetur adipiscing 
Lorem ipsum dolor-
Lorem ipsum dolor sit amet (dolor sit)
Hint Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent tellus
lkorem ipsum dolor.
Start Value : 0.25 [kg]
label2 Lorem ipsum dolor sit amer, consecflsşl
the solor dolar
Start Value : 8000 [mg]
label3 Start Value : 0.3 [kg]

Aside from the lorem ipsum text, I have label1, label2, and label3 as strings in a list.

For every label, I need to get its "Start Value : xxx [yy]" string. The results can be stored in a list or a dictionary.

For example, for label1 in this text I need to get: "Start Value : 0.25 [kg]"

There may be lines between a label and its start value, or the two may be on the same line, as in the last line.

My idea is to use a regex to find regions of text that start with a label name and end with a string that starts with "Start Value : " and ends with "]". How can I do this?

So far I have tried re.findall(…) but could not figure out how to use it.

>Solution :

One approach would be to first use re.findall to find the text block belonging to each label. Then iterate over that result and find the Start Value line for each label.

import re

inp = """label1 Lorem ipsum dolor sit amet, consectetur adipiscing
Lorem ipsum dolor-
Lorem ipsum dolor sit amet (dolor sit)
Hint Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Praesent tellus
lkorem ipsum dolor.
Start Value : 0.25 [kg]
label2 Lorem ipsum dolor sit amer, consecflsşl
the solor dolar
Start Value : 8000 [mg]
label3 Start Value : 0.3 [kg]"""

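labels = ["label1", "label2", "label3"]
# Alternation of the labels, escaped so they are matched literally.
regex = r'\b(?:' + r'|'.join(map(re.escape, labels)) + r')\b'
# Each match is one block: from a label at the start of a line up to the
# next occurrence of a label or the end of the input.
matches = re.findall(r'(?:^|(?<=\n))' + regex + r'.*?(?=' + regex + r'|$)', inp, flags=re.S)
# Pull the "Start Value : <number> [<unit>]" text out of each block.
# Note: re.search returns None if a block has no such line, so .group() would raise.
output = [re.search(r'\bStart Value\s*:\s*\d+(?:\.\d+)?\s*\[\w+\]', x).group() for x in matches]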
print(output)

# ['Start Value : 0.25 [kg]', 'Start Value : 8000 [mg]', 'Start Value : 0.3 [kg]']
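
The question also allows the result to be a dictionary. The following is a minimal sketch of that variant, not part of the original answer: the function name start_values_by_label, the sample text, and the extra label4 are hypothetical and only there for illustration. It assumes each label appears at the start of a line, and it records None for a label whose block has no Start Value line instead of raising an error.

```python
import re


def start_values_by_label(text, labels):
    """Map each label to its 'Start Value : ... [..]' string, or None if absent."""
    # Escape the labels so any regex metacharacters are matched literally.
    label_alt = "|".join(re.escape(label) for label in labels)
    # A block starts with a label at the beginning of a line and runs lazily
    # until the next line that starts with a label, or the end of the text.
    block_re = re.compile(
        rf"^({label_alt})\b(.*?)(?=^(?:{label_alt})\b|\Z)",
        flags=re.S | re.M,
    )
    value_re = re.compile(r"Start Value\s*:\s*\d+(?:\.\d+)?\s*\[\w+\]")

    result = {label: None for label in labels}
    for label, body in block_re.findall(text):
        found = value_re.search(body)
        if found:
            result[label] = found.group()
    return result


# Hypothetical sample: label4 has no Start Value line.
sample = """label1 Lorem ipsum dolor sit amet
Lorem ipsum dolor (dolor sit)
Start Value : 0.25 [kg]
label2 the solor dolar
Start Value : 8000 [mg]
label3 Start Value : 0.3 [kg]
label4 nothing to see here"""

values = start_values_by_label(sample, ["label1", "label2", "label3", "label4"])
print(values)
# {'label1': 'Start Value : 0.25 [kg]', 'label2': 'Start Value : 8000 [mg]', 'label3': 'Start Value : 0.3 [kg]', 'label4': None}
```

Because each block is captured together with its label, the result does not depend on the order of the labels list, and a missing value shows up as None rather than shifting the remaining entries.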