Assume I have a multiline text file like this (note label1, label2, label3):
label1 Lorem ipsum dolor sit amet, consectetur adipiscing
Lorem ipsum dolor-
Lorem ipsum dolor sit amet (dolor sit)
Hint Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent tellus
lkorem ipsum dolor.
Start Value : 0.25 [kg]
label2 Lorem ipsum dolor sit amer, consecflsşl
the solor dolar
Start Value : 8000 [mg]
label3 Start Value : 0.3 [kg]
Apart from the lorem ipsum text, the file contains label1, label2, and label3, which I also have as strings in a list.
For every label, I need to get its "Start Value : xxx [yy]" string. The results can be stored in a list or a dictionary.
For example, for label1 in this text I need to get: "Start Value : 0.25 [kg]"
There may be lines between a label and its start value, or the start value may be on the same line as the label, as in the last line.
My idea is to use a regex to find the region that starts with the label name and ends with a string that starts with "Start Value : " and ends with "]". How can I complete this task?
So far I have tried re.findall(…) but could not figure out how to use it.
>Solution :
One approach would be to first use re.findall to find the text block belonging to each label, then iterate over that result and find the Start Value line in each block.
import re

inp = """label1 Lorem ipsum dolor sit amet, consectetur adipiscing
Lorem ipsum dolor-
Lorem ipsum dolor sit amet (dolor sit)
Hint Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Praesent tellus
lkorem ipsum dolor.
Start Value : 0.25 [kg]
label2 Lorem ipsum dolor sit amer, consecflsşl
the solor dolar
Start Value : 8000 [mg]
label3 Start Value : 0.3 [kg]"""
labels = ["label1", "label2", "label3"]
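# Alternation matching any label as a whole word; re.escape guards against regex metacharacters in label names.
regex = r'\b(?:' + r'|'.join(map(re.escape, labels)) + r')\b'
# Each block starts with a label at the beginning of a line and runs lazily up to the next label or the end of the input.
matches = re.findall(r'(?:^|(?<=\n))' + regex + r'.*?(?=' + regex + r'|$)', inp, flags=re.S)
# Pull the Start Value string out of each block. This assumes every block has one; otherwise re.search returns None and .group() raises.
output = [re.search(r'\bStart Value\s*:\s*\d+(?:\.\d+)?\s*\[\w+\]', x).group() for x in matches]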
print(output)
# ['Start Value : 0.25 [kg]', 'Start Value : 8000 [mg]', 'Start Value : 0.3 [kg]']
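Since the question also mentions a dictionary, here is a rough sketch of a variant that maps each label to its Start Value string and tolerates labels that have no value. The function name start_values and the short sample text are made up for illustration, and it assumes every label sits at the start of a line.

```python
import re

def start_values(text, labels):
    """Map each label to its 'Start Value : ... [unit]' string, or None if absent.

    Hypothetical helper; assumes each label appears at the start of a line.
    """
    # Escape the labels so regex metacharacters in a name are matched literally.
    label_re = r'(?:' + '|'.join(re.escape(name) for name in labels) + r')'
    # A block is a label at a line start plus everything up to the next
    # line that starts with a label, or the end of the text.
    block_re = re.compile(
        r'^(' + label_re + r')\b(.*?)(?=^' + label_re + r'\b|\Z)',
        flags=re.S | re.M,
    )
    value_re = re.compile(r'Start Value\s*:\s*\d+(?:\.\d+)?\s*\[\w+\]')

    result = {label: None for label in labels}
    for label, body in block_re.findall(text):
        found = value_re.search(body)
        if found:
            result[label] = found.group()
    return result

sample = (
    "label1 some text\n"
    "more text\n"
    "Start Value : 0.25 [kg]\n"
    "label2 no value here\n"
    "label3 Start Value : 0.3 [kg]"
)
print(start_values(sample, ["label1", "label2", "label3"]))
# {'label1': 'Start Value : 0.25 [kg]', 'label2': None, 'label3': 'Start Value : 0.3 [kg]'}
```

Anchoring the lookahead to a line start means a label name mentioned in the middle of a sentence does not cut a block short, and the None default avoids the AttributeError you would get from calling .group() on a failed search.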