Python regex – Extract all the matching text between two patterns

I want to extract the text that follows each bullet point numbered 1.1, 1.2, 1.3, etc. The bullet numbers sometimes contain spaces, as in 1. 1, 1. 2, 1 .3, or 1 . 4.

Sample text

    text = "some text before pattern 1.1 text_1_here  1.2 text_2_here  1 . 3 text_3_here  1. 4 text_4_here  1 .5 text_5_here 1.10 last_text_here 1.23 text after pattern"

For the text above, the output should be:

    [' text_1_here ', ' text_2_here ', ' text_3_here ', ' text_4_here ', ' text_5_here ', ' last_text_here ']

I tried re.findall but it does not give the required output. It extracts the text between 1.1 and 1.2, and then the text between 1.3 and 1.4, but it skips the text between 1.2 and 1.3.

    import re
    re.findall(r'[0-9].\s?[0-9]+(.*?)[0-9].\s?[0-9]+', text)

>Solution :

I'm not sure which rule excludes the last bit of text, but based on your comments we could simply split the entire text on the bullets and drop the first and last elements of the resulting list:

    re.split(r'\s+\d(?:\s*\.\s*\d+)+\s+', text)[1:-1]

Which would output:

    ['text_1_here', 'text_2_here', 'text_3_here', 'text_4_here', 'text_5_here', 'last_text_here']
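
As for why the findall attempt skips items: re.findall returns non-overlapping matches, and the pattern in the question consumes the closing bullet number as part of each match, so that bullet is no longer available to open the next match. The following is a minimal sketch of a findall-based alternative that puts the closing bullet in a lookahead so it is not consumed. It assumes every bullet is digits, a dot, and digits, with optional spaces around the dot, and is set off by whitespace. The names bullet, pattern, and items are illustrative and not part of the original answer.

```python
import re

text = ("some text before pattern 1.1 text_1_here  1.2 text_2_here  "
        "1 . 3 text_3_here  1. 4 text_4_here  1 .5 text_5_here "
        "1.10 last_text_here 1.23 text after pattern")

# A bullet number: digits, a literal dot, digits, with optional spaces around the dot.
bullet = r'\d+\s*\.\s*\d+'

# Match a bullet, capture the text after it lazily, and require (without
# consuming) that another bullet follows. Because the closing bullet sits in a
# lookahead, it can still start the next match.
pattern = rf'(?<!\S){bullet}\s+(.*?)\s+(?={bullet}(?!\S))'

items = re.findall(pattern, text)
print(items)
# ['text_1_here', 'text_2_here', 'text_3_here', 'text_4_here', 'text_5_here', 'last_text_here']
```

The text after the final bullet (1.23) is not returned, because no further bullet follows it to satisfy the lookahead.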