Regex Split Match AND Group

I have a small regex, (\d\.){2,}, to split a book into chapters. A chapter heading is recognized as a single digit followed by a dot, with that combination occurring at least twice. It should split only on chapter headings, not on single numbered list items. Here’s an example:

3.2.4.2. porta pellentesque   
139. Nunc maximus maximus aliquet? 
 a) dignissim 
 b) volutpat  
 c) ullamcorper  

3.2.4.3. ligula at condimentum fringilla  
152. Sed dapibus nulla mi, id lobortis ligula bibendum vehicula?  
 a) vestibulum   
 b) pellentesque   
 c) tempus   
 d) rutrum   
 
153. Lorem ipsum dolor sit amet. Sed iaculis lacus pellentesque, non auctor eros lobortis?  
 a) suscipit   
 b) vulputate   
 c) vestibulum   
 d) congue   
 
3.2.5. elementum quis  

It should be split at 3.2.4.2., 3.2.4.3. and 3.2.5. The regex builder recognizes the correct matches, but it always adds an unwanted group match at the end, and I can’t get rid of it. The result looks like this (one bullet is one split):

  • 3.2.4.
  • 2.

  • 3.2.4.
  • 3.

  • 3.2.
  • 5.

I want three splits, not nine. I tried greedy and lazy quantifiers and various groupings, but unfortunately I didn’t get it right. It may be worth mentioning that the whole thing should run in a Python project. For a better understanding, here is the link to the regex builder I used.

>Solution :

Your capturing group contains only one instance of the digit-dot pair, and the quantifier repeats that group, so the group ends up holding just the last repetition. To get all repetitions in one group, put the quantifier inside the group. Since you probably also want to discard the inner repeated group, use ?: to make it non-capturing.
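To make the difference concrete, here is a minimal sketch that runs both patterns against the first heading from the question. The variable names (`line`, `outer`, `inner`) are only illustrative.

```python
import re

line = "3.2.4.2. porta pellentesque"

# Quantifier outside the capturing group: the group is overwritten on
# every repetition, so it keeps only the last "digit dot" pair.
outer = re.search(r"(\d\.){2,}", line)
print(outer.group(0))  # 3.2.4.2.
print(outer.group(1))  # 2.

# Quantifier inside the capturing group, inner group non-capturing:
# group 1 now holds the whole chapter number.
inner = re.search(r"((?:\d\.){2,})", line)
print(inner.group(1))  # 3.2.4.2.

# findall returns group contents when the pattern has a capturing group,
# and whole matches when it has none.
print(re.findall(r"(\d\.){2,}", line))    # ['2.']
print(re.findall(r"(?:\d\.){2,}", line))  # ['3.2.4.2.']
```

This is also why the regex builder lists an extra entry after each match: it shows the full match followed by the content of group 1.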

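import re

# Raw string so that \d and \. reach the regex engine unchanged.
# re.MULTILINE only affects ^ and $, so it has no effect on this pattern.
r = re.compile(r"((?:\d\.){2,})", re.MULTILINE)
s = """3.2.4.2. porta pellentesque
139. Nunc maximus maximus aliquet?
...
"""

# With the full sample text from the question in s:
r.findall(s) # ['3.2.4.2.', '3.2.4.3.', '3.2.5.']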
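The findall call above returns the chapter numbers, not the chapter texts. If the goal is to cut the text into one piece per chapter, one possible approach is to collect the start offset of each heading and slice between them. The sketch below is an assumption about what is wanted, not part of the original answer: the helper name `split_chapters` is hypothetical, the pattern is anchored to the start of a line with ^ (which is where re.MULTILINE does matter), and the sample is a shortened copy of the text from the question.

```python
import re

# Chapter heading: two or more "digit dot" pairs at the start of a line.
CHAPTER = re.compile(r"^(?:\d\.){2,}", re.MULTILINE)

def split_chapters(text):
    """Return one string per chapter, each starting at its heading."""
    starts = [m.start() for m in CHAPTER.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[a:b].strip() for a, b in zip(starts, ends)]

sample = """3.2.4.2. porta pellentesque
139. Nunc maximus maximus aliquet?
 a) dignissim

3.2.4.3. ligula at condimentum fringilla
152. Sed dapibus nulla mi, id lobortis ligula bibendum vehicula?
 a) vestibulum

3.2.5. elementum quis
"""

chapters = split_chapters(sample)
for chapter in chapters:
    print(repr(chapter))
```

On this sample the result has three entries, one per heading. Lines such as "139." are not treated as headings because the digit-dot pair must repeat at least twice. Any text before the first heading is dropped by this sketch.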