How to properly extract blocks of data from a file using my RegEx string?

Introduction

I am trying to use a RegEx to parse information that is structured like this:

1. Data
  A. Data sub 1
  B. Data sub 2
2. Data
  A. Data sub 1
  B. Data sub 2
  C. Data sub 3
  D. Data sub 4  
3. Data
  A. Data sub 1

Each piece of information is on its own line. I could go line by line, but I believe a RegEx would be sufficient to solve this.

Intention

I would like to extract it block by block, where a block would be:

1. Data
  A. Data sub 1
  B. Data sub 2

My attempt

I noticed that there is a pattern in this data and thought that I could extract it using the following RegEx:

(?s)(?=1.)(.*?)(?=(2. ))

This successfully extracts a block, but if the block contains a number that also appears in the expression, the extracted block is incomplete, which corrupts the output file.
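
The truncation can be reproduced with a small sketch. The sample text below is hypothetical (it is not from the original data) and was chosen so that the first block mentions "2. " inside a sub-item. Because the dots in the pattern are unescaped, each one matches any character, and the lazy `.*?` stops at the first place where the end lookahead succeeds:

```python
import re

# Hypothetical sample: the first block mentions "2. " inside a sub-item.
text = """1. Data
  A. See item 2. below
  B. Data sub 2
2. Data
  A. Data sub 1"""

# The pattern from the attempt above; the unescaped "." matches any character.
attempt = re.compile(r'(?s)(?=1.)(.*?)(?=(2. ))')

match = attempt.search(text)
truncated = match.group(1)
print(repr(truncated))  # → '1. Data\n  A. See item '
```

The match ends in the middle of sub-item A rather than at the start of block 2, which is the incomplete extraction described above.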

What I expect

I would like to extract the data blocks without being interrupted by a string or char found between the defined start and end.

>Solution :

I would use re.split here, splitting on a newline (together with any whitespace right before it) when the next line starts with one or more digits and a dot, \d+\.:

text = '''1. Data
  A. Data sub 1
  B. Data sub 2
2. Data
  A. Data sub 1
  B. Data sub 2
  C. Data sub 3
  D. Data sub 4  
3. Data
  A. Data sub 1'''

import re

blocks = re.split(r'\s*\n(?=\d+\.)', text)

output:

['1. Data\n  A. Data sub 1\n  B. Data sub 2',
 '2. Data\n  A. Data sub 1\n  B. Data sub 2\n  C. Data sub 3\n  D. Data sub 4',
 '3. Data\n  A. Data sub 1']

In a loop:

for block in re.split(r'\s*\n(?=\d+\.)', text):
    print('--- NEW BLOCK ---')
    print(block)

output:

--- NEW BLOCK ---
1. Data
  A. Data sub 1
  B. Data sub 2
--- NEW BLOCK ---
2. Data
  A. Data sub 1
  B. Data sub 2
  C. Data sub 3
  D. Data sub 4
--- NEW BLOCK ---
3. Data
  A. Data sub 1