Split a block of text only where a line starts with a number

I have blocks of text in Excel files. Some cells contain numbered points, and others are just plain paragraphs. For the numbered text, I want to split it into individual items.

Example Text:

text = """1. Line 1 of the text
2. Line 2 of the text
   subline of line 2 of the text
3. Line 3 of the text
   - sublineA of line 3 text
   - sublineB of line 3 text"""

My code: text.split("\n")

But this splits on every line and gives me ['1. Line 1 of the text', '2. Line 2 of the text', ' subline of line 2 of the text', '3. Line 3 of the text', ' - sublineA of line 3 text', ' - sublineB of line 3 text']

But I need ['1. Line 1 of the text', '2. Line 2 of the text \nsubline of line 2 of the text', '3. Line 3 of the text \n- sublineA of line 3 text \n- sublineB of line 3 text']. Basically, the text should be split only where a line starts with a number.

>Solution :

I think that this problem can be easily solved using regex. Here’s the code:

import re

lines = re.split(r"\n(?=[0-9])", text)

First, the re.split function splits a string at every match of the pattern. The matched text itself is not included in the resulting pieces.

The pattern starts with \n, a newline character. Then, we have (?=, the start of a lookahead group. A lookahead is a part of the pattern that must match immediately after the current position, but isn’t included in the match itself. We don’t want the number to be part of the match, because re.split discards whatever it matches, and the numbers would then be missing from the resulting lines.
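To make the effect of the lookahead concrete, here is a small comparison. It is only a sketch: the shortened sample string and the variable names are made up for illustration and are not part of the question.

```python
import re

# Shortened, made-up sample in the same shape as the question's text.
sample = "1. first\n2. second\n   detail\n3. third"

# Without a lookahead, the digit is part of the separator and gets discarded.
consumed = re.split(r"\n[0-9]", sample)

# With a lookahead, only the newline is consumed; the digit stays in the item.
kept = re.split(r"\n(?=[0-9])", sample)

print(consumed)  # ['1. first', '. second\n   detail', '. third']
print(kept)      # ['1. first', '2. second\n   detail', '3. third']
```

The indented "detail" line stays attached to item 2 in both cases, because the newline before it is not followed by a digit.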

Inside the lookahead, we have [0-9]. This means any character from zero to nine, thus any digit. Finally, there is a closing parenthesis to end the lookahead.
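Putting it together on the text from the question, a possible variant is sketched below. It assumes the numbering always looks like digits followed by a period and whitespace ("1. ", "12. "), which is stricter than the [0-9] check above; the helper name split_numbered and the NUMBERED constant are hypothetical. The stricter pattern avoids splitting a plain paragraph at a line that merely happens to begin with a digit.

```python
import re

text = """1. Line 1 of the text
2. Line 2 of the text
   subline of line 2 of the text
3. Line 3 of the text
   - sublineA of line 3 text
   - sublineB of line 3 text"""

# Assumption: an item marker is one or more digits, a period, then whitespace.
NUMBERED = re.compile(r"\n(?=\d+\.\s)")

def split_numbered(cell_text):
    """Hypothetical helper: split only before lines that start with 'N. '."""
    return NUMBERED.split(cell_text)

items = split_numbered(text)
for item in items:
    print(repr(item))

# A plain paragraph comes back as a single-element list.
paragraph = split_numbered("Just a paragraph.\n2020 was a long year.")
print(paragraph)
```

Sublines keep their original indentation inside each item; if that is unwanted, the whitespace can be stripped per line afterwards.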
