Simplified problem:
There is an article (a long text).
Extract the content between start and end (both inclusive).
Requirement: there cannot be more than one \n between start and end.
Find all matches.
Use Python's re module only.
Code:
import re

lines = re.findall(pattern, text, re.DOTALL)
for line in lines:
    print(line)
    print('===')
So, how can I fix my pattern?
The pattern I tried:
start[^\n]*\n?[^\n]*end
with text:
...
start just me and python regex 1 end
start just me and python regex 2 end
start just me and python regex 3 end
...
wrong:
start just me and python regex 1 end
start just me and python regex 2 end --> should be a separate match from the line before
===
start just me and python regex 3 end
===
I also tried:
start(?:(?!\n\n).)*?end
and
start(?:[^\n]|\n(?!\n))*?end
with text:
start just
me and python
regex 1 end
start just me and python regex 2 end
start just me and python regex 3 end
wrong:
start just
me and python
regex 1 end --> should not match, because there are two `\n` in it
===
start just me and python regex 2 end
===
start just me and python regex 3 end
===
>Solution :
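You can use: start[^\n]*?\n?[^\n]*?end
Making both [^\n]* quantifiers lazy (*?) lets each match stop at the nearest end instead of running on to a later one, while the single optional \n? still limits a match to at most one newline.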
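To check the pattern against both sample texts from the question, a minimal sketch could look like the one below. The names PATTERN, find_blocks, text_single and text_multi are made up for illustration, and the "..." lines of the first sample are left out. re.DOTALL is omitted here on the assumption that it is not needed, since the pattern contains no dot.

```python
import re

# Pattern from the answer: lazy quantifiers, with at most one optional newline.
PATTERN = r"start[^\n]*?\n?[^\n]*?end"

# First sample from the question: three single-line blocks.
text_single = (
    "start just me and python regex 1 end\n"
    "start just me and python regex 2 end\n"
    "start just me and python regex 3 end\n"
)

# Second sample: the first block contains two newlines, so it should be skipped.
text_multi = (
    "start just\n"
    "me and python\n"
    "regex 1 end\n"
    "start just me and python regex 2 end\n"
    "start just me and python regex 3 end\n"
)


def find_blocks(text):
    """Return every start...end span that contains at most one newline."""
    return re.findall(PATTERN, text)


for sample in (text_single, text_multi):
    for block in find_blocks(sample):
        print(block)
        print("===")
```

With the first sample this should give three separate matches, and with the second sample only the "regex 2" and "regex 3" lines, because the first block would need two newlines to reach its end.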