Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Regular expression for capturing all text starting at one pattern and ending at another

I am scraping text data off a pdf using python. There is a common pattern that contains the data I need that begins with a numerical pattern and ends with a string pattern. I need to capture all the text, including the patterns using a regular expression.

I have a regular expression that works when I import the data by going pdf to txt and reading the text in. When I use PyPDF2 to extract the text from the pdf pages, the regular expression fails.

The data stream looks like this

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Filed: 8/21/2022\nEntered:  10/21/2022\nDischarged:  01/23/2023\nClosed: 01/30/2023\n17-55018-   \nQRTbk 7 Windows PC\n OS:xxx\nRole: AdminHubertson

The start point is the 17-55018- string which I have a regex that works:

[0-9]{2}-[0-9]{5}-

The end point is the Role: Admin which is unique enough to compile.

I have tried a number of capture methods using lookaheads to get the text I need. These methods I have tested on regex101 and they work but I cannot get them to work

Some patterns I have tried:

[0-9]{2}-[0-9]{5}-\s(\n(?!Role)(.*))*Role: Admin
[0-9]{2}-[0-9]{5}-\.(.*?)Role: Admin
[0-9]{2}-[0-9]{5}-.*(?=Role).*Role: Admin

>Solution :

Try this one:

\d{2}\-\d{5}.*?Role:\sAdmin
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading