I am scraping text data from a PDF using Python. The data I need follows a common pattern: it begins with a numerical pattern and ends with a string pattern. I need to capture all the text between them, including the patterns themselves, using a regular expression.
I have a regular expression that works when I convert the PDF to a .txt file and read the text in. When I use PyPDF2 to extract the text from the PDF pages, the regular expression fails.
The data stream looks like this:
Filed: 8/21/2022\nEntered: 10/21/2022\nDischarged: 01/23/2023\nClosed: 01/30/2023\n17-55018- \nQRTbk 7 Windows PC\n OS:xxx\nRole: AdminHubertson
The start point is the 17-55018- string, for which I have a regex that works:
[0-9]{2}-[0-9]{5}-
The end point is the Role: Admin string, which is unique enough to match on.
I have tried a number of capture methods using lookaheads to get the text I need. I tested these on regex101 and they work there, but I cannot get them to work in Python on the text extracted by PyPDF2.
Some patterns I have tried:
[0-9]{2}-[0-9]{5}-\s(\n(?!Role)(.*))*Role: Admin
[0-9]{2}-[0-9]{5}-\.(.*?)Role: Admin
[0-9]{2}-[0-9]{5}-.*(?=Role).*Role: Admin
>Solution :
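Try this one. In Python, . does not match a newline by default, so if the extracted text contains real newline characters, compile the pattern with the re.DOTALL flag: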
\d{2}\-\d{5}.*?Role:\sAdmin
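A minimal sketch of how this pattern might be applied in Python, assuming the extracted text contains real newline characters like the sample in the question. The variable names (text, pattern, matches, without_flag) are illustrative only. The second call shows what likely goes wrong without re.DOTALL: the lazy .*? cannot cross a line break, so nothing matches.

```python
import re

# Sample text modeled on the data stream in the question.
# Assumption: the "\n" sequences are real newline characters.
text = (
    "Filed: 8/21/2022\nEntered: 10/21/2022\nDischarged: 01/23/2023\n"
    "Closed: 01/30/2023\n17-55018- \nQRTbk 7 Windows PC\n OS:xxx\n"
    "Role: AdminHubertson"
)

# re.DOTALL makes "." match "\n" too, so the lazy ".*?" can span lines
# and stops at the first "Role: Admin" after the case-number prefix.
pattern = re.compile(r"\d{2}\-\d{5}.*?Role:\sAdmin", re.DOTALL)
matches = pattern.findall(text)
print(matches)

# Same pattern without the flag: "." stops at each newline, so no match.
without_flag = re.findall(r"\d{2}\-\d{5}.*?Role:\sAdmin", text)
print(without_flag)
```

With PyPDF2, the same compiled pattern would presumably be run against the string returned by each page's extract_text() call. Whitespace in that string can differ from a PDF-to-txt conversion, which may be why a pattern that relies on exact spacing or line breaks behaves differently.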