Extracting a RegEx pattern from every item in a list while excluding the other HTML

I’ve written a script to pull the list of report URL extensions available on a page, so that the reports can be used for text extraction.

I parse the page with BeautifulSoup and extract the area that links to the reports like this:

home = BeautifulSoup(home_url, 'html.parser')
container = home.find('div', attrs={'class': 'list'})
report_url_locations = list(x for x in container.findAll('a'))

This generates a list of the reports, each with its unique HTML extension. The list is updated each time a new report is uploaded, for example:

[<a href="2022-05/13/c_76843.htm">May 16: Daily report</a>,
 <a href="2022-05/12/c_76842.htm">May 15: Daily report</a>,
 <a href="2022-05/11/c_76841.htm">May 14: Daily report</a>,
 <a href="2022-05/10/c_76839.htm">May 13: Daily report</a>]

I’ve managed to write some code that strips out the HTML junk and keeps just the extension for the first element (i.e. the first report).

latest_sitrep_location = str(report_url_locations[0])
latest_sitrep_htm_location = re.search(r"[0-9]+-[0-9]+/[0-9]+/+c_[0-9]+.+htm",latest_sitrep_location)

This gives me:

"2022-05/13/c_76843.htm"

But when I try to do this for every element of the list, it returns all the junk in between as well:

all_urls= re.findall(r"[0-9]+-[0-9]+/[0-9]+/+c_[0-9]+.+htm", str(report_url_locations))
all_urls

['2022-05/13/c_76843.htm">May 16: Daily Report</a>, <a href="2022-05/12/c_76842.htm">May 15: Daily Report</a>, <a href="2022-05/11/c_76841.htm">May 14: Daily Report</a>, <a href="2022-05/10/c_76839.htm">May 13: Daily Report</a>]

But what I want is:

["2022-05/13/c_76843.htm","2022-05/12/c_76842.htm","2022-05/11/c_76841.htm","2022-05/10/c_76839.htm"]

Can somebody tell me what I need to include in my RegEx to ensure the other HTML is excluded? I’m fairly sure I need to convert every element in report_url_locations to a string, but I don’t know how to do this en masse.

>Solution :

Why don’t you just try this:

report_url_locations = [x["href"] for x in container.findAll('a')]

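Then just print report_url_locations. Each element returned by findAll is a Tag, so x["href"] gives you the attribute value directly, with no string conversion or regex needed.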
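To make that concrete, here is a minimal sketch of the same idea run end to end. The sample_html string is a hypothetical stand-in built from the links shown in the question (the real page markup may differ), and find_all with href=True is a small defensive variation on the one-liner above, not something the original code requires.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
sample_html = """
<div class="list">
  <a href="2022-05/13/c_76843.htm">May 16: Daily report</a>
  <a href="2022-05/12/c_76842.htm">May 15: Daily report</a>
  <a href="2022-05/11/c_76841.htm">May 14: Daily report</a>
  <a href="2022-05/10/c_76839.htm">May 13: Daily report</a>
</div>
"""

home = BeautifulSoup(sample_html, "html.parser")
container = home.find("div", attrs={"class": "list"})

# href=True skips any <a> tag that has no href attribute,
# so indexing with ["href"] cannot raise a KeyError.
report_url_locations = [a["href"] for a in container.find_all("a", href=True)]

print(report_url_locations)
```

The result is already a list of plain strings, which is the shape asked for in the question.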

By the way, regex is generally the wrong tool for parsing HTML; it is more reliable to let the parser hand you the attributes.
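If you still want a regex, the likely reason the original pattern over-matched is that `.+` is greedy and the dot before `htm` is unescaped, so on the stringified list it runs on to the last `htm` in the text. The sketch below escapes the dot and drops the `.+`. The html_list string is an assumed reconstruction of str(report_url_locations) from the question, and the tightened pattern is only one possible fix.

```python
import re

# Assumed reconstruction of str(report_url_locations) from the question.
html_list = (
    '[<a href="2022-05/13/c_76843.htm">May 16: Daily report</a>, '
    '<a href="2022-05/12/c_76842.htm">May 15: Daily report</a>, '
    '<a href="2022-05/11/c_76841.htm">May 14: Daily report</a>, '
    '<a href="2022-05/10/c_76839.htm">May 13: Daily report</a>]'
)

# Original pattern: the greedy ".+" swallows everything up to the last "htm".
greedy_pattern = r"[0-9]+-[0-9]+/[0-9]+/+c_[0-9]+.+htm"
greedy_matches = re.findall(greedy_pattern, html_list)

# Tightened pattern: a literal "\." directly before "htm", no ".+".
strict_pattern = r"[0-9]+-[0-9]+/[0-9]+/c_[0-9]+\.htm"
all_urls = re.findall(strict_pattern, html_list)

print(len(greedy_matches))  # a single over-long match
print(all_urls)
```

This works only as long as every link follows the same date/c_number.htm shape, which is one more reason to prefer reading the href attribute from the parsed tags.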
