Extracting RegEx pattern across list excluding other html code

May 16, 2022

I’ve written a script that pulls a list of report URL extensions from a page so that the reports can be used for text extraction.

I used BeautifulSoup to parse the page and extract the area that holds the report links, like this:

home = BeautifulSoup(home_url, 'html.parser')
container = home.find('div', attrs={'class': 'list'})
report_url_locations = list(x for x in container.findAll('a'))

This generates a list of the reports, each with its unique HTML extension. The list is updated each time a new report is uploaded, for example:

[<a href="2022-05/13/c_76843.htm">May 16: Daily report</a>,
 <a href="2022-05/12/c_76842.htm">May 15: Daily report</a>,
 <a href="2022-05/11/c_76841.htm">May 14: Daily report</a>,
 <a href="2022-05/10/c_76839.htm">May 13: Daily report</a>]

I’ve managed to write some code to strip out the HTML junk and keep just the extension for the first element (i.e. the first report).

latest_sitrep_location = str(report_url_locations[0])
# re.search returns a Match object; .group() extracts the matched text
latest_sitrep_htm_location = re.search(r"[0-9]+-[0-9]+/[0-9]+/+c_[0-9]+.+htm", latest_sitrep_location).group()

This gives me:

"2022-05/13/c_76843.htm"

But when I try to do this for every element of the list, I get a single match that includes all the junk in between:

all_urls = re.findall(r"[0-9]+-[0-9]+/[0-9]+/+c_[0-9]+.+htm", str(report_url_locations))
all_urls

['2022-05/13/c_76843.htm">May 16: Daily report</a>, <a href="2022-05/12/c_76842.htm">May 15: Daily report</a>, <a href="2022-05/11/c_76841.htm">May 14: Daily report</a>, <a href="2022-05/10/c_76839.htm']

But what I want is:

["2022-05/13/c_76843.htm","2022-05/12/c_76842.htm","2022-05/11/c_76841.htm","2022-05/10/c_76839.htm"]

Can somebody tell me what I need to include in my RegEx to ensure the other HTML is excluded? I’m fairly sure I need to convert every element in report_url_locations to a string, but I don’t know how to do this en masse.

>Solution :

You don’t need a regex here. Read the href attribute of each tag directly:

report_url_locations = [x["href"] for x in container.findAll('a')]

Then just print report_url_locations.
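As a minimal, self-contained sketch of that approach: the sample_html string below is a hypothetical stand-in for the real page (it is not from the original post), and find_all with href=True is used here as an assumption about what is wanted, so that any <a> tag without an href is skipped rather than raising a KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
sample_html = """
<div class="list">
  <a href="2022-05/13/c_76843.htm">May 16: Daily report</a>
  <a href="2022-05/12/c_76842.htm">May 15: Daily report</a>
  <a>No link here</a>
</div>
"""

home = BeautifulSoup(sample_html, "html.parser")
container = home.find("div", attrs={"class": "list"})

# href=True keeps only <a> tags that actually carry an href attribute.
report_url_locations = [a["href"] for a in container.find_all("a", href=True)]
print(report_url_locations)
```

Each element of the resulting list is already a plain string, so no en-masse string conversion is needed.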

By the way, you generally shouldn’t use regex to parse HTML; an HTML parser such as BeautifulSoup already gives you the attributes directly.
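That said, if a regex is still wanted for this narrow pattern, the junk most likely comes from the unescaped, greedy .+ before htm: it matches any run of characters and stretches to the last "htm" in the whole string. A rough sketch of the difference, using a shortened two-link string as a stand-in for str(report_url_locations); the names html_text, greedy and strict are illustrative only:

```python
import re

# Shortened stand-in for str(report_url_locations).
html_text = (
    '[<a href="2022-05/13/c_76843.htm">May 16: Daily report</a>, '
    '<a href="2022-05/12/c_76842.htm">May 15: Daily report</a>]'
)

# Original pattern: ".+" is greedy, so one match runs up to the last "htm".
greedy = re.findall(r"[0-9]+-[0-9]+/[0-9]+/+c_[0-9]+.+htm", html_text)

# Escaping the dot ("\.") makes it match only a literal "." before "htm".
strict = re.findall(r"[0-9]+-[0-9]+/[0-9]+/c_[0-9]+\.htm", html_text)

print(len(greedy))  # a single, over-long match
print(strict)       # one entry per link
```

With the escaped dot, each match stops at the end of its own .htm extension, so findall returns one string per report.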

by MR

Published May 16, 2022
