From a list of URLs, search each page and return results with matching page type

I have one domain on which I would like to do an extensive search. All URLs are identical except for the ending, which I have compiled in a list.

The goal is to search each of these pages, identify those that return a ‘Page not found’ error, and print them to the console. My search target would be this:

<h1 class="center error-page-center" id="error-message">Oops! We couldn’t find what you wanted.</h1>

I have been trying to do this using BeautifulSoup, so far to no avail.
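For the "find this error heading in a page" step on its own, one workable approach is to match on the heading's `id`, which is unique on the page and a more precise hook than the multi-valued class attribute. A minimal sketch, with hypothetical page bodies standing in for fetched HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical page bodies; in real use these would come from an HTTP fetch
ok_html = "<h1>Welcome back</h1>"
missing_html = ('<h1 class="center error-page-center" id="error-message">'
                "Oops! We couldn’t find what you wanted.</h1>")

def is_not_found(html):
    soup = BeautifulSoup(html, "html.parser")
    # Match on the unique id rather than the space-separated class list
    return soup.find(id="error-message") is not None

print(is_not_found(ok_html), is_not_found(missing_html))  # False True
```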


import psycopg2
import requests
from bs4 import BeautifulSoup

# Build the URL list from the database once, outside the loop
cursor = connection.cursor()
cursor.execute("select concat('https://mywebsite/', integervalue) as url from table")
urls = [row[0] for row in cursor.fetchall()]

for url in urls:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    if soup.find_all(class_="center error-page-center"):
        print("Needs removal:", url)

Considering that there are several thousand pages I want to search, are there other approaches that would be more efficient?
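For several thousand pages, two things usually help: fetch the pages concurrently rather than one at a time, and test the body with a cheap substring check instead of parsing every page. A minimal sketch of that pattern, using a stand-in fetcher and hypothetical URLs so it runs offline (in real use, `fetch` would be a urllib- or requests-based function returning the page body):

```python
from concurrent.futures import ThreadPoolExecutor

ERROR_MARKER = 'id="error-message"'  # unique snippet from the error-page markup

def is_error_page(html):
    # A plain substring test is far cheaper than a full parse
    return ERROR_MARKER in html

# Stand-in pages so the sketch runs without a network
fake_pages = {
    "https://mywebsite/1": "<h1>Welcome</h1>",
    "https://mywebsite/2": '<h1 class="center error-page-center" '
                           'id="error-message">Oops!</h1>',
}

def fetch(url):
    return fake_pages[url]

with ThreadPoolExecutor(max_workers=20) as pool:
    flags = pool.map(lambda url: (url, is_error_page(fetch(url))), fake_pages)
    broken = [url for url, bad in flags if bad]

print(broken)  # ['https://mywebsite/2']
```

Threads work well here because the task is I/O-bound; `max_workers` should stay modest to avoid hammering the server.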

My code results in an error; I am mainly looking for advice on how to search a list of URLs for that specific body of text.
Thank you

>Solution :

import urllib.request
import urllib.error

for url in urls:
    try:
        urllib.request.urlopen(url)
    except urllib.error.HTTPError as e:
        # The request failed; a 404 here means the page does not exist
        print(e.code, url)
    else:
        print(url)  # successful URL
                    # write your code.
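One caveat with relying on `HTTPError` alone: some sites serve their "Oops" page with a 200 status, in which case the body has to be inspected as well. A minimal sketch of a combined check (the `error-message` id is taken from the markup in the question; the status and body would come from the actual response):

```python
ERROR_MARKER = 'id="error-message"'  # from the site's error heading

def page_is_missing(status, body):
    # Hard 404 from the server, or a "soft 404": 200 with the error markup
    return status == 404 or ERROR_MARKER in body

print(page_is_missing(404, ""))                                   # True
print(page_is_missing(200, '<h1 id="error-message">Oops!</h1>'))  # True
print(page_is_missing(200, "<h1>Welcome</h1>"))                   # False
```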

Python tool to check broken links on a big list of URLs
