I have one domain that I would like to search extensively; all URLs are identical except for the ending, which I have compiled in a list.
The goal is to request each of these pages, identify the ones that return a ‘Page not found’ error, and print them to the console. The text I would search for is this:
<h1 class="center error-page-center" id="error-message">Oops! We couldn’t find what you wanted.</h1>
I have been trying to do this using BeautifulSoup, so far to no avail:
import psycopg2
import requests
from bs4 import BeautifulSoup

connection = psycopg2.connect("dbname=mydb")  # placeholder connection details
cursor = connection.cursor()
cursor.execute("select concat('https://mywebsite/', integervalue) as url from table")
pages = [row[0] for row in cursor.fetchall()]

for page in pages:
    soup = BeautifulSoup(requests.get(page).text, "html.parser")
    if soup.find_all(class_="center error-page-center"):
        print("Needs removal:", page)
Considering that there are several thousand pages I want to search, are there other approaches that would be more efficient?
My code is rough and will not run as-is; I am mainly looking for advice on how to search a list of URLs for that specific body of text.
Thank you
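For reference, the error banner shown above can be detected without BeautifulSoup at all, by scanning for the `<h1>` with `id="error-message"` using only the standard library. This is just a minimal sketch of that idea; the id value comes from the markup in the question.

```python
# Detect the 'Page not found' banner by its id attribute,
# using only the standard-library HTML parser.
from html.parser import HTMLParser

class ErrorBannerFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # The error page is identified by <h1 id="error-message">
        if tag == "h1" and dict(attrs).get("id") == "error-message":
            self.found = True

def is_error_page(html):
    finder = ErrorBannerFinder()
    finder.feed(html)
    return finder.found
```

Matching on the `id` is more robust than matching on the class list, since ids are meant to be unique per page.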
Solution:
import urllib.request
import urllib.error

for url in urls:
    try:
        urllib.request.urlopen(url)
    except urllib.error.HTTPError as e:
        # The request failed; report the HTTP status code
        print(url, e.code)
    else:
        print(url)  # successful URL
Python tool to check broken links on a big urls list
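Since several thousand URLs are involved and each request spends most of its time waiting on the network, checking them concurrently with a thread pool speeds things up considerably. A minimal sketch of that approach, assuming `urls` is the list built from the database (the worker count of 20 is an arbitrary choice):

```python
# Check many URLs concurrently with a thread pool; collect the broken ones.
import concurrent.futures
import urllib.request
import urllib.error

def check(url):
    """Return (url, True) if the request fails, (url, False) otherwise."""
    try:
        urllib.request.urlopen(url, timeout=10)
        return url, False
    except urllib.error.URLError:
        # HTTPError is a subclass of URLError, so 404s land here too
        return url, True

def find_broken(urls, workers=20):
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order while running checks in parallel
        return [url for url, broken in pool.map(check, urls) if broken]
```

Note this relies on the server actually returning a non-200 status for missing pages; if the site serves its error banner with a 200 status, you still need the HTML check from the question.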