Python – Get a list of URLs from a complicated HTML file for scraping purposes

I am new to web scraping and cannot get the list of URLs in the 'a' tags from this website: http://www.tauntondevelopment.org//msip/JHRindex.htm. All I get is an empty list: clients list: []
Thank you for your help!

Here is my code:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

# This is the url of one major industrial park that we will be scraping
park_url = "http://www.tauntondevelopment.org//msip/JHRindex.htm"

uPark = uReq(park_url)
park_html = uPark.read()
uPark.close()

park_soup = soup(park_html, "html.parser")

filename = "ParkText.html"
f = open(filename, "w") 
f.write(park_soup.prettify())
f.close()

# get a list of the urls of park_url    
clients_list = []
for link in park_soup.findAll('li'):
    clients_list.append(link.get('href'))

print("clients list:", clients_list)

# write clients to a file 
filename = "taunton_JHR.csv"

f = open(filename, "w")
headers = "Name, Email, Address\n"
f.write(headers)
 
for client_url in clients_list:
    # call the function to scrape the individual park data
    client_url = "http://www.tauntondevelopment.org/msip/" + client_url
    try: 
        uClient = uReq(client_url)
    except:
        print("Error: Unable to open url")
        continue # continue to the next client_url in the list

    client_name, client_email, client_address = scrapeIndPark(uClient)
    
    f.write(client_name  + "," + client_email + "," + client_address + "\n")
    
f.close()


Solution:

Did you look at the HTML you are actually downloading?

<html>
 <head>
  <title>
   John Hancock Road
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
 </head>
 <frameset bordercolor="#E0E0E0" cols="25%,591*">
  <frame name="index" src="JHRleft.htm" target="content"/>
  <frame name="content" src="JHRright.htm"/>
 </frameset>
 <noframes>
  <body bgcolor="#FFFFFF">
  </body>
 </noframes>
 <frameset>
 </frameset>
</html>

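Notice that (at least in my case) the body is empty! That is because the page is built with frames: the index page only declares the frames, and the browser loads each frame's content with a separate request. To find a frame's real URL, open the page in your browser, open the developer tools, go to the Network tab, and see which URLs the follow-up requests (the ones that fill the frames with data) are sent to. The same file names appear in the src attributes of the frame tags above. In this case the URL you are looking for is probably http://www.tauntondevelopment.org//msip/JHRleft.htm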
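
As a rough sketch of that idea, the frame URLs can also be pulled out of the index page in code instead of by hand. The example below uses only the standard library (html.parser and urllib.parse) so it runs without extra installs; the same thing can be done with BeautifulSoup by searching for 'frame' tags and reading their 'src'. The names AttrCollector, collect, and fetch are made up for this illustration, and sample_left_html is invented stand-in markup, since the real contents of JHRleft.htm are not shown here. Note also that href is an attribute of 'a' tags, not 'li' tags, so the sketch collects links from 'a'.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class AttrCollector(HTMLParser):
    """Collect one attribute's value from every occurrence of one tag."""

    def __init__(self, tag, attr):
        super().__init__()
        self.tag = tag
        self.attr = attr
        self.values = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <frame ... />.
        if tag == self.tag:
            value = dict(attrs).get(self.attr)
            if value:
                self.values.append(value)


def collect(html, tag, attr, base_url):
    """Return the absolute URLs found in `attr` of every `tag` in `html`."""
    parser = AttrCollector(tag, attr)
    parser.feed(html)
    parser.close()
    return [urljoin(base_url, value) for value in parser.values]


def fetch(url):
    """Download a page as text (hypothetical helper, not called below)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


park_url = "http://www.tauntondevelopment.org//msip/JHRindex.htm"

# The frameset page shown above, trimmed to the relevant part.
frameset_html = """
<html>
 <frameset bordercolor="#E0E0E0" cols="25%,591*">
  <frame name="index" src="JHRleft.htm" target="content"/>
  <frame name="content" src="JHRright.htm"/>
 </frameset>
</html>
"""

# Step 1: find the pages that actually hold the content.
frame_urls = collect(frameset_html, "frame", "src", park_url)

# Step 2: collect the links from a frame page.
# ASSUMPTION: this markup is invented for the example; in real use it
# would be fetch(frame_urls[0]).
sample_left_html = '<ul><li><a href="client1.htm" target="content">Client One</a></li></ul>'
clients_list = collect(sample_left_html, "a", "href", frame_urls[0])

print("frame urls:", frame_urls)
print("clients list:", clients_list)
```

Because collect() already resolves each link against the page it came from, the later loop would not need to prepend "http://www.tauntondevelopment.org/msip/" to every client_url.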
