Using Jupyter Notebook, Python 3.
I'm downloading some files from the web so I can work with them in bulk locally. The files are listed on a webpage, each inside an href. The code I found gives me the link text, but not the actual link (even though my understanding was that the code was supposed to get the link).
Here’s what I have:
import os
import requests
from lxml import html
from lxml import etree
import urllib.request
import urllib.parse
...
web_string = requests.get(url).content
parsed_content = html.fromstring(web_string)
td_list = [e for e in parsed_content.iter() if e.tag == 'td']
directive_list = []
for td_e in td_list:
    txt = td_e.text_content()
    directive_list.append(txt)
This is a long web page with a bunch of entries that look like:
<a href="file1.pdf"> text1 </a>
This code returns text1, text2, etc. instead of file1.pdf, file2.pdf.
How can I extract the link?
>Solution :
You can modify your code to look specifically for a elements within each td and extract their href attribute with Element.get():
import requests
import urllib.parse
from lxml import html
url = 'YOUR_URL_HERE' # Replace with your URL
web_string = requests.get(url).content
parsed_content = html.fromstring(web_string)
# Find all 'a' elements inside 'td' elements
links = parsed_content.xpath('//td/a')
directive_list = []
for link in links:
    # Get the href attribute
    href = link.get('href')
    # You might want to join this with the base URL if the links are relative
    # href = urllib.parse.urljoin(url, href)
    directive_list.append(href)
# Print the list of links
print(directive_list)
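Since the hrefs in your page look relative (file1.pdf rather than a full URL), you will likely need urljoin before you can download them. A minimal sketch, using a hypothetical base URL and filenames just for illustration:

```python
from urllib.parse import urljoin

# Hypothetical page URL and relative hrefs, standing in for your real data
base_url = 'https://example.com/docs/'
relative_hrefs = ['file1.pdf', 'file2.pdf']

# urljoin resolves each relative href against the page URL,
# producing absolute links suitable for requests.get() downloads
absolute_links = [urljoin(base_url, href) for href in relative_hrefs]
print(absolute_links)
# → ['https://example.com/docs/file1.pdf', 'https://example.com/docs/file2.pdf']
```

Note that urljoin also handles hrefs that are already absolute: they pass through unchanged, so it is safe to apply to every extracted link.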