Using Jupyter Notebook, Python 3.
I'm downloading some files from the web so I can work with them in bulk locally. The files are listed on a webpage, each inside an href. The code I found gives me the link text, but not the actual link (even though my understanding was that the code was supposed to get the link).
Here’s what I have:
import os
import requests
from lxml import html
from lxml import etree
import urllib.request
import urllib.parse
...
web_string = requests.get(url).content
parsed_content = html.fromstring(web_string)
td_list = [e for e in parsed_content.iter() if e.tag == 'td']
directive_list = []
for td_e in td_list:
    txt = td_e.text_content()
    directive_list.append(txt)
This is a long web page with a bunch of entries that look like:
<a href="file1.pdf"> text1 </a>
This code returns text1, text2, etc. instead of file1.pdf, file2.pdf.
How can I extract the link?
>Solution :
You can modify your code to look specifically for a elements within each td and extract their href attribute with Element.get():
import requests
import urllib.parse
from lxml import html
url = 'YOUR_URL_HERE' # Replace with your URL
web_string = requests.get(url).content
parsed_content = html.fromstring(web_string)
# Find all 'a' elements inside 'td' elements
links = parsed_content.xpath('//td/a')
directive_list = []
for link in links:
    # Get the href attribute
    href = link.get('href')
    # You might want to join this with the base URL if the links are relative
    # href = urllib.parse.urljoin(url, href)
    directive_list.append(href)
# Print the list of links
print(directive_list)
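Since the hrefs in your page look relative (file1.pdf rather than a full URL), you will likely need urljoin before you can download them. A minimal sketch, using a hypothetical base URL and filenames just for illustration:

```python
from urllib.parse import urljoin

# Hypothetical page URL and relative hrefs, standing in for your real data
base_url = 'https://example.com/docs/'
relative_hrefs = ['file1.pdf', 'file2.pdf']

# urljoin resolves each relative href against the page URL,
# producing absolute links suitable for requests.get() downloads
absolute_links = [urljoin(base_url, href) for href in relative_hrefs]
print(absolute_links)
# → ['https://example.com/docs/file1.pdf', 'https://example.com/docs/file2.pdf']
```

Note that urljoin also handles hrefs that are already absolute: they pass through unchanged, so it is safe to apply to every extracted link.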