parsing actual link from href using lxml

Using Jupyter Notebook, Python 3.

I'm downloading some files from the web so I can work with them in bulk locally. The files are listed on a web page, but inside href attributes. The code I found gives me the link text, not the actual link (even though my understanding was that the code was supposed to get the link).

Here’s what I have:

import os
import requests
from lxml import html
from lxml import etree
import urllib.request
import urllib.parse
...
web_string = requests.get(url).content
parsed_content = html.fromstring(web_string)
td_list = [e for e in parsed_content.iter() if e.tag == 'td']

directive_list = []
for td_e in td_list:
    txt = td_e.text_content()
    directive_list.append(txt)

It's a long web page with many entries that look like:
<a href="file1.pdf"> text1 </a>


This code returns text1, text2, etc., instead of file1.pdf, file2.pdf.

How can I extract the link?

>Solution :

You can modify your code to look specifically for a elements within each td and extract their href attribute.

import requests
from lxml import html

url = 'YOUR_URL_HERE'  # Replace with your URL
web_string = requests.get(url).content
parsed_content = html.fromstring(web_string)

# Find all 'a' elements inside 'td' elements
links = parsed_content.xpath('//td/a')

directive_list = []
for link in links:
    # Get the href attribute
    href = link.get('href')

    # You might want to join this with the base URL if they are relative links
    # href = urllib.parse.urljoin(url, href)

    directive_list.append(href)

# Print the list of links
print(directive_list)
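As the comment above notes, the hrefs may be relative paths like file1.pdf rather than full URLs. A minimal sketch of resolving them against the page URL with urllib.parse.urljoin (the base_url and hrefs below are made-up examples, not taken from the question):

```python
from urllib.parse import urljoin

# Hypothetical page URL and hrefs, for illustration only
base_url = 'https://example.com/docs/index.html'
hrefs = ['file1.pdf', '/file2.pdf', 'https://other.com/file3.pdf']

# urljoin resolves relative hrefs against the page URL
# and leaves already-absolute URLs untouched
absolute = [urljoin(base_url, h) for h in hrefs]
print(absolute)
# ['https://example.com/docs/file1.pdf',
#  'https://example.com/file2.pdf',
#  'https://other.com/file3.pdf']
```

Applying urljoin inside the loop (as the commented-out line suggests) gives you URLs you can pass straight to requests.get or urllib.request.urlretrieve for the bulk download.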