How to scrape the URL?

I want to scrape the title, link, and date of news articles on a company’s site. I’m struggling with the link: each article has a "Read more" button, and the URL I need is in the <a> tag on the line below it. My current code only returns the button text, not the URL. I don’t know much HTML and haven’t found a solution in my searches so far. How do I access the href on the line under the Read more button?

My Code:

from bs4 import BeautifulSoup as soup
import requests
import pandas as pd

URL = 'https://ir.axcellatx.com/news-releases'

html_text = requests.get(URL).text
chickennoodle = soup(html_text, 'html.parser')

lists = chickennoodle.find_all("article", class_ = "clearfix node node--nir-news--nir-widget-list node--type-nir-news node--view-mode-nir-widget-list node--promoted")
for list in lists:
   title = list.find("div", class_ = "nir-widget--field nir-widget--news--headline").text
   link = list.find("div", class_ = "nir-widget--field nir-widget--news--read-more").text
   date = list.find("div", class_ = "nir-widget--field nir-widget--news--date-time").text
   info = [title, link, date]
   print(info)

HTML looks something like:

<article role="article" class="clearfix node node--nir-news--nir-widget-list node--type-nir-news node--view-mode-nir-widget-list node--promoted">
   <div class="nir-widget--field nir-widget--news--date-time"> Feb 15, 2023 </div>
   <div class="nir-widget--field nir-widget--news--headline">
       " Axcella Announces FDA IND Clearance Supporting Regulatory Path to Registration of AXA1125 for Long COVID Fatigue "
   </div>
   <div class="nir-widget--field nir-widget--news--read-more">
      <a href="/news-releases/news-release-details/axcella-announces-fda-ind-clearance-supporting-regulatory-path" hreflang="en">Read more</a>
   </div>
</article>

Solution:

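The .text attribute only returns the visible text ("Read more"). The URL is stored in the href attribute of the <a> tag inside the read-more <div>. Use the .a attribute on that <div> to get the first <a> tag inside it, then read its href value. The href is a relative path, so prepend the site's base URL: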

import requests
from bs4 import BeautifulSoup as soup

URL = 'https://ir.axcellatx.com/news-releases'

html_text = requests.get(URL).text
chickennoodle = soup(html_text, 'html.parser')

lists = chickennoodle.find_all("article", class_ = "clearfix node node--nir-news--nir-widget-list node--type-nir-news node--view-mode-nir-widget-list node--promoted")
for lst in lists:
   title = lst.find("div", class_ = "nir-widget--field nir-widget--news--headline").text.strip()
   link = 'https://ir.axcellatx.com' + lst.find("div", class_ = "nir-widget--field nir-widget--news--read-more").a['href']
   date = lst.find("div", class_ = "nir-widget--field nir-widget--news--date-time").text.strip()
   print(title)
   print(link)
   print(date)
   print()

Prints:

Axcella Announces FDA IND Clearance Supporting Regulatory Path to Registration of AXA1125 for Long COVID Fatigue
https://ir.axcellatx.com/news-releases/news-release-details/axcella-announces-fda-ind-clearance-supporting-regulatory-path
Feb 15, 2023

Axcella Therapeutics to Participate in the SVB Securities’ 2023 Global Biopharma Conference
https://ir.axcellatx.com/news-releases/news-release-details/axcella-therapeutics-participate-svb-securities-2023-global
Feb 09, 2023

Axcella Announces Regulatory Path to Registration of AXA1125 for Long COVID Fatigue
https://ir.axcellatx.com/news-releases/news-release-details/axcella-announces-regulatory-path-registration-axa1125-long
Jan 23, 2023
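
A slightly more defensive variant is sketched below. It is not part of the original answer. It assumes the page structure shown in the question, runs against an inline sample (SAMPLE_HTML, a shortened stand-in for the real page) so it works offline, and uses a hypothetical helper name, extract_articles. It matches on a single class per element with CSS selectors instead of the full class string, builds the absolute URL with urllib.parse.urljoin rather than string concatenation, and returns None for any field that is missing instead of raising an error.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://ir.axcellatx.com/news-releases"

# Sample markup modeled on the snippet in the question (headline shortened).
SAMPLE_HTML = """
<article role="article" class="clearfix node node--type-nir-news node--promoted">
   <div class="nir-widget--field nir-widget--news--date-time"> Feb 15, 2023 </div>
   <div class="nir-widget--field nir-widget--news--headline">
       Axcella Announces FDA IND Clearance
   </div>
   <div class="nir-widget--field nir-widget--news--read-more">
      <a href="/news-releases/news-release-details/axcella-announces-fda-ind-clearance-supporting-regulatory-path" hreflang="en">Read more</a>
   </div>
</article>
"""


def extract_articles(html, base_url):
    """Return one dict (title, link, date) per news <article> in html."""
    page = BeautifulSoup(html, "html.parser")
    rows = []
    # Matching on one class is less brittle than the full class string.
    for article in page.select("article.node--type-nir-news"):
        headline = article.select_one("div.nir-widget--news--headline")
        date = article.select_one("div.nir-widget--news--date-time")
        # Only match an <a> that actually carries an href attribute.
        anchor = article.select_one("div.nir-widget--news--read-more a[href]")
        rows.append({
            "title": headline.get_text(strip=True) if headline else None,
            # urljoin resolves the relative href against the page URL.
            "link": urljoin(base_url, anchor["href"]) if anchor else None,
            "date": date.get_text(strip=True) if date else None,
        })
    return rows


rows = extract_articles(SAMPLE_HTML, BASE_URL)
for row in rows:
    print(row)
```

To run it against the live site, pass requests.get(BASE_URL).text in place of SAMPLE_HTML. Because each row is a dict, the list can also be handed to pandas.DataFrame(rows) if a table is the end goal.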