How to scrape the URL?

February 21, 2023

I want to scrape the title, link, and date of news articles on a company’s site. I’m struggling with the link: each article has a Read More button, and the URL I want is on the line just below it in the HTML. I don’t know much about HTML and haven’t found a solution in my searches so far. How do I access the line under the Read More button?

My Code:

from bs4 import BeautifulSoup as soup
import requests
import pandas as pd

URL = 'https://ir.axcellatx.com/news-releases'

html_text = requests.get(URL).text
chickennoodle = soup(html_text, 'html.parser')

lists = chickennoodle.find_all("article", class_ = "clearfix node node--nir-news--nir-widget-list node--type-nir-news node--view-mode-nir-widget-list node--promoted")
for list in lists:
   title = list.find("div", class_ = "nir-widget--field nir-widget--news--headline").text
   link = list.find("div", class_ = "nir-widget--field nir-widget--news--read-more").text
   date = list.find("div", class_ = "nir-widget--field nir-widget--news--date-time").text
   info = [title, link, date]
   print(info)

HTML looks something like:

<article role="article" class="clearfix node node--nir-news--nir-widget-list node--type-nir-news node--view-mode-nir-widget-list node--promoted">
   <div class="nir-widget--field nir-widget--news--date-time"> Feb 15, 2023 </div>
   <div class="nir-widget--field nir-widget--news--headline">
       " Axcella Announces FDA IND Clearance Supporting Regulatory Path to Registration of AXA1125 for Long COVID Fatigue "
   </div>
   <div class="nir-widget--field nir-widget--news--read-more">
      <a href="/news-releases/news-release-details/axcella-announces-fda-ind-clearance-supporting-regulatory-path" hreflang="en">Read more</a>
   </div>
</article>

Solution:

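Calling .text on the read-more div only returns the visible label ("Read more"). Use the .a attribute instead to get the first <a> tag inside that div, then read its href attribute. The href is a relative path, so prepend the site's base URL: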

import requests
from bs4 import BeautifulSoup as soup

URL = 'https://ir.axcellatx.com/news-releases'

html_text = requests.get(URL).text
chickennoodle = soup(html_text, 'html.parser')

# Each news item is wrapped in its own <article> element.
lists = chickennoodle.find_all("article", class_="clearfix node node--nir-news--nir-widget-list node--type-nir-news node--view-mode-nir-widget-list node--promoted")
for lst in lists:  # 'lst' avoids shadowing the built-in name 'list'
    title = lst.find("div", class_="nir-widget--field nir-widget--news--headline").text.strip()
    # .a is the first <a> inside the div; ['href'] reads its relative link.
    link = 'https://ir.axcellatx.com' + lst.find("div", class_="nir-widget--field nir-widget--news--read-more").a['href']
    date = lst.find("div", class_="nir-widget--field nir-widget--news--date-time").text.strip()
    print(title)
    print(link)
    print(date)
    print()

Prints:

Axcella Announces FDA IND Clearance Supporting Regulatory Path to Registration of AXA1125 for Long COVID Fatigue
https://ir.axcellatx.com/news-releases/news-release-details/axcella-announces-fda-ind-clearance-supporting-regulatory-path
Feb 15, 2023

Axcella Therapeutics to Participate in the SVB Securities’ 2023 Global Biopharma Conference
https://ir.axcellatx.com/news-releases/news-release-details/axcella-therapeutics-participate-svb-securities-2023-global
Feb 09, 2023

Axcella Announces Regulatory Path to Registration of AXA1125 for Long COVID Fatigue
https://ir.axcellatx.com/news-releases/news-release-details/axcella-announces-regulatory-path-registration-axa1125-long
Jan 23, 2023
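
As a variation on the same idea, here is a minimal sketch that runs against a static HTML string instead of the live site, so it can be tried offline. It assumes the markup matches the snippet in the question. The second article (with no Read More link), the SAMPLE_HTML constant, and the extract_articles helper are made up for illustration. It uses urljoin from the standard library to build the absolute URL rather than string concatenation, and returns None for any field that is missing instead of raising an error.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://ir.axcellatx.com/news-releases"

# Static sample modelled on the markup in the question.
# The second <article> is hypothetical and has no read-more link.
SAMPLE_HTML = """
<article class="clearfix node node--type-nir-news node--promoted">
  <div class="nir-widget--field nir-widget--news--date-time"> Feb 15, 2023 </div>
  <div class="nir-widget--field nir-widget--news--headline">
    Axcella Announces FDA IND Clearance Supporting Regulatory Path to Registration of AXA1125 for Long COVID Fatigue
  </div>
  <div class="nir-widget--field nir-widget--news--read-more">
    <a href="/news-releases/news-release-details/axcella-announces-fda-ind-clearance-supporting-regulatory-path" hreflang="en">Read more</a>
  </div>
</article>
<article class="clearfix node node--type-nir-news node--promoted">
  <div class="nir-widget--field nir-widget--news--date-time"> Jan 01, 2023 </div>
  <div class="nir-widget--field nir-widget--news--headline"> Item without a link </div>
</article>
"""


def extract_articles(html, base_url):
    """Return one dict (title, link, date) per news <article> in the HTML."""
    page = BeautifulSoup(html, "html.parser")
    rows = []
    # Matching on a single class is less brittle than the full class string.
    for article in page.select("article.node--type-nir-news"):
        headline = article.select_one("div.nir-widget--news--headline")
        date = article.select_one("div.nir-widget--news--date-time")
        # The <a> nested inside the read-more div carries the relative URL.
        anchor = article.select_one("div.nir-widget--news--read-more a[href]")
        rows.append({
            "title": headline.get_text(strip=True) if headline else None,
            # urljoin resolves the relative href against the page URL.
            "link": urljoin(base_url, anchor["href"]) if anchor else None,
            "date": date.get_text(strip=True) if date else None,
        })
    return rows


articles = extract_articles(SAMPLE_HTML, BASE_URL)
for row in articles:
    print(row)
```

To run this against the real page, pass requests.get(URL).text as the html argument. The resulting list of dicts can also be handed to pandas.DataFrame if a table is wanted.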