Extracting hrefs or specific tag from a page

I have tried numerous approaches, but this website is proving very hard to scrape with bs4.

I am trying to extract the href value found on one of the matches. The idea is to extract all hrefs from the page into a list. At the moment I am not getting any values back; the ideal result is a list containing all hrefs, e.g. //www.premierleague.com/match/74911


import warnings
import numpy as np
from datetime import datetime
import requests
from bs4 import BeautifulSoup

warnings.filterwarnings('ignore')

# Set up an empty list for storage. errors collects any matches that fail to scrape.
dataframe = []
errors = []

url = "https://www.premierleague.com/results"

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

matches = {}

soup.find_all("div", {"class": "competitionContainer"})

Solution:

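The data you see on the page is loaded from an external source via JavaScript, so it is not in the HTML that requests downloads and BeautifulSoup has nothing to find. To see where it comes from, open the Web developer tools in your browser, switch to the Network tab, and scroll down the page; the Ajax request should appear there. You can call that API directly: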

import json
import requests

api_url = "https://footballapi.pulselive.com/football/fixtures"

params = {
    "comps": "1",
    "compSeasons": "489",
    "teams": "127,1,2,130,131,4,6,7,34,9,26,10,11,12,23,15,20,21,25,38",
    "page": "1",
    "pageSize": "40",
    "sort": "desc",
    "statuses": "C",
    "altIds": "true",
}

headers = {
    'Origin': 'https://www.premierleague.com',
}

page = 0  # the loop starts at 0, so pages are treated as zero-indexed
while True:
    params['page'] = page
    data = requests.get(api_url, params=params, headers=headers).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    for c in data['content']:
        team1, team2 = c['teams'][0]['team']['name'], c['teams'][1]['team']['name']
        print(f'{team1:<30} {team2:<30} https://www.premierleague.com/match/{int(c["id"])}')

    # stop once the last page (numPages - 1) has been processed,
    # instead of requesting empty pages past the end
    if page >= data['pageInfo']['numPages'] - 1:
        break

    page += 1

Prints:


...

Chelsea                        Tottenham Hotspur              https://www.premierleague.com/match/74925
Nottingham Forest              West Ham United                https://www.premierleague.com/match/74928
Brentford                      Manchester United              https://www.premierleague.com/match/74923
Arsenal                        Leicester City                 https://www.premierleague.com/match/74921
Brighton & Hove Albion         Newcastle United               https://www.premierleague.com/match/74924
Manchester City                Bournemouth                    https://www.premierleague.com/match/74927
Southampton                    Leeds United                   https://www.premierleague.com/match/74929
Wolverhampton Wanderers        Fulham                         https://www.premierleague.com/match/74930
Aston Villa                    Everton                        https://www.premierleague.com/match/74922
West Ham United                Manchester City                https://www.premierleague.com/match/74920
Leicester City                 Brentford                      https://www.premierleague.com/match/74916
Manchester United              Brighton & Hove Albion         https://www.premierleague.com/match/74919
Everton                        Chelsea                        https://www.premierleague.com/match/74913
Bournemouth                    Aston Villa                    https://www.premierleague.com/match/74912
Leeds United                   Wolverhampton Wanderers        https://www.premierleague.com/match/74915
Newcastle United               Nottingham Forest              https://www.premierleague.com/match/74917
Tottenham Hotspur              Southampton                    https://www.premierleague.com/match/74918
Fulham                         Liverpool                      https://www.premierleague.com/match/74914
Crystal Palace                 Arsenal                        https://www.premierleague.com/match/74911
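
Since the question asks for a list of hrefs rather than printed rows, here is a minimal sketch of how the same loop could collect them. It assumes the response shape used in the answer (a `content` list whose items carry an `id`, plus `pageInfo.numPages`) and zero-indexed pages. The names `match_urls`, `collect_all`, and `fetch_page` are hypothetical helpers introduced for illustration, and the sample payload is made up so the block runs without network access.

```python
from typing import Callable, Dict, List

BASE_URL = "https://www.premierleague.com/match/"


def match_urls(payload: Dict) -> List[str]:
    """Build one match URL per entry in a single page of API results."""
    return [f"{BASE_URL}{int(item['id'])}" for item in payload.get("content", [])]


def collect_all(fetch_page: Callable[[int], Dict]) -> List[str]:
    """Walk pages 0 .. numPages - 1 and gather every match URL.

    fetch_page is a hypothetical callable that returns the decoded JSON
    for a given page number, for example a thin wrapper around requests.get.
    """
    urls: List[str] = []
    page = 0
    while True:
        payload = fetch_page(page)
        urls.extend(match_urls(payload))
        # Assumption: pages are zero-indexed, so the last one is numPages - 1.
        if page >= payload["pageInfo"]["numPages"] - 1:
            break
        page += 1
    return urls


# Offline demonstration with a made-up two-page payload (not real API output).
_fake_pages = [
    {"content": [{"id": 74911.0}, {"id": 74914.0}], "pageInfo": {"numPages": 2}},
    {"content": [{"id": 74918.0}], "pageInfo": {"numPages": 2}},
]
hrefs = collect_all(lambda page: _fake_pages[page])
print(hrefs)
```

Against the real endpoint, `fetch_page` could be something like `lambda page: requests.get(api_url, params={**params, "page": page}, headers=headers).json()`, reusing `api_url`, `params`, and `headers` from the answer. Keeping the HTTP call outside `collect_all` makes the pagination logic easy to check without hitting the API.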