How to webscrape multiple tables on same website

I am trying to scrape multiple tables from this website.

This is my code:

import pandas as pd
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def scrape_ranking(url, sheet_name):
    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        soup = BeautifulSoup(page.content(), "html.parser")
        browser.close()

    table = soup.select_one(".table_bd")
    print("done step 1")

    if table is None:
        print("Table not found.")
    else:
        df = pd.read_html(str(table))[0]
        print(df)
        with pd.ExcelWriter("jockeyclub.xlsx", engine="openpyxl",
                            mode="a", if_sheet_exists="overlay") as writer:
            df.to_excel(writer, sheet_name=sheet_name, index=True, startrow=70)

url_trainer = "https://racing.hkjc.com/racing/information/english/racing/Draw.aspx#race1.aspx"
scrape_ranking(url_trainer, "Race Card 1")

This code is able to print the table for Race Card 1. However, when I change the line to df = pd.read_html(str(table))[1] or df = pd.read_html(str(table))[2], it cannot find any other tables on the website.
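For reference, the indexing fails because soup.select_one(".table_bd") returns only the first matching element, so str(table) contains a single table and pd.read_html on it yields a one-element list; [1] then raises IndexError. A minimal sketch on stand-in HTML (the two tables below are hypothetical placeholders for the page's .table_bd tables) shows the difference between select_one and select:

```python
from bs4 import BeautifulSoup

# Two stand-in tables mimicking the page's ".table_bd" tables.
html = """
<table class="table_bd"><tr><td>A</td></tr></table>
<table class="table_bd"><tr><td>B</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one() returns only the FIRST match, so str(table) holds one
# <table> and pd.read_html(str(table)) yields a one-element list.
first = soup.select_one(".table_bd")
print(first is not None)              # True, but it is a single table

# select() returns ALL matches; each element can be parsed separately.
print(len(soup.select(".table_bd")))  # 2
```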


Is there a way to print all the tables on the site?

>Solution :

In this case there seems to be no need to mix modules. Simply use pandas.read_html(), select the tables with attrs, and iterate over the list of dataframes:

import pandas as pd

url_trainer = "https://racing.hkjc.com/racing/information/english/racing/Draw.aspx#race1.aspx"

for table in pd.read_html(url_trainer, attrs={'class':'table_bd'}):
    print(table)
    # other tasks you have to perform on the dataframes

EDIT

Based on the URL you provided, the first part will work; but if you omit the #race1... at the end of the URL, the site reacts slightly differently, and providing a user-agent fixes this:

import pandas as pd
import requests

url_trainer = "https://racing.hkjc.com/racing/information/english/racing/Draw.aspx"

list_of_df = pd.read_html(
                requests.get(
                    url_trainer, 
                    headers={'user-agent':'Mozilla/5.0'}
                ).text, 
                attrs={'class':'table_bd'}
            )

for table in list_of_df:
    print(table)
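Since your original goal was to write the tables to jockeyclub.xlsx, the loop above can also write each dataframe to its own sheet. A sketch, assuming the sheet names "Race Card 1", "Race Card 2", ... are what you want (mode="w" creates a fresh workbook rather than appending):

```python
import pandas as pd
import requests

url_trainer = "https://racing.hkjc.com/racing/information/english/racing/Draw.aspx"

# Fetch the page with a user-agent, then parse every .table_bd table.
html = requests.get(url_trainer, headers={"user-agent": "Mozilla/5.0"}).text
list_of_df = pd.read_html(html, attrs={"class": "table_bd"})

# mode="w" creates the workbook; each table lands on its own sheet.
with pd.ExcelWriter("jockeyclub.xlsx", engine="openpyxl", mode="w") as writer:
    for i, df in enumerate(list_of_df, start=1):
        df.to_excel(writer, sheet_name=f"Race Card {i}", index=False)
```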