I am trying to scrape multiple tables from this website. This is my code:
import pandas as pd
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def scrape_ranking(url, sheet_name):
    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        soup = BeautifulSoup(page.content(), "html.parser")
        table = soup.select_one(".table_bd")
        print("done step 1")
        if table is None:
            print("Table not found.")
        else:
            df = pd.read_html(str(table))[0]
            print(df)
            with pd.ExcelWriter("jockeyclub.xlsx", engine="openpyxl", mode='a', if_sheet_exists='overlay') as writer:
                df.to_excel(writer, sheet_name=sheet_name, index=True, startrow=70)
url_trainer = "https://racing.hkjc.com/racing/information/english/racing/Draw.aspx#race1.aspx"
scrape_ranking(url_trainer, "Race Card 1")
This code prints the table for Race Card 1. However, when I change the line to df = pd.read_html(str(table))[1] or df = pd.read_html(str(table))[2], it isn't able to find any other tables on the site.
Is there a way to print all the tables on the site?
>Solution :
Note that soup.select_one() returns only the first matching element, so str(table) contains a single table and pd.read_html() therefore returns a one-element list. In this case there is no need to mix modules — simply use pandas.read_html(), select the tables with attrs, and iterate over the list of DataFrames:
import pandas as pd

url_trainer = "https://racing.hkjc.com/racing/information/english/racing/Draw.aspx#race1.aspx"

for table in pd.read_html(url_trainer, attrs={'class': 'table_bd'}):
    print(table)
    # other tasks you have to perform on the dataframes
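To see the attrs filter in action without hitting the live site, here is a minimal offline sketch. The HTML is made-up placeholder data; only the class name table_bd is taken from the question:

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML with two tables carrying the class used on the HKJC
# page, plus one unrelated table that the attrs filter should skip.
html = StringIO("""
<table class="table_bd"><tr><th>Horse</th></tr><tr><td>A</td></tr></table>
<table class="other"><tr><th>X</th></tr><tr><td>1</td></tr></table>
<table class="table_bd"><tr><th>Horse</th></tr><tr><td>B</td></tr></table>
""")

# attrs matches every table with class="table_bd", so the result is a
# list with one DataFrame per matching table.
tables = pd.read_html(html, attrs={'class': 'table_bd'})
print(len(tables))  # 2
```

This is why indexing with [1] failed in the original code: there, read_html was handed a string containing exactly one table, so the list only ever had one element.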
EDIT
Based on the URL you provided, the first approach will work; but if you omit the #race1... at the end of the URL, the site reacts slightly differently. Providing a user-agent header fixes this:
import pandas as pd
import requests

url_trainer = "https://racing.hkjc.com/racing/information/english/racing/Draw.aspx"

list_of_df = pd.read_html(
    requests.get(
        url_trainer,
        headers={'user-agent': 'Mozilla/5.0'}
    ).text,
    attrs={'class': 'table_bd'}
)

for table in list_of_df:
    print(table)
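To get back to the original goal of one sheet per race card, each DataFrame from the loop can be written to its own sheet. A sketch, assuming the workbook name jockeyclub.xlsx and the "Race Card N" sheet names from the question; small placeholder DataFrames stand in for the scraped list:

```python
import pandas as pd

# Placeholder DataFrames standing in for the list returned by pd.read_html.
list_of_df = [
    pd.DataFrame({'Horse': ['A', 'B']}),
    pd.DataFrame({'Horse': ['C', 'D']}),
]

# mode="w" creates the workbook fresh; the question's mode='a' only works
# when the file already exists.
with pd.ExcelWriter("jockeyclub.xlsx", engine="openpyxl", mode="w") as writer:
    for i, df in enumerate(list_of_df, start=1):
        df.to_excel(writer, sheet_name=f"Race Card {i}", index=False)
```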