I am trying to scrape a website for a table but only the header is being returned.
I am new to Python and web scraping, and I followed this tutorial, which was very helpful: https://medium.com/analytics-vidhya/how-to-scrape-a-table-from-website-using-python-ce90d0cfb607
However, the following code only returns the header and not the body of the table.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
# Create a URL object
url = 'https://www.dividendmax.com/dividends/declared'
# Request the page with a browser-like User-Agent
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(req)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
# Obtain information from tag <table>
table1 = soup.find_all('table')
table1
Output:
[<table aria-label="Declared Dividends" class="mdc-data-table__table">
<thead>
<tr class="mdc-data-table__header-row">
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Company</th>
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Ticker</th>
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Country</th>
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Exchange</th>
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Share Price</th>
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Prev. Dividend</th>
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Next Dividend</th>
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Next Ex-date</th>
</tr>
</thead>
<tbody></tbody>
</table>]
I need to retrieve the tbody content (found when expanding the penultimate row of output).
Just as an FYI, the following code will be used to create the dataframe.
import pandas as pd
# find_all returns a list of tables, so take the first one
table = table1[0]
# Obtain every title of columns with tag <th>
headers = []
for i in table.find_all('th'):
    title = i.text
    headers.append(title)
# Create a dataframe
mydata = pd.DataFrame(columns = headers)
# Create a for loop to fill mydata
for j in table.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(mydata)
    mydata.loc[length] = row
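For reference, here is a minimal sketch of the same header/row loop run against a small hand-written table, to show that the parsing logic itself works once the rows are actually present in the HTML. The sample markup, the company names, and the helper name `table_to_frame` are made up for illustration; they are not taken from the real page.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hand-written sample markup (hypothetical values), shaped like the page's table.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>Company</th><th>Ticker</th></tr>
  </thead>
  <tbody>
    <tr><td>Example One plc</td><td>EX1</td></tr>
    <tr><td>Example Two plc</td><td>EX2</td></tr>
  </tbody>
</table>
"""

EMPTY_HTML = "<table><thead><tr><th>Company</th></tr></thead><tbody></tbody></table>"


def table_to_frame(html: str) -> pd.DataFrame:
    """Parse the first <table> in `html` into a DataFrame (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    # Column titles come from the <th> cells.
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    # Collect every body row first, then build the DataFrame once.
    rows = []
    for tr in table.find_all("tr")[1:]:
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return pd.DataFrame(rows, columns=headers)


sample_df = table_to_frame(SAMPLE_HTML)   # two data rows
empty_df = table_to_frame(EMPTY_HTML)     # headers only, like the output above
```

With an empty `<tbody>`, as in the output above, the same function returns a DataFrame that has the column names but no rows.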
>Solution :
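The page you are after does not behave the same way as the one in the tutorial: the table body comes back empty in the HTML, so BeautifulSoup has no rows to parse. It is probably not the best site if you're trying to learn or practice with BeautifulSoup. However, when I request the same URL with the headers below, the data comes back for me in a nice JSON format.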
import requests
import pandas as pd
# Create a URL object
url = 'https://www.dividendmax.com/dividends/declared'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}
jsonData = requests.get(url, headers=headers).json()
df = pd.DataFrame(jsonData)
print(df)
Output:
name ... ind
0 3i Group plc ... [22, 25, 23, 3, 5]
1 3I Infrastructure Plc ... [4, 5]
2 AB Dynamics plc ... []
3 Aberdeen Smaller Companies Income Trust plc ... []
4 Aberdeen Standard Equity Income Trust plc ... []
.. ... ... ...
146 Workspace Group ... [25, 4, 24, 5]
147 Wynnstay Properties ... []
148 XP Power Ltd ... [5, 4]
149 Yew Grove REIT Plc ... []
150 Yougov ... []
[151 rows x 11 columns]
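Since the `.json()` call above raises an error if the server sends back HTML instead, it may be worth guarding that step. Below is a rough sketch of one way to do that using only the standard library `json` module. The helper name `records_from_body` and the sample strings are hypothetical, and the check is only a heuristic, not something the site documents.

```python
import json


def records_from_body(body: str, content_type: str = "") -> list:
    """Return a list of records if `body` looks like JSON, else raise ValueError.

    Hypothetical helper: `body` is the response text and `content_type`
    is the value of the Content-Type response header.
    """
    looks_like_json = body.lstrip().startswith(("[", "{"))
    if "json" not in content_type.lower() and not looks_like_json:
        raise ValueError("response does not look like JSON (probably HTML)")
    # json.JSONDecodeError is a subclass of ValueError, so malformed
    # input surfaces as the same exception type.
    data = json.loads(body)
    # Wrap a single object so the caller always gets a list of records.
    if isinstance(data, dict):
        data = [data]
    return data


# Made-up sample payload, shaped loosely like the output above.
records = records_from_body('[{"name": "Example plc", "ind": [4, 5]}]', "application/json")
```

With requests, this could be wired in as `pd.DataFrame(records_from_body(resp.text, resp.headers.get("Content-Type", "")))`, where `resp` is the result of `requests.get(url, headers=headers)`.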