Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Create a DataFrame with ordered data from a scraped table with Beautiful Soup

I am trying to create a DataFrame with the CPI data scraped from a website with Beautiful Soup, sorted by year and Country. The DF would have the following structure:

country year q1 q2 q3 q4 year
Australia 2022 123.9 126.1
Australia 2021 117.9 118.8 119.7 117.2 119.4

I am able to get all the needed data from the following script:

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs


list_countries=['australia','canada','brazil','italy','japan','mexico','new-zealand','france','germany','philippines','india','korea','russia','singapore','switzerland','uk','usa']

url = 'https://www.rateinflation.com/consumer-price-index/australia-historical-cpi/'
r = requests.get(url)
soup = bs(r.text, 'html.parser')
cpi_get = soup.find('table', class_='css-8rh80p eyyd7td0')

for year in cpi_get.find_all('tr'):
    quarters = year.find_all('td')
    print(quarters)

The problem is that the output has the following structure, for each iteration:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

[<td>2021</td>, <td>117.9</td>, <td>118.8</td>, <td>119.7</td>, <td>121.3</td>, <td>119.4</td>]

Having the first td as the year, the second q1,… and consecutively up to year. As it does not have any class or extra info apart of the position, I don´t know how to create the table from that.

Does anyone know how to get the desired output from that?

>Solution :

Try pd.read_html:

import pandas as pd

url = "https://www.rateinflation.com/consumer-price-index/australia-historical-cpi/"

df = pd.read_html(url)[0]
df.rename(
    columns={
        "mar": "q1",
        "jun": "q2",
        "sep": "q3",
        "dec": "q4",
    },
    inplace=True,
)
df["Country"] = "Australia"

# print first 10 rows:
print(df.head(10).to_markdown())

Prints:

Year q1 q2 q3 q4 Annual Country
0 2022 123.9 126.1 nan nan nan Australia
1 2021 117.9 118.8 119.7 121.3 119.4 Australia
2 2020 116.6 114.4 116.2 117.2 116.1 Australia
3 2019 114.1 114.8 115.4 116.2 115.1 Australia
4 2018 112.6 113 113.5 114.1 113.3 Australia
5 2017 110.5 110.7 111.4 112.1 111.2 Australia
6 2016 108.2 108.6 109.4 110 109.1 Australia
7 2015 106.8 107.5 108 108.4 107.7 Australia
8 2014 105.4 105.9 106.4 106.6 106.1 Australia
9 2013 102.4 102.8 104 104.8 103.5 Australia

EDIT:

list_countries = ["australia", "canada"]
url = "https://www.rateinflation.com/consumer-price-index/{}-historical-cpi/"

dfs = []
for country in list_countries:
    df = pd.read_html(url.format(country))[0]
    df["Country"] = country.capitalize()
    dfs.append(df)

df = pd.concat(dfs)
print(df)
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading