How to scrape data from a line chart on bltindex.com?

I want to scrape the data from the only line chart on https://www.bltindex.com/
The goal is to end up with a pandas DataFrame containing one time series from the chart.

After watching this video, I tried the same method and looked for a CSV or JSON file in the browser's Network tab while the page was loading, but could not find one. The only thing I found was a CSS file with the word "chart" in its name, https://docs.google.com/static/spreadsheets2/client/css/838001818-v3-ritz_chart_css_ltr.css, which also had a request link (used in the code below).
I tried the following code:

import requests
from bs4 import BeautifulSoup

url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQG9TYlv8_LpCvO7EI3Y3s8MoxQEfOHTd3-EqccN5PoeHcdxraxZC0y8UWFx_2NnogVIIuk1i-phvFe/pubchart?oid=813038046&format=interactive'
html = requests.get(url)

# Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
soup = BeautifulSoup(html.content, "html.parser")
print(soup.prettify())

The code printed the page's HTML, and inside the <script nonce="yyTSUqBQUPTxI-ZkIM7OKw"> tag I did see the values I want. However, I do not know how to extract them from this string without doing it manually. Is there a more convenient way to get the data?

Solution:

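The chart data is embedded in the page's script as a 'chartJson' string with \xNN hex escapes. You can pull it out with a regular expression, decode the escapes, and parse the result as JSON. Try: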

import json
import re

import pandas as pd
import requests

url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vQG9TYlv8_LpCvO7EI3Y3s8MoxQEfOHTd3-EqccN5PoeHcdxraxZC0y8UWFx_2NnogVIIuk1i-phvFe/pubchart?oid=813038046&format=interactive"
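html_text = requests.get(url).text

# The chart data sits in the page's JavaScript as a 'chartJson' string literal.
data = re.search(r"'chartJson': '(.*?)',", html_text).group(1)
# Decode the \xNN hex escapes (e.g. \x22 -> ") so the string becomes valid JSON.
data = re.sub(r"\\x(..)", lambda g: chr(int(g.group(1), 16)), data)
data = json.loads(data)

# print(json.dumps(data, indent=4))  # uncomment to inspect the full structure

# Each row stores its cells under "c"; "f" is a cell's formatted (string) value.
df = pd.DataFrame(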
    [(r["c"][0]["f"], r["c"][1]["f"]) for r in data["dataTable"]["rows"]],
    columns=["Date", "Value"],
)
print(df)

Prints:

            Date       Value
0    07-Jan-2018           1
1    14-Jan-2018   1.0396913
2    21-Jan-2018  0.84593582
3    28-Jan-2018  0.78201258
4    04-Feb-2018  0.71397352
5    11-Feb-2018   0.8097111
6    18-Feb-2018   1.2938001
7    25-Feb-2018  0.95799756
8    04-Mar-2018  0.81667918

...
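
Since the "f" fields are formatted values, both columns in the DataFrame above appear to be plain strings. If you want a real time series, a possible follow-up step is to convert them to datetime and float types. The sketch below is only an illustration: it rebuilds a few rows copied from the printed output instead of downloading the page, and it assumes every date follows the "07-Jan-2018" pattern with English month abbreviations.

```python
import pandas as pd

# Sample rows copied from the printed output above; both columns are strings.
df = pd.DataFrame(
    [
        ("07-Jan-2018", "1"),
        ("14-Jan-2018", "1.0396913"),
        ("21-Jan-2018", "0.84593582"),
    ],
    columns=["Date", "Value"],
)

# Parse dates such as "07-Jan-2018" (assumes English month abbreviations).
df["Date"] = pd.to_datetime(df["Date"], format="%d-%b-%Y")
# Convert the values to floats; entries that cannot be parsed become NaN.
df["Value"] = pd.to_numeric(df["Value"], errors="coerce")

# Index by date to get a single time series.
series = df.set_index("Date")["Value"]
print(series)
```

With the real data, the same two conversion lines can be applied directly to the df built from data["dataTable"]["rows"].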