Extracting chosen information from URL results into a dataframe

I would like to create a dataframe by pulling only certain information from this website.

https://www.stockrover.com/build/production/Research/tail.js?1644930560

I would like to pull all the entries like this one: ["0005.HK","HSBC HOLDINGS","",""]

Another problem: I only want roughly the first 20,000 lines, which are the stock information. There is other information after line 20,000 that I don’t want included in the dataframe.

To summarize, could someone show me how to extract just this information and create a dataframe from the results, if this is possible?

A sample of the website results:

function getStocksLibraryArray(){return[["0005.HK","HSBC HOLDINGS","",""],["0006.HK","Power Assets Holdings Ltd","",""],["000660.KS","SK hynix","",""],["004370.KS","Nongshim","",""],["005930.KS","Samsung Electroni","",""],["0123.HK","YUEXIU PROPERTY","",""],["0336.HK","HUABAO INTL","",""],["0408.HK","YIP'S CHEMICAL","",""],["0522.HK","ASM PACIFIC","",""],["0688.HK","CHINA OVERSEAS","",""],["0700.HK","TENCENT","",""],["0762.HK","CHINA UNICOM","",""],["0808.HK","PROSPERITY REIT","",""],["0813.HK","SHIMAO PROPERTY",

Code that pulls all lines, including the ones not wanted:

import pandas as pd
import requests

url = "https://www.stockrover.com/build/production/Research/tail.js?1644930560"

payload={}
headers = {}

response = requests.request("GET", url, headers=headers, data=payload)

print(response.text)

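Solution:

Use a regex to extract the array returned by getStocksLibraryArray(), then use ast.literal_eval to convert that string into a Python object. Because the match stops at the function's closing brace, anything that follows the stock list is left out of the dataframe.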

import re
from ast import literal_eval

import pandas as pd
import requests

url = "https://www.stockrover.com/build/production/Research/tail.js?1644930560"

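response = requests.get(url)

# Capture only the array returned by getStocksLibraryArray(); the non-greedy
# match stops at the first closing brace, so later content is ignored.
regex_ = re.compile(r"getStocksLibraryArray\(\)\{return(.+?)}", re.DOTALL)

# The captured text is a valid Python list literal, so literal_eval can parse it.
print(pd.DataFrame(literal_eval(regex_.search(response.text).group(1))))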

               0                          1       2 3 
0        0005.HK              HSBC HOLDINGS           
1        0006.HK  Power Assets Holdings Ltd           
2      000660.KS                   SK hynix           
3      004370.KS                   Nongshim           
4      005930.KS          Samsung Electroni           
...          ...                        ...     ... ..
21426      ZZHGF         ZhongAn Online P&C  _INSUP   
21427      ZZHGY         ZhongAn Online P&C  _INSUP   
21428       ZZLL      ZZLL Information Tech  _INTEC   
21429     ZZZ.TO       Sleep Country Canada  _SPECR   
21430      ZZZOF         Zinc One Resources  _OTHEI   
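
The same approach can be tried offline on a short inline sample, which makes it easier to see what the regex captures. This is only a sketch under a few assumptions: SAMPLE_JS is a hand-made stand-in for response.text, the second function in it is invented to represent the unwanted content, and the column labels are hypothetical because the source does not name the four fields. The optional limit argument uses DataFrame.head to keep only the first rows if a hard cap is still wanted.

```python
import re
from ast import literal_eval

import pandas as pd

# Hand-made stand-in for response.text (assumption: the real file has the same
# shape, with unrelated functions following the one we want).
SAMPLE_JS = (
    'function getStocksLibraryArray(){return[["0005.HK","HSBC HOLDINGS","",""],'
    '["0006.HK","Power Assets Holdings Ltd","",""],'
    '["0408.HK","YIP\'S CHEMICAL","",""]]}'
    'function somethingElse(){return[["not","wanted"]]}'
)

# Hypothetical column labels; the source does not say what the fields mean.
COLUMNS = ["ticker", "name", "field_3", "field_4"]

# Same pattern as in the answer: capture everything between "return" and the
# first closing brace after it.
PATTERN = re.compile(r"getStocksLibraryArray\(\)\{return(.+?)\}", re.DOTALL)


def stocks_frame(js_text, limit=None):
    """Build a dataframe from the array returned by getStocksLibraryArray()."""
    match = PATTERN.search(js_text)
    if match is None:
        raise ValueError("getStocksLibraryArray() not found in the input")
    # Works only while the JavaScript array is also a valid Python literal.
    rows = literal_eval(match.group(1))
    frame = pd.DataFrame(rows, columns=COLUMNS)
    return frame if limit is None else frame.head(limit)


df = stocks_frame(SAMPLE_JS)
print(df)
```

Two caveats apply to this sketch: the non-greedy match would end early if an entry ever contained a closing brace, and literal_eval would fail on JavaScript-only values such as null or true.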