Web scrape using Python – Execution takes too long

I am trying to webscrape the "Active Positions" table from the following website:

https://www.nasdaq.com/market-activity/stocks/aapl/institutional-holdings

My code is below:

from bs4 import BeautifulSoup
import requests

# Download the page HTML and parse it with lxml
html_text = requests.get('https://www.nasdaq.com/market-activity/stocks/aapl/institutional-holdings').text
soup = BeautifulSoup(html_text, 'lxml')

# Walk down the nested divs towards the Active Positions table
job1 = soup.find('div', class_ = 'dialog-off-canvas-main-canvas')
job2 = job1.find('div', class_ = 'page with-primary-nav hide-more-videos')
job3 = job2.find('div', class_ = 'page__main')
job4 = job3.find('div', class_ = 'page__content')
job5 = job4.find('div', class_ = 'quote-subdetail__content quote-subdetail__content--new')
job6 = job5.find('div', class_ = 'layout layout--2-col-large')
job7 = job6.find('div', class_ = 'institutional-holdings institutional-holdings--paginated')
job8 = job7.find('div', class_ = 'institutional-holdings__section institutional-holdings__section--active-positions')
job9 = job8.find('div', class_ = 'institutional-holdings__table-container')
job10 = job9.find('table', class_ = 'institutional-holdings__table')
job11 = job10.find('tbody', class_ = 'institutional-holdings__body')
job12 = [row.text for row in job11.findAll('tr', class_ = 'institutional-holdings__row')]

print(job12)

I have chosen to include nearly every class in the path to try to speed up execution, because including only a couple of them ran for up to 10 minutes before I decided to interrupt. However, I still get the same long execution with no output. Is there something wrong with my code, or can I improve it by doing something I haven't thought of? Thanks.

Solution:

The table data is hydrated into the page by JavaScript XHR calls, so it is not present in the initial HTML that requests downloads, which is why your chain of find calls never matches anything. Here is a way of getting the Active Positions data by querying the API endpoint directly:

import requests
import pandas as pd

# Query the JSON API the page itself calls, instead of the rendered HTML
url = 'https://api.nasdaq.com/api/company/AAPL/institutional-holdings?limit=15&type=TOTAL&sortColumn=marketValue&sortOrder=DESC'

headers = {
    'accept': 'application/json, text/plain, */*',
    'origin': 'https://www.nasdaq.com',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

r = requests.get(url, headers=headers)
# Flatten the Active Positions rows into a DataFrame
df = pd.json_normalize(r.json()['data']['activePositions']['rows'])
print(df)

Result in terminal:

                    positions holders         shares
0         Increased Positions   1,780    239,170,203
1         Decreased Positions   2,339    209,017,331
2              Held Positions     283  8,965,339,255
3  Total Institutional Shares   4,402  9,413,526,789
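To see what pd.json_normalize is doing with the response, here is a small self-contained sketch using a local sample dict that mirrors the data → activePositions → rows path used above (the values are made up for illustration, not real API output):

```python
import pandas as pd

# Illustrative sample mirroring the API response shape used above;
# the figures are placeholders, not live data
sample = {
    "data": {
        "activePositions": {
            "rows": [
                {"positions": "Increased Positions", "holders": "1,780", "shares": "239,170,203"},
                {"positions": "Decreased Positions", "holders": "2,339", "shares": "209,017,331"},
            ]
        }
    }
}

# Each dict in 'rows' becomes one DataFrame row; keys become columns
df = pd.json_normalize(sample["data"]["activePositions"]["rows"])
print(df.shape)  # (2, 3)
```

Because the rows are flat dicts, json_normalize here behaves like a plain DataFrame constructor; it earns its keep when the API nests further objects inside each row.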

In case you want to scrape the big table of 4,402 Institutional Holders, the same approach of hitting the API endpoint works for that too.
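One way to approach the larger table is to page through the endpoint. The sketch below only builds candidate URLs; note that the offset query parameter is an assumption about how this endpoint paginates, not confirmed behaviour, so inspect the XHR calls in your browser's network tab before relying on it:

```python
# Base endpoint taken from the request above
BASE = 'https://api.nasdaq.com/api/company/AAPL/institutional-holdings'

def page_url(limit=15, offset=0):
    # ASSUMPTION: 'offset' is a guess at the pagination parameter;
    # verify against the site's own XHR requests
    return f'{BASE}?limit={limit}&offset={offset}&type=TOTAL&sortColumn=marketValue&sortOrder=DESC'

# First three pages of 15 rows each
urls = [page_url(limit=15, offset=i * 15) for i in range(3)]
print(urls[1])
```

Each URL would then be fetched with the same headers as above and the pages concatenated with pd.concat.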

EDIT: Here is how you can save the data to a JSON file:

df.to_json('active_positions.json')

Although it might make more sense to save it as tabular data (CSV):

df.to_csv('active_positions.csv')
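The API returns the numeric columns as comma-formatted strings, so if you intend to compute with the CSV afterwards, it may help to convert them first. A small sketch on sample values (the figures are illustrative, not live data):

```python
import pandas as pd

# Sample rows with the comma-formatted strings the API returns
df = pd.DataFrame({
    "positions": ["Increased Positions", "Decreased Positions"],
    "holders": ["1,780", "2,339"],
    "shares": ["239,170,203", "209,017,331"],
})

# Strip thousands separators and convert to integers
# so the saved CSV holds real numbers, not strings
for col in ("holders", "shares"):
    df[col] = df[col].str.replace(",", "", regex=False).astype(int)

print(df["holders"].sum())  # 4119
```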

Pandas docs: https://pandas.pydata.org/docs/
