I’ve been assigned the task of scraping Nvidia’s session catalog (https://www.nvidia.com/gtc/session-catalog/). The site loads sessions dynamically via AJAX, and I want to collect the data for every session in the catalog.
Using my very basic approach (scraping particular element tags, clicking the "Show More" button, and repeating the process for the newly loaded elements), I can scrape roughly 520–530 sessions. After that, the page repeatedly throws an XHR 500 Internal Server Error. I’ve tried this both with and without a headless browser (Puppeteer).
Why is this happening, and how can I overcome it? You can inspect the elements/tags on the website for a better understanding.
An answer without using a headless browser, if possible, would be amazing.
I have tried basic element-tag scraping and the same approach with Puppeteer, but both stop working after roughly 520–540 sessions.
>Solution :
Why is this happening
It is apparently a bug in their server-side code: the API stops responding once you try to load more than 500 events. Note that the 500 in "500 Internal Server Error" occurring at 500 events is just a coincidence. Status code 500 is the standard HTTP code for "something went wrong on the server", not a reference to the number of events.
If you’re curious to see what calls are being made, open your browser’s developer tools and go to the Network tab. The API call that loads the next page is a request to https://events.rainfocus.com/api/search. It succeeds at first, then fails once the from parameter reaches 500.
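You can reproduce this outside the browser with a minimal sketch like the one below (Node 18+, built-in fetch). The from/size parameter names come from the request visible in the Network tab; everything else about the request (headers, profile IDs, the real page size) is an assumption here and should be copied verbatim from a captured request.

```javascript
// Sketch: probe the rainfocus search endpoint at increasing offsets to see
// where it starts failing. The body below is deliberately minimal; copy the
// real fields and headers from a request captured in the Network tab.

const ENDPOINT = "https://events.rainfocus.com/api/search";
const PAGE_SIZE = 24; // hypothetical page size; check the real request

// Pure helper: the sequence of "from" offsets a "Show More" loop would send.
function pageOffsets(total, size) {
  const offsets = [];
  for (let from = 0; from < total; from += size) offsets.push(from);
  return offsets;
}

// Sends one probe request and reports the HTTP status for a given offset.
async function probe(from) {
  const res = await fetch(ENDPOINT, {
    method: "POST",
    // Headers and extra body fields omitted: copy them from a real request.
    body: new URLSearchParams({ from: String(from), size: String(PAGE_SIZE) }),
  });
  return res.status; // per the answer above: 200 below from=500, 500 after
}
```

Calling `probe()` with offsets from `pageOffsets()` should show successful responses until the offset reaches 500, matching the failure point described above.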
how to overcome this
Unless you find another source for the data, there is no overcoming this: the server simply will not send more than 500 events to the front end. You could try to find out whether they have a bug-reporting process for their website, but that’s a long shot.
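That said, you can at least collect the first 500 events without a headless browser by calling the API directly, as requested. This is a sketch under the same assumptions as before: the from/size names are taken from the observed request, the page size and request body are placeholders to be copied from a real captured request, and the response shape must be inspected in the Network tab before you can extract session fields from it.

```javascript
// Sketch: page through the API directly, stopping either when the server
// errors out or when we reach its ~500-event cap. Request details are
// placeholders; copy the real ones from the browser's Network tab.

const SEARCH_URL = "https://events.rainfocus.com/api/search";
const SERVER_CAP = 500; // observed limit, not documented anywhere
const PAGE = 24;        // hypothetical page size; check the real request

// Pure helper: decide whether another page is worth requesting.
function shouldContinue(status, from, cap) {
  return status === 200 && from < cap;
}

// Collects raw page payloads until the cap or an error status is hit.
async function fetchAllPages() {
  const pages = [];
  let from = 0;
  let status = 200;
  while (shouldContinue(status, from, SERVER_CAP)) {
    const res = await fetch(SEARCH_URL, {
      method: "POST",
      // Headers and extra body fields omitted: copy from a real request.
      body: new URLSearchParams({ from: String(from), size: String(PAGE) }),
    });
    status = res.status;
    if (status !== 200) break;
    // The response shape is not documented here; inspect it and pull the
    // session array out of the parsed JSON accordingly.
    pages.push(await res.json());
    from += PAGE;
  }
  return pages;
}
```

The `shouldContinue` guard encodes the answer's conclusion directly: stop at 500 events rather than hammering an endpoint known to fail past that point.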