I want to download multiple PDF files of old newspapers. Specifically, files that look like this or this. My problem is that when I try to automate the process with requests or wget, the sites don't serve an actual PDF file directly, so I am not able to get the real file.
Is there a way to automate this process and download the actual files with Python?
> Solution:
For this particular web page the pages are served from a predictable url:
- https://www.sbt.ti.ch/aqp_pdf/gdp/2005/12/gdp_2005-12-01/gdp_2005-12-01_001.pdf
- https://www.sbt.ti.ch/aqp_pdf/gdp/2005/12/gdp_2005-12-01/gdp_2005-12-01_002.pdf
- etc
This is so regular I wouldn't even bother extracting it from the page for this problem: I'd just generate the URLs myself, do a requests.get() for each of them, and merge the results with PyPDF2.
The more general question is: how did I know that URL? Have a look at the Network tab in your browser's devtools while the viewer loads a page, and you'll see the PDF requests go by.
General approaches
There are basically two solutions to this kind of problem:
- extract the required parameters from the page (look at how the page builds up the urls it needs), or
- run a real browser with something like Selenium, and automate it.
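For the first approach, here is a stdlib-only sketch that pulls PDF links out of a fetched page. It assumes the page links its PDFs with ordinary `<a href>` tags (if the viewer builds URLs in JavaScript instead, you'd need the Selenium route); the class name and sample HTML are illustrative:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkExtractor(HTMLParser):
    """Collect href attributes that point at .pdf files, resolved against a base URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.lower().endswith(".pdf"):
                    self.links.append(urljoin(self.base_url, value))


# Illustrative snippet of page HTML; in practice you'd feed it requests.get(url).text
html = '<a href="/aqp_pdf/gdp/2005/12/gdp_2005-12-01/gdp_2005-12-01_001.pdf">p. 1</a>'
parser = PdfLinkExtractor("https://www.sbt.ti.ch/")
parser.feed(html)
print(parser.links)
```

The same idea works with BeautifulSoup if you prefer; the point is simply to read the URLs out of the HTML the server already sent you, instead of driving a browser.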
Sometimes you get lucky and there's a real API designed to help you do this. That's quite common with public archive data like this (in France, the APIs of the BnF are excellent, but I don't know what, if anything, the Italian equivalent would be).
