Get PDF file provided by webapp

I want to download multiple PDF files of old newspapers, specifically files that look like this or this. My problem is that when I try to automate the process with requests or wget, I can't get the actual file, because the sites don't serve a plain PDF.

Is there a way to automate this process and download the actual files with Python?


Solution:

For this particular site, the pages are served from a predictable URL.

This is so regular I wouldn't even bother extracting it from the page for this problem: I'd just generate the URLs myself, do a requests.get() for each one, and splice the results together with PyPDF2.
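A minimal sketch of that idea. The URL pattern here (`BASE_URL`, the `issue`/`page` parameters, and the helper names) is hypothetical — read the real pattern off the devtools Network tab for your site:

```python
from io import BytesIO

# Hypothetical URL pattern -- substitute the real one you observe in
# your browser's devtools Network tab. One PDF per newspaper page.
BASE_URL = "https://example.org/archive/{issue}/page-{page}.pdf"

def page_url(issue: str, page: int) -> str:
    """Build the URL of a single page's PDF."""
    return BASE_URL.format(issue=issue, page=page)

def download_issue(issue: str, n_pages: int, out_path: str) -> None:
    """Fetch every page of an issue and splice them into one PDF."""
    import requests               # pip install requests
    from PyPDF2 import PdfMerger  # pip install PyPDF2

    merger = PdfMerger()
    for page in range(1, n_pages + 1):
        resp = requests.get(page_url(issue, page), timeout=30)
        resp.raise_for_status()
        merger.append(BytesIO(resp.content))  # append() accepts file-like objects
    merger.write(out_path)
    merger.close()
```

Usage would be something like `download_issue("1923-05-01", 12, "issue.pdf")`, assuming issues are keyed by date and you know (or probe for) the page count.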

The more general question is: how did I know that URL? Have a look at your browser's devtools:

[screenshot: browser devtools Network panel showing the requests the page makes]

General approaches

There are basically two solutions to this kind of problem:

  • extract the required parameters from the page (look at how the page builds up the URLs it needs), or
  • run a real browser with something like Selenium, and automate it.
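The first approach can be sketched with the standard library alone. This assumes, hypothetically, that the viewer page embeds the PDF URLs directly in its HTML; if the real page assembles them in JavaScript instead, read them off the devtools Network tab as above:

```python
import re

def extract_pdf_urls(html: str) -> list[str]:
    """Pull PDF URLs out of a viewer page's HTML.

    Only works when the URLs appear literally in attributes or
    inline scripts -- a simplifying assumption for illustration.
    """
    return re.findall(r'''https?://[^\s"'<>]+\.pdf''', html)

# Made-up snippet of viewer HTML:
snippet = '<iframe src="https://example.org/scans/page-1.pdf"></iframe>'
urls = extract_pdf_urls(snippet)  # ['https://example.org/scans/page-1.pdf']
```

For the second approach you would drive a real browser with Selenium and capture the file the viewer loads; that is heavier, but it works even when the URL is built dynamically.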

Sometimes you get lucky and there's a real API designed to help you do this. That's quite common with public archive data like this (in France, the APIs of the BNF are excellent, but I don't know what, if anything, the Italian equivalent would be).
