Extract hyperlinks from a spooled PDF file in Python

November 13, 2021

I am receiving form data from the frontend and reading it with FastAPI, as shown below:

@app.post("/file_upload")
async def upload_file(pdf: UploadFile = File(...)):
    print("Content = ",pdf.content_type,pdf.filename,pdf.spool_max_size)
    return {"filename": "Success"}

Now I need to extract hyperlinks from these spooled files with the help of pypdfextractor (pdfx), as shown below:

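import pdfx
from os.path import exists
from config import availableUris

def getHrefsFromPDF(pdfPath: str) -> list:
    # pdfx expects a path on disk, so fail early if the file is missing
    if not exists(pdfPath):
        raise FileNotFoundError("PDF File not Found")
    pdf = pdfx.PDFx(pdfPath)
    # 'url' holds the list of hyperlinks; default to an empty list
    return pdf.get_references_as_dict().get('url', [])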

But I am not sure how to convert the spooled file (received from FastAPI) into a format that pdfx can read.

Additionally, I also tried to study the bytes that come out of the file. When I try to do this:

data = await pdf.read()

The data type shows as bytes. When I convert it with the str function, I get an escaped string that is total gibberish to me. I also tried to decode it as "utf-8", which throws a UnicodeDecodeError.

>Solution :

FastAPI gives you an UploadFile backed by a SpooledTemporaryFile. You may be able to use that file object directly if there is some API in pdfx which will work on a file object rather than a str representing a path. Otherwise make a new temporary file on disk and work with that:

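from pathlib import Path
from tempfile import TemporaryDirectory
import pdfx

@app.post("/file_upload")
async def upload_file(pdf: UploadFile = File(...)):
    with TemporaryDirectory() as d:
        # TemporaryDirectory yields a str, so wrap it in Path to use "/"
        tmpf = Path(d) / "pdf.pdf"
        with tmpf.open("wb") as f:
            # UploadFile.read() is a coroutine and must be awaited
            f.write(await pdf.read())

        p = pdfx.PDFx(str(tmpf))
        ...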

It may be that pdfx.PDFx will take a Path object. I’ll update this answer if so. I’ve kept the write to disk synchronous for ease, but you can make it asynchronous if there is a reason to do so.

Note that it would be better to find a way of doing this with the SpooledTemporaryFile.
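
One step in that direction, as a rough sketch: the underlying SpooledTemporaryFile is exposed as pdf.file, so it can be streamed to disk in chunks instead of reading the whole upload into memory first. This still writes a file on disk, because I am assuming pdfx needs a path. The helper name copy_spooled_to_path is made up for this example.

```python
import shutil
from pathlib import Path
from tempfile import SpooledTemporaryFile, TemporaryDirectory


def copy_spooled_to_path(spooled, target: Path) -> Path:
    """Copy a file-like object to a real file on disk, chunk by chunk.

    Hypothetical helper, not part of FastAPI or pdfx.
    """
    spooled.seek(0)  # rewind in case the upload was already read
    with target.open("wb") as out:
        shutil.copyfileobj(spooled, out)
    return target


# Stand-in for the object FastAPI exposes as `pdf.file`.
with SpooledTemporaryFile(max_size=1024) as spooled, TemporaryDirectory() as d:
    spooled.write(b"%PDF-1.4\nfake body")
    path = copy_spooled_to_path(spooled, Path(d) / "upload.pdf")
    print(path.name, path.stat().st_size)
```

Inside the endpoint this would be called as copy_spooled_to_path(pdf.file, tmpf), after which str(tmpf) can be handed to pdfx while the temporary directory is still open.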

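As to your data showing as bytes: well, PDFs are (basically) binary files, so that is what you should expect. Calling str on a bytes object only gives you its escaped representation, and decoding as "utf-8" fails because the content is not UTF-8 text.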
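
To make the bytes point concrete, here is a small sketch with made-up sample data (not a real PDF). The helper looks_like_pdf is hypothetical and only checks the "%PDF-" marker that PDF files start with. It shows that the right move is to inspect the raw bytes rather than decode them.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: PDF files begin with the marker b'%PDF-'."""
    return data.startswith(b"%PDF-")


# Made-up sample: a PDF-style header followed by non-text bytes.
data = b"%PDF-1.4\n\xe2\xe3\xcf\xd3\n"

print(looks_like_pdf(data))

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    # Binary content is not valid UTF-8, so decoding is expected to fail.
    print("not text:", exc.reason)
```

In the endpoint, the same check could be run on the result of await pdf.read() to reject uploads that are clearly not PDFs before handing anything to pdfx.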