Extract Hyperlink from a spool pdf file in Python

I receive form data from the frontend and read it with FastAPI as shown below:

from fastapi import FastAPI, File, UploadFile

app = FastAPI()

@app.post("/file_upload")
async def upload_file(pdf: UploadFile = File(...)):
    print("Content = ", pdf.content_type, pdf.filename, pdf.spool_max_size)
    return {"filename": "Success"}

What I need to do now is extract the hyperlinks from these spooled files using the pdfx library, as shown below:

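import pdfx
from os.path import exists
from config import availableUris

def getHrefsFromPDF(pdfPath: str) -> list:
    # pdfx needs a real path on disk, so fail early if it is missing
    if not exists(pdfPath):
        raise FileNotFoundError("PDF File not Found")
    pdf = pdfx.PDFx(pdfPath)
    # get_references_as_dict() maps reference types to lists; keep the URLs
    return pdf.get_references_as_dict().get('url', [])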

But I am not sure how to convert the spooled file (received from FastAPI) into a form that pdfx can read.

I also tried inspecting the bytes read from the file. When I do this:

data = await pdf.read()

the type of data is bytes. Converting it with str() gives an escaped string that looks like gibberish to me, and decoding it as "utf-8" raises a UnicodeDecodeError.

Solution:

FastAPI's UploadFile wraps a SpooledTemporaryFile, which is available as pdf.file. You may be able to use that file object directly if pdfx has an API that accepts a file object rather than a str representing a path. Otherwise, write the upload to a new temporary file on disk and work with that:

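from pathlib import Path
from tempfile import TemporaryDirectory

import pdfx
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

@app.post("/file_upload")
async def upload_file(pdf: UploadFile = File(...)):
    with TemporaryDirectory() as d:
        # TemporaryDirectory yields a str, so wrap it in Path to use "/"
        tmpf = Path(d) / "pdf.pdf"
        with tmpf.open("wb") as f:
            # UploadFile.read() is a coroutine and must be awaited
            f.write(await pdf.read())

        p = pdfx.PDFx(str(tmpf))
        ...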

It may be that pdfx.PDFx will accept a Path object; I'll update this answer if so. I've kept the write to disk synchronous for simplicity, but you can make it asynchronous if there is a reason to do so.
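The copy-to-disk step can also be sketched with the standard library alone, streaming from the underlying file object instead of reading the whole upload into memory. The helper name as_disk_path and the file name upload.pdf are made up for this illustration; nothing here is part of FastAPI or pdfx.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import BinaryIO, Iterator


@contextmanager
def as_disk_path(fileobj: BinaryIO, suffix: str = ".pdf") -> Iterator[Path]:
    """Copy a binary file object to a real file and yield its path.

    Hypothetical helper: the copy lives in a temporary directory that is
    removed when the with-block exits.
    """
    fileobj.seek(0)  # the upload may already have been read once
    with tempfile.TemporaryDirectory() as tmpdir:
        target = Path(tmpdir) / f"upload{suffix}"
        with target.open("wb") as out:
            shutil.copyfileobj(fileobj, out)  # copies in chunks
        yield target


# Stand-in for the spooled file behind an upload (pdf.file in the endpoint)
spooled = tempfile.SpooledTemporaryFile(max_size=1024)
spooled.write(b"%PDF-1.4 demo")

with as_disk_path(spooled) as path:
    print(path.name, path.stat().st_size)
```

Inside the endpoint this would presumably be used as `with as_disk_path(pdf.file) as path:` followed by `getHrefsFromPDF(str(path))`, assuming pdfx only needs the file to exist while it is being parsed.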

Note that it would be better to find a way of doing this with the SpooledTemporaryFile.

As for your data showing up as bytes: PDFs are (basically) binary files, so that is expected, and decoding them as UTF-8 will generally fail.
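To make the point about binary data concrete, here is a small sketch that checks an upload for the PDF header marker and shows why UTF-8 decoding fails. The sample bytes are invented for the illustration (a header line followed by a few high-bit bytes, as many PDFs have near the start), and both function names are hypothetical.

```python
def looks_like_pdf(data: bytes) -> bool:
    """True if the payload starts with the PDF header marker."""
    return data.startswith(b"%PDF-")


def is_utf8_text(data: bytes) -> bool:
    """True if the payload decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Made-up sample: a PDF header plus bytes that are not valid UTF-8
sample = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"

print(looks_like_pdf(sample))  # True
print(is_utf8_text(sample))    # False
```

A check like looks_like_pdf(await pdf.read()) could serve as a cheap sanity test before handing the file to pdfx, though it does not validate the rest of the document.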
