
Extracting text from selected page(s) of a PDF

I am using pdfminer / pdfminer.six to extract text from a PDF.

When I try to extract the text of selected page(s) only, I get this error:

AttributeError: 'generator' object has no attribute 'seek'
# from this line "parser = PDFParser(page_selected)"

What is the right way to extract the text of selected page(s) only? Thank you.

Here is the code I have:

from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
import pdfminer.high_level as hl

working_folder = "C:\\temp\\"

output_string = StringIO()

with open(working_folder + 'AU.pdf', 'rb') as in_file:

    page_selected = hl.extract_pages(in_file, page_numbers=[1])   # second page

    parser = PDFParser(page_selected)     # error line

    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

Solution:

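The error occurs because extract_pages returns a generator of layout objects, not a file-like object, so PDFParser cannot call seek on it. pdfminer.six has a dedicated high-level function for text extraction, extract_text, which accepts a page_numbers argument directly.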

from pdfminer.high_level import extract_text

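working_folder = 'C:\\temp\\'  # a raw string cannot end in a single backslash, so r'C:\temp\' is a syntax error
file_name = f'{working_folder}AU.pdf'
page_number = 1  # zero-based index, so 1 selects the second page

page_text = extract_text(file_name, page_numbers=[page_number])
print(page_text)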
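
If you would rather keep the low-level pipeline from the question, one option is to pass the open file itself to PDFPage.get_pages and select pages with its pagenos argument. The following is only a sketch under that assumption, not the only way to do it. The function names to_zero_based and extract_selected_pages are hypothetical helpers made up for this example; they take page numbers counted from 1 and convert them to the zero-based indexes pdfminer.six expects.

```python
from io import StringIO


def to_zero_based(pages):
    """Convert 1-based page numbers (as a reader counts them) into the
    sorted, de-duplicated zero-based indexes that pdfminer.six expects."""
    indexes = sorted({int(p) - 1 for p in pages})
    if not indexes:
        # An empty selection would make get_pages() return every page.
        raise ValueError("no pages selected")
    if indexes[0] < 0:
        raise ValueError("page numbers start at 1")
    return indexes


def extract_selected_pages(pdf_path, pages):
    """Return the text of the given 1-based pages using the low-level API."""
    # Imported inside the function so to_zero_based() stays usable
    # even where pdfminer.six is not installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    wanted = set(to_zero_based(pages))

    output_string = StringIO()
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    with open(pdf_path, 'rb') as in_file:
        # get_pages() reads from the open file and skips every page
        # whose zero-based index is not in pagenos.
        for page in PDFPage.get_pages(in_file, pagenos=wanted):
            interpreter.process_page(page)

    device.close()
    return output_string.getvalue()
```

With the file from the question, extract_selected_pages('C:\\temp\\AU.pdf', [2]) should return the text of the second page, the same page that page_numbers=[1] selects in extract_text.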