Iterating over a dictionary of PDF files and their names, and putting each name and its extracted text into a new dictionary

by MR

April 14, 2023

I wrote the following code to extract the text of a single PDF file and put it into a list. How can I modify it so that it iterates over a dictionary of PDF files and their names, and builds a new dictionary that maps each name to the corresponding text?

import PyPDF2

dic = {
    '0R.pdf': 'm1',
    '2R.pdf': 'm2',
    '29R.pdf': 'm3',
}

def readpdffile(pdf_file):
    pdfFileObj = open(pdf_file, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    output = []
    for i in range(pdfReader.numPages):
        pageObj = pdfReader.getPage(i)
        output.append(pageObj.extractText())
    return output

Solution:

You can modify the code to iterate over the dictionary of pdf files and their names, and store the extracted text and the corresponding name in a dictionary using the following code:

import PyPDF2

dic = {
    '0R.pdf': 'm1',
    '2R.pdf': 'm2',
    '29R.pdf': 'm3',
}

def read_pdffiles(dictionary):
    """Return {name: [text of page 1, text of page 2, ...]} for each PDF."""
    result = {}
    for pdf_file, name in dictionary.items():
        # The with block closes the file even if reading fails
        with open(pdf_file, 'rb') as pdfFileObj:
            # PdfReader, .pages and extract_text() replace PdfFileReader,
            # numPages/getPage and extractText, which no longer work in PyPDF2 3.0
            pdfReader = PyPDF2.PdfReader(pdfFileObj)
            output = []
            for pageObj in pdfReader.pages:
                output.append(pageObj.extract_text())
        result[name] = output
    return result

result = read_pdffiles(dic)
print(result)

The read_pdffiles function takes a dictionary that maps PDF filenames to names and returns a dictionary that maps each name to the extracted text. For every entry it opens the PDF by its filename, extracts the text of each page with PyPDF2, collects the page texts in a list, and stores that list in the result dictionary under the corresponding name.
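
The looping and the PDF reading can also be kept apart, which makes the dictionary-building step easy to check without real files. The following is only a minimal sketch under that assumption: build_text_dict and fake_extract are hypothetical names that do not appear in the answer, and fake_extract stands in for a real per-file extractor such as the page loop described above.

```python
from typing import Callable, Dict, List


def build_text_dict(
    files: Dict[str, str],
    extract_pages: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Map each name to the list of page texts for its file."""
    result = {}
    for pdf_file, name in files.items():
        # extract_pages is any function that returns one string per page
        result[name] = extract_pages(pdf_file)
    return result


def fake_extract(pdf_file: str) -> List[str]:
    # Hypothetical stand-in for a PDF reader: pretends every file has two pages
    return [f"{pdf_file} page 1", f"{pdf_file} page 2"]


dic = {'0R.pdf': 'm1', '2R.pdf': 'm2', '29R.pdf': 'm3'}
texts = build_text_dict(dic, fake_extract)
print(texts['m1'])
```

Passing the extractor as an argument is a design choice made for this sketch, not something the answer requires; a function that reads a single PDF could be passed in place of fake_extract.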

Call read_pdffiles with dic as its argument and store the return value in a variable such as result. Each key of result is a name from dic ('m1', 'm2', 'm3'), and each value is the list of page texts for that file. Printing result is a quick way to verify the output.
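
Because each value is a list with one string per page, getting a single string per file takes one more step. Here is a small sketch, assuming the pages should simply be joined with newlines; join_pages is a hypothetical helper and the sample data is made up for illustration, shaped like the dictionary read_pdffiles returns.

```python
from typing import Dict, List


def join_pages(result: Dict[str, List[str]], sep: str = "\n") -> Dict[str, str]:
    """Collapse each list of page texts into one string per name."""
    return {name: sep.join(pages) for name, pages in result.items()}


# Made-up data in the same shape as the dictionary of page lists
sample = {'m1': ['first page', 'second page'], 'm2': ['only page']}
joined = join_pages(sample)
print(joined['m1'])
```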