I wrote the following code to extract the text of a single PDF file into a list. How can I modify it so that it iterates over a dictionary of PDF files and their names, and builds a dictionary that maps each name to the corresponding text?
import PyPDF2

dic = {
    '0R.pdf': 'm1',
    '2R.pdf': 'm2',
    '29R.pdf': 'm3',
}

def readpdffile(pdf_file):
    # Open in binary mode; the with-block closes the file when done.
    with open(pdf_file, 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        output = []
        # Collect the text of each page, in page order.
        for i in range(pdfReader.numPages):
            pageObj = pdfReader.getPage(i)
            output.append(pageObj.extractText())
    return output
>Solution :
You can modify the code to iterate over the dictionary of pdf files and their names, and store the extracted text and the corresponding name in a dictionary using the following code:
import PyPDF2

dic = {
    '0R.pdf': 'm1',
    '2R.pdf': 'm2',
    '29R.pdf': 'm3',
}

def read_pdffiles(dictionary):
    result = {}
    for pdf_file, name in dictionary.items():
        # The with-block closes each file even if extraction fails.
        with open(pdf_file, 'rb') as pdfFileObj:
            pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
            output = []
            # Collect the text of each page, in page order.
            for i in range(pdfReader.numPages):
                pageObj = pdfReader.getPage(i)
                output.append(pageObj.extractText())
        # Store the list of page texts under the name, e.g. 'm1'.
        result[name] = output
    return result

result = read_pdffiles(dic)
print(result)
The read_pdffiles function takes a dictionary that maps PDF filenames to names and returns a dictionary that maps each name to the extracted text. For each entry it opens the file, uses PyPDF2 to extract the text of every page into a list (one string per page), and stores that list in the result under the corresponding name.
Calling read_pdffiles(dic) therefore returns a dictionary with the keys 'm1', 'm2' and 'm3', each holding the list of page texts for its file. Printing the result lets you verify the output.
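Note that the PdfFileReader, numPages, getPage and extractText names used above come from older PyPDF2 releases; newer releases rename them to PdfReader, pages and extract_text, and as far as I know the old names no longer work in PyPDF2 3.x. The sketch below shows the same name-to-text mapping with the newer names. It is only one possible arrangement: the helper extract_pages and the extract parameter are hypothetical names introduced here (not part of PyPDF2), added so the per-file step can be swapped out or tested without real PDF files.

```python
from typing import Callable, Dict, List


def extract_pages(pdf_path: str) -> List[str]:
    """Return the text of each page of one PDF, in page order."""
    # Imported lazily so this module still loads where PyPDF2 is absent.
    # Assumes a PyPDF2 release that provides PdfReader, .pages and
    # .extract_text() (the newer names for the calls used above).
    from PyPDF2 import PdfReader

    with open(pdf_path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        # The list is built before the with-block closes the file.
        return [page.extract_text() for page in reader.pages]


def read_pdf_files(
    files: Dict[str, str],
    extract: Callable[[str], List[str]] = extract_pages,
) -> Dict[str, List[str]]:
    """Map each name in `files` to the list of page texts of its PDF.

    `files` maps a PDF path to a name, like `dic` in the question.
    `extract` is a hypothetical hook: any function that takes a path
    and returns a list of page strings.
    """
    result: Dict[str, List[str]] = {}
    for pdf_path, name in files.items():
        result[name] = extract(pdf_path)
    return result
```

With the dictionary from the question, read_pdf_files(dic) should return the same shape of result as read_pdffiles(dic): the keys 'm1', 'm2' and 'm3', each with a list of page texts. If two filenames share the same name, the later one overwrites the earlier one, just as in the original function.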