pdf form data extraction with pypdf : how to get only key+values?

I’m not a Python guru (I’m more used to R).

I use the pypdf package (v3.4.1) to extract data from a PDF form that I created and filled in with Acrobat.

I can read the form fields with

from pypdf import PdfReader

f = PdfReader('test_formulaire.pdf')
ffields = f.get_fields()

ffields is a dict of size 3 (3 keys: ‘a1’, ‘a2’, ‘a5’). Each value in the dict is a Field class object.

I can access the value of a key with print(ffields['a1'].value)

I now want to create a pandas dataframe with a column for each key of ffields (3 columns, named with the key name) and a row containing the values of each key…

Is there a quick and easy way to do it?

I can create an empty dataframe with the column names with something like this (probably far from optimal):

column_names = ["" for x in range(len(ffields))]
idx=0
for i in ffields:
    column_names[idx]=i
    idx+=1

data = pd.DataFrame(columns=column_names)

And filling it should be possible with more for loops, but that seems ugly… (note that some values are numbers and others are strings).

Does anybody have a hint for doing this efficiently?

Thanks in advance

>Solution :

You can create a pandas dataframe with the values of the form fields by looping through the keys of ffields and appending the values to a list. Here’s an example:

import pandas as pd
from pypdf import PdfReader

# Read the PDF form
pdf = PdfReader('test_formulaire.pdf')

# Get the form fields
ffields = pdf.get_fields()

# Initialize lists for each column
a1_values = []
a2_values = []
a5_values = []

# Loop through the keys of ffields and append the values to the appropriate list
for key in ffields.keys():
    if key == 'a1':
        a1_values.append(ffields[key].value)
    elif key == 'a2':
        a2_values.append(ffields[key].value)
    elif key == 'a5':
        a5_values.append(ffields[key].value)

# Create the pandas dataframe
data = pd.DataFrame({
    'a1': a1_values,
    'a2': a2_values,
    'a5': a5_values
})

print(data)

This outputs a pandas dataframe with three columns named ‘a1’, ‘a2’, and ‘a5’, and a single row holding the value of each form field. If a form field has no value, its .value is typically None, which pandas treats as missing (isna() returns True) even though it is not literally a NaN.

Note that if the number of form fields is large and you want to automate the creation of the column names, you can convert the keys of ffields to a list and pass it directly to the columns parameter of the pd.DataFrame constructor:

column_names = list(ffields.keys())
data = pd.DataFrame(columns=column_names)

This creates an empty pandas dataframe (no rows yet) with columns named after the keys in ffields.
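To get the column names and the values in one step, without hard-coding ‘a1’, ‘a2’, and ‘a5’, a dict comprehension over ffields may be enough. The following is a minimal sketch, not tested against the asker's PDF: fields_to_frame is a hypothetical helper name, and the SimpleNamespace objects with made-up values only stand in for the Field objects returned by get_fields() so that the example runs without a PDF file.

```python
from types import SimpleNamespace

import pandas as pd


def fields_to_frame(fields):
    """Build a one-row DataFrame with one column per field name."""
    # Assumes each object in `fields` exposes a `.value` attribute,
    # as the Field objects in the question do.
    row = {name: field.value for name, field in fields.items()}
    # A list containing one dict yields exactly one row.
    return pd.DataFrame([row])


# Stand-in for PdfReader('test_formulaire.pdf').get_fields();
# the field values below are invented for illustration.
ffields = {
    "a1": SimpleNamespace(value="Dupont"),
    "a2": SimpleNamespace(value=42),
    "a5": SimpleNamespace(value=None),
}

data = fields_to_frame(ffields)
print(data)
```

With a real form, passing the result of get_fields() to the helper should work the same way, since the comprehension only relies on the keys and on .value. Numbers and strings can coexist because each column gets its own dtype.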
