I’m not a Python guru (more used to R).
I use the pypdf package (v3.4.1) to extract data from a PDF form I created and filled in with Acrobat.
I can read the form fields with
f = PdfReader('test_formulaire.pdf')
ffields = f.get_fields()
ffields is a dict of size 3 (3 keys: 'a1', 'a2', 'a5'). Each value in the dict is a Field class object.
I can access the value of a key with print(ffields['a1'].value)
I now want to create a pandas dataframe with a column for each key of ffields (3 columns, named with the key name) and a row containing the values of each key…
Is there a quick and easy way to do it ?
I can create an empty dataframe with the column names with something like that (probably far from optimal) :
column_names = ["" for x in range(len(ffields))]
idx = 0
for i in ffields:
    column_names[idx] = i
    idx += 1
data = pd.DataFrame(columns=column_names)
And filling it should be possible with other for loops, but that seems ugly… (note that some values are numbers and others are strings).
Does anybody have a hint for doing this efficiently?
Thanks in advance
>Solution :
You can create a pandas dataframe with the values of the form fields by looping through the keys of ffields and appending the values to a list. Here’s an example:
import pandas as pd
from pypdf import PdfReader
# Read the PDF form
pdf = PdfReader('test_formulaire.pdf')
# Get the form fields
ffields = pdf.get_fields()
# Initialize lists for each column
a1_values = []
a2_values = []
a5_values = []
# Loop through the keys of ffields and append the values to the appropriate list
for key in ffields.keys():
    if key == 'a1':
        a1_values.append(ffields[key].value)
    elif key == 'a2':
        a2_values.append(ffields[key].value)
    elif key == 'a5':
        a5_values.append(ffields[key].value)
# Create the pandas dataframe
data = pd.DataFrame({
    'a1': a1_values,
    'a2': a2_values,
    'a5': a5_values
})
print(data)
This will output a pandas dataframe with three columns named ‘a1’, ‘a2’, and ‘a5’, and a single row containing the value of each form field. If a form field is empty, its value is typically None, which pandas treats as a missing value.
Note that if the number of form fields is large and you want to automate the creation of the column names, you can take the keys of ffields as a list and pass them directly to the columns parameter of the pd.DataFrame constructor:
column_names = list(ffields.keys())
data = pd.DataFrame(columns=column_names)
This will create an empty pandas dataframe (no rows yet) with columns named after the keys in ffields.
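If the goal is simply one column per field and one row of values, the per-key lists can probably be skipped: a dict comprehension over ffields gives a {name: value} mapping, and a list containing that single dict can be passed to pd.DataFrame. The sketch below is only an illustration under stated assumptions: FakeField and fields_to_frame are hypothetical names, and FakeField merely stands in for pypdf's Field objects (only the .value attribute is used) so the example runs without a PDF file.

```python
import pandas as pd


class FakeField:
    """Hypothetical stand-in for a pypdf Field; only .value is used here."""

    def __init__(self, value):
        self.value = value


def fields_to_frame(fields):
    """Build a one-row DataFrame with one column per field name."""
    # Map each field name to its value, whatever the number of fields.
    row = {name: field.value for name, field in fields.items()}
    # A list holding one dict yields a DataFrame with exactly one row.
    return pd.DataFrame([row])


# Assumption: in real use this dict would come from
# PdfReader('test_formulaire.pdf').get_fields()
ffields = {
    "a1": FakeField("12"),
    "a2": FakeField("hello"),
    "a5": FakeField(None),  # an empty form field
}

data = fields_to_frame(ffields)
print(data)
```

With the real reader, fields_to_frame(ffields) would be called on the dict returned by get_fields(). If some columns hold numbers stored as text, pd.to_numeric can be applied to those columns afterwards.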