
Python – inserting header into a csv

I’m developing a script that loops over all the PDF files in a directory, extracts their text, and writes each file's text into its own cell of a CSV file. Writing the output into the cells works. However, I need the CSV file to contain the header "text" so that I can merge it with another CSV, and so far my attempts to insert that header with csv.writer have failed.

For example, the code below successfully extracts and inserts the text from pdfs, but writes a new header for every file extracted:

import pdfplumber
import csv
import glob

pdfs = glob.glob(r"dir\*.pdf")  # raw string, so "\*" is not read as an escape sequence

for pf in pdfs:
    with pdfplumber.open(pf) as pdf, \
        open("pdf_output.csv", "a", newline="", encoding="utf-8") as f_output:

        csv_output = csv.writer(f_output)
        csv_output.writerow(['text']) # code for inserting header
        text = []

        for page in pdf.pages:
            extracted_text = page.extract_text()

            if extracted_text:  
                text.append(extracted_text)

        csv_output.writerow([' '.join(text)])

The other approach I’ve attempted is likewise unsuccessful. I tried to first write the header into the csv, and append the output of the loop to the csv. However, for some reason the formatting of the pdf output is completely disrupted, with text scattered across multiple cells instead of a single cell.

pdfs = glob.glob(r"dir\*.pdf")  # raw string, so "\*" is not read as an escape sequence

# code for writing header
file = open("pdf_output.csv", "w", newline="")
writer = csv.writer(file)
headers = ['text']
writer.writerow(headers)

for pf in pdfs:
    with pdfplumber.open(pf) as pdf, \
        open("pdf_output.csv", "a", newline="", encoding="utf-8") as f_output:

        csv_output = csv.writer(f_output)

        text = []

        for page in pdf.pages:
            extracted_text = page.extract_text()

            if extracted_text:  
                text.append(extracted_text)

        csv_output.writerow([' '.join(text)])

Any suggestions on workarounds or better approaches for this challenge would be immensely welcome.

>Solution :
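
A likely reason the second attempt scattered text across cells: the header is written through a file handle that is never closed, so it sits in that handle's buffer while the loop appends rows through other handles. When the buffer is finally flushed (typically at script exit), the header lands at the start of the file and overwrites the first characters of the first appended row, including the opening quote that kept its commas and line breaks inside one cell. The sketch below reproduces this with plain strings instead of PDFs; the function name `demo_unclosed_header` and the sample text are made up for illustration.

```python
import csv
import os
import tempfile


def demo_unclosed_header(csv_path):
    """Reproduce the scattered-cells symptom with plain strings (no PDFs)."""
    # The header goes through a handle that is left open, so the row
    # waits in that handle's buffer instead of reaching the disk.
    header_file = open(csv_path, "w", newline="", encoding="utf-8")
    csv.writer(header_file).writerow(["text"])

    # A second handle appends one row; the comma and line break make
    # csv.writer wrap the value in double quotes.
    with open(csv_path, "a", newline="", encoding="utf-8") as f_output:
        csv.writer(f_output).writerow(["page one, with a comma\nand a line break"])

    # Closing (or exiting the script) finally flushes the header. The "w"
    # handle still points at offset 0, so the header overwrites the first
    # characters of the appended row, including its opening quote.
    header_file.close()

    with open(csv_path, newline="", encoding="utf-8") as f_input:
        return list(csv.reader(f_input))


with tempfile.TemporaryDirectory() as tmp_dir:
    rows = demo_unclosed_header(os.path.join(tmp_dir, "demo.csv"))
    print(rows)  # the header row, then the single cell split into pieces
```

Closing the header file before any row is appended, for example with a `with` block, avoids this.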

You could open the csv first, insert your header, then iterate through your PDFs:

import pdfplumber
import csv
import glob

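pdfs = glob.glob(r"dir\*.pdf")

# Write the header once; "w" starts from an empty file, so reruns don't add a second header
with open("pdf_output.csv", "w", newline="", encoding="utf-8") as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['text'])

# The header file is closed at this point, so appending rows is safe
for pf in pdfs:
    with pdfplumber.open(pf) as pdf, \
        open("pdf_output.csv", "a", newline="", encoding="utf-8") as f_output:
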
        csv_output = csv.writer(f_output)
        text = []

        for page in pdf.pages:
            extracted_text = page.extract_text()

            if extracted_text:  
                text.append(extracted_text)

        csv_output.writerow([' '.join(text)])

Or just check whether it's the first iteration:

import pdfplumber
import csv
import glob

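pdfs = glob.glob(r"dir\*.pdf")

for i, pf in enumerate(pdfs):
    # Start a fresh file for the first PDF, then append for the rest
    mode = "w" if i == 0 else "a"
    with pdfplumber.open(pf) as pdf, \
        open("pdf_output.csv", mode, newline="", encoding="utf-8") as f_output:

        csv_output = csv.writer(f_output)
        if i == 0:
            csv_output.writerow(['text'])  # header only on the first iteration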

        text = []

        for page in pdf.pages:
            extracted_text = page.extract_text()

            if extracted_text:  
                text.append(extracted_text)

        csv_output.writerow([' '.join(text)])
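
A third option, sketched here as one possible arrangement rather than the only one, is to open the CSV a single time, write the header, and keep the same writer for every PDF. That removes the repeated open/append step entirely. The helper names `extract_pdf_text` and `write_text_csv` are hypothetical; the pdfplumber calls are the same ones used above.

```python
import csv


def extract_pdf_text(pdf_path):
    """Join the text of every page of one PDF into a single string."""
    import pdfplumber  # imported here so the CSV helper works without it

    with pdfplumber.open(pdf_path) as pdf:
        pages = (page.extract_text() for page in pdf.pages)
        # Skip pages where no text was extracted
        return " ".join(text for text in pages if text)


def write_text_csv(texts, csv_path, header="text"):
    """Write a one-column CSV: the header once, then one row per text."""
    # "w" starts from an empty file, so reruns never stack up old rows.
    with open(csv_path, "w", newline="", encoding="utf-8") as f_output:
        csv_output = csv.writer(f_output)
        csv_output.writerow([header])
        for text in texts:
            csv_output.writerow([text])
```

With `import glob` added, it could be called as `write_text_csv((extract_pdf_text(pf) for pf in glob.glob(r"dir\*.pdf")), "pdf_output.csv")`. Because there is only one file handle, the header is always written before the first row.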