Take text file and create csv file

February 12, 2024

I have a larger Python 3 program that processes OCR output and does some bubble detection, and I have it mostly worked out. One function, which I got from Stack Overflow, works but has a weird side effect. Since I do not understand that code very well, I would like some help coming up with something that behaves the way I want.

Here is the code I am using now:
Link

How it works:
I have a text file we can call address.txt that looks like this:

First Name,
Address,
City State Zip,
Second Name,
Second Address,
Second City State Zip,

I would like to convert that to this:

First Name, Address, City State Zip,
Second Name, Second Address, Second City State Zip,

Ideally I would have it write address.txt in the format I want from the start, rather than create the file and then edit it afterwards with the Stack Overflow function linked above. Here is my function that reads the images, creates the file, and adds a comma at the end of each line.
If I could get it to put every three lines on one line, I would not need the above code at all.

def tess_address():
    files = os.listdir("address")
    sorted_files = sorted(files)
    for image in sorted_files:
        # read image
        output = "address/" + image
        # Pass the image through pytesseract
        text = pytesseract.image_to_string(output)
        #remove all commas
        no_comma_text = re.sub(",", "", text)
        for line in no_comma_text.splitlines():
            #print to file
            print(line + ",", file=open("address" + '.txt', 'a', encoding='utf8'))

>Solution :

Since I don't have your files, I could not test this and could only reason it through. To make your tess_address function write the OCR output directly in the desired format, without a separate step to edit the file afterwards, adjust the loop that processes each line. Instead of appending a comma to each line and writing it straight to the file, accumulate lines in groups of three and write each group as a single line, as shown below.

import os
import pytesseract
import re

def tess_address():
    # Folder that holds the images to OCR; create it if missing so listdir does not fail
    image_dir = "address"
    os.makedirs(image_dir, exist_ok=True)

    files = os.listdir(image_dir)
    sorted_files = sorted(files)
    # Write the CSV outside the image folder so a later run does not try to OCR it
    output_file_path = 'addresses.csv'

    with open(output_file_path, 'w', encoding='utf8') as output_file:
        for image in sorted_files:
            # Construct the full path for the image
            image_path = os.path.join(image_dir, image)

            # Pass the image through pytesseract
            text = pytesseract.image_to_string(image_path)

            # Remove all commas from the OCR output
            no_comma_text = re.sub(",", "", text)

            # Initialize a list to accumulate lines
            accumulated_lines = []

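            for line in no_comma_text.splitlines():
                line = line.strip()
                # Skip blank lines, which OCR output often contains, so the groups of three stay aligned
                if not line:
                    continue
                accumulated_lines.append(line)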
                # Once we have three lines accumulated, write them as a single line in the CSV
                if len(accumulated_lines) == 3:
                    # Join the three lines with commas, add a trailing comma, and write to the file
                    output_file.write(', '.join(accumulated_lines) + ',\n')
                    # Reset the accumulator for the next group of lines
                    accumulated_lines = []

            # Handle any remaining lines in case the total number is not a multiple of three
            if accumulated_lines:
                output_file.write(', '.join(accumulated_lines) + ',\n')
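
The grouping step can also be checked on its own, without running OCR. What follows is only a minimal sketch, not part of the function above: group_lines and write_rows are hypothetical helper names, and it assumes every record is exactly three non-blank lines. It uses the standard csv module, which quotes a field when it contains a comma, so the commas would not need to be stripped first.

```python
import csv
import io


def group_lines(text, size=3):
    """Drop blank lines, then group the remaining lines `size` at a time."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def write_rows(rows, stream):
    """Write each group as one CSV row; csv.writer handles any quoting."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerows(rows)


# Hypothetical sample text standing in for the OCR output of one image
sample = (
    "First Name\nAddress\n\nCity State Zip\n"
    "Second Name\nSecond Address\nSecond City State Zip\n"
)

rows = group_lines(sample)
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

Note that this writes standard CSV rows, with no space after each comma and no trailing comma. If the exact layout from the question is needed, the ', '.join(...) + ',' approach above is the one to keep, and group_lines can still be reused for the grouping.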