I have a larger Python 3 program that processes OCR output and does some bubble detection, and I have it mostly worked out. One function, which I got off Stack Overflow, works but has a weird side effect. Since I do not understand that code very well, I would like some help coming up with something that works the way I want.
Here is the code I am using now:
Link
How it works:
I have a text file we can call address.txt that looks like this:
First Name,
Address,
City State Zip,
Second Name,
Second Address,
Second City State zip,
I would like to convert that to this:
First Name, Address, City State Zip,
Second Name, Second Address, Second City State zip,
Ideally, address.txt would be written in the format I want from the start, rather than creating the file and then editing it afterwards with the above function I picked up from Stack Overflow. If I could get every three lines joined into one line as the file is written, I would not need the above code at all.
Here is my function that reads the images, creates the file, and adds a comma at the end of each line:
def tess_address():
    files = os.listdir("address")
    sorted_files = sorted(files)
    for image in sorted_files:
        # read image
        output = "address/" + image
        # Pass the image through pytesseract
        text = pytesseract.image_to_string(output)
        # remove all commas
        no_comma_text = re.sub(",", "", text)
        for line in no_comma_text.splitlines():
            # print to file
            print(line + ",", file=open("address" + '.txt', 'a', encoding='utf8'))
>Solution :
Since I don't have your images or output file, I could not test this, so treat it as a sketch of the approach. To have tess_address write the OCR output directly in the desired format, with no separate step to edit the file afterwards, adjust the loop that processes each line. Instead of appending a comma to each line and writing it straight to the file, accumulate lines in groups of three and write each group as a single comma-separated line, as below.
import os
import pytesseract


def tess_address():
    # Folder that holds the scanned address images
    image_dir = "address"
    # Write the result outside the image folder, so a later run does not
    # pick the output file up from os.listdir() and pass it to pytesseract
    output_file_path = 'addresses.csv'
    with open(output_file_path, 'w', encoding='utf8') as output_file:
        for image in sorted(os.listdir(image_dir)):
            # Construct the full path for the image
            image_path = os.path.join(image_dir, image)
            # Pass the image through pytesseract
            text = pytesseract.image_to_string(image_path)
            # Remove all commas from the OCR output
            no_comma_text = text.replace(",", "")
            # Initialize a list to accumulate lines
            accumulated_lines = []
            for line in no_comma_text.splitlines():
                # Skip blank lines, which OCR output commonly contains
                if not line.strip():
                    continue
                accumulated_lines.append(line.strip())
                # Once we have three lines accumulated, write them as a single line in the CSV
                if len(accumulated_lines) == 3:
                    # Join the three lines with commas, add a trailing comma, and write to the file
                    output_file.write(', '.join(accumulated_lines) + ',\n')
                    # Reset the accumulator for the next group of lines
                    accumulated_lines = []
            # Handle any remaining lines in case the total number is not a multiple of three
            if accumulated_lines:
                output_file.write(', '.join(accumulated_lines) + ',\n')
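The grouping step can also be pulled out into a small pure function, which makes it possible to check the formatting without running Tesseract. This is only a sketch: group_lines and its size parameter are hypothetical names that are not part of the function above, and it assumes each address comes out of OCR as exactly three non-blank lines.

```python
def group_lines(text, size=3):
    """Join every `size` non-blank lines of `text` into one comma-separated row.

    Hypothetical helper for illustration; assumes one address = `size` lines.
    """
    # Drop commas, strip surrounding whitespace, and ignore blank lines
    cleaned = text.replace(",", "")
    lines = [ln.strip() for ln in cleaned.splitlines() if ln.strip()]
    rows = []
    for start in range(0, len(lines), size):
        # Slicing past the end is safe, so a short final group is kept as-is
        rows.append(", ".join(lines[start:start + size]) + ",")
    return rows


# Sample text standing in for OCR output (made up for this example)
sample = (
    "First Name\nAddress\nCity State Zip\n\n"
    "Second Name\nSecond Address\nSecond City State zip\n"
)
for row in group_lines(sample):
    print(row)
# First Name, Address, City State Zip,
# Second Name, Second Address, Second City State zip,
```

Inside tess_address, the inner loop could then be reduced to writing each row returned by group_lines(text) followed by a newline. Keeping the grouping separate from the OCR call also makes it easier to see what happens when an image yields more or fewer than three lines.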