Tesseract OCR accents problems, image enhancement not enough

I really need your help with Tesseract.
I’m using Tesseract and pdf2image to extract information from a scanned PDF file.
My problem is that Tesseract gets the accented characters é, è and ê wrong (I’m French), and confuses lowercase "i" with uppercase "I".
I tried preprocessing the images first but still can’t get good output.

This is the code I’m using:

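import os
from tkinter.filedialog import askopenfilename

import cv2
import numpy as np
import pytesseract
from pdf2image import convert_from_path

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'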

filePath = askopenfilename()
img = convert_from_path(filePath,poppler_path=r'C:\poppler-0.68.0_x86\poppler-0.68.0\bin')
path, fileName = os.path.split(filePath)
fileBaseName, fileExtension = os.path.splitext(fileName)


for page_number in range(len(img)):
    img[page_number].save(r'C:\Users\488096\Documents\page'+ str(page_number) +'.jpg', 'JPEG')

    
work_img = None
txt = ""  # accumulated OCR text; must exist before the loop appends to it
# Tesseract
custom_config = r'--oem 3 --psm 6'
kernel = np.ones((1, 1), np.uint8)

for page_number in range(len(img)):
    img1 = cv2.imread(r'C:\Users\488096\Documents\page'+ str(page_number) +'.jpg')
    # Preprocess the image to improve character recognition
    gray = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
    # Remove shadows
    cool_img = cv2.dilate(gray, kernel, iterations=1)
    norm_img = cv2.erode(cool_img, kernel, iterations=1)
    # Threshold using Otsu's
    work_img = cv2.threshold(cv2.bilateralFilter(norm_img, 5, 75, 75), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

    # Run OCR on the preprocessed page and append the result to the accumulated text
    txt = txt + pytesseract.image_to_string(work_img, config=custom_config)
    print("Page # {} - {}".format(page_number, txt))

What can I do to get good results?
Thanks a lot!

>Solution :

You may need to install the French language pack (the fra language data) and pass that language to Tesseract. More info here:

https://pyimagesearch.com/2020/08/03/tesseract-ocr-for-non-english-languages/
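
As a rough sketch of what that could look like with pytesseract: the question's config string stays the same, and the language is passed through the lang argument of image_to_string. The helper names build_ocr_kwargs and ocr_page are made up for illustration, and the check relies on pytesseract.get_languages, which I am assuming is available in your pytesseract version (it was added in a fairly recent release).

```python
from typing import Any, Dict


def build_ocr_kwargs(lang: str = "fra", oem: int = 3, psm: int = 6) -> Dict[str, Any]:
    """Build keyword arguments for pytesseract.image_to_string.

    Hypothetical helper: it only bundles the language code with the
    --oem/--psm settings used in the question.
    """
    return {"lang": lang, "config": f"--oem {oem} --psm {psm}"}


def ocr_page(image, lang: str = "fra") -> str:
    """OCR one page (PIL image or NumPy array) with an explicit language."""
    # Imported lazily so build_ocr_kwargs can be used without Tesseract installed.
    import pytesseract

    # Fail early if the requested language data is missing.
    # Assumption: get_languages() exists in the installed pytesseract version.
    installed = pytesseract.get_languages(config="")
    missing = [code for code in lang.split("+") if code not in installed]
    if missing:
        raise RuntimeError(
            f"Missing Tesseract language data: {missing}; installed: {installed}"
        )
    return pytesseract.image_to_string(image, **build_ocr_kwargs(lang))
```

In the loop from the question, this would replace the image_to_string call, for example txt = txt + ocr_page(work_img). Without a lang argument Tesseract falls back to English, which has no é, è or ê in its model, so that alone may explain the accent errors.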

You can also use ocrmypdf, which I find the easiest way to get text out of PDFs: https://pypi.org/project/ocrmypdf/
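
A minimal sketch of calling ocrmypdf from Python through its command line, assuming the ocrmypdf executable is on your PATH and the French language data is installed. The function names and file names are placeholders. The -l option selects the OCR language and --sidecar writes the recognised text to a separate text file.

```python
import subprocess
from typing import List, Optional


def build_ocrmypdf_command(
    input_pdf: str,
    output_pdf: str,
    lang: str = "fra",
    sidecar_txt: Optional[str] = None,
) -> List[str]:
    """Build the ocrmypdf command line (hypothetical helper for illustration)."""
    cmd = ["ocrmypdf", "-l", lang]
    if sidecar_txt is not None:
        # --sidecar also writes the recognised text to a plain-text file.
        cmd += ["--sidecar", sidecar_txt]
    cmd += [input_pdf, output_pdf]
    return cmd


def run_ocrmypdf(
    input_pdf: str,
    output_pdf: str,
    lang: str = "fra",
    sidecar_txt: Optional[str] = None,
) -> None:
    """Run ocrmypdf and raise CalledProcessError if it exits with an error."""
    subprocess.run(
        build_ocrmypdf_command(input_pdf, output_pdf, lang, sidecar_txt),
        check=True,
    )
```

For example, run_ocrmypdf("scan.pdf", "scan_ocr.pdf", sidecar_txt="scan.txt") would produce a searchable PDF plus a text file with the extracted French text.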
