Tesseract OCR problems with accents, image enhancement is not enough

September 16, 2022

I really need your help with Tesseract.
I’m using Tesseract and pdf2image to extract information from a scanned PDF file.
My problem is that Tesseract gets the accented characters é, è and ê wrong (I’m French) and confuses lowercase "i" with uppercase "I".
I tried preprocessing the images first, but I still can’t get good output.

This is the code I’m using:

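import os
from tkinter.filedialog import askopenfilename

import cv2
import numpy as np
import pytesseract
from pdf2image import convert_from_path

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'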

filePath = askopenfilename()
img = convert_from_path(filePath,poppler_path=r'C:\poppler-0.68.0_x86\poppler-0.68.0\bin')
path, fileName = os.path.split(filePath)
fileBaseName, fileExtension = os.path.splitext(fileName)


for page_number in range(len(img)):
    img[page_number].save(r'C:\Users\488096\Documents\page'+ str(page_number) +'.jpg', 'JPEG')

    
work_img = None
txt = ''  # accumulates the OCR text of all pages
# Tesseract
custom_config = r'--oem 3 --psm 6'
kernel = np.ones((1, 1), np.uint8)

for page_number in range(len(img)):
    img1 = cv2.imread(r'C:\Users\488096\Documents\page'+ str(page_number) +'.jpg')
    # Preprocess the image to get better character recognition
    gray = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
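    # Intended to remove shadows: dilate, then erode (a morphological closing).
    # Note: with a 1x1 kernel, both operations leave the image unchanged.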
    cool_img = cv2.dilate(gray, kernel, iterations=1)
    norm_img = cv2.erode(cool_img, kernel, iterations=1)
    # Smooth with a bilateral filter, then binarize with Otsu's threshold
    work_img = cv2.threshold(cv2.bilateralFilter(norm_img, 5, 75, 75), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

    # Run OCR on the preprocessed page and append its text to the running total
    page_text = pytesseract.image_to_string(work_img, config=custom_config)
    txt = txt + page_text
    print("Page # {} - {}".format(page_number, page_text))
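
One thing the code above never does is tell Tesseract which language to use, so it falls back to its default English model, which may explain the trouble with é, è and ê. Below is a minimal sketch of the same OCR call with the language passed explicitly. It assumes the French language data (fra) is installed alongside Tesseract; the function names `build_config` and `ocr_pdf_french` and the `dpi=300` value are illustrative choices, not part of the original code.

```python
def build_config(oem=3, psm=6):
    """Return the Tesseract options string used for each page."""
    return f"--oem {oem} --psm {psm}"


def ocr_pdf_french(pdf_path, lang="fra", dpi=300, poppler_path=None):
    """OCR every page of a scanned PDF with an explicit language.

    Returns a list with one string of recognized text per page.
    """
    # Imported lazily so build_config() can be used without the OCR stack.
    import pytesseract
    from pdf2image import convert_from_path

    # Fail early if the requested language data is missing.
    # "fra+eng" style values are split into their individual languages.
    installed = pytesseract.get_languages(config="")
    missing = [code for code in lang.split("+") if code not in installed]
    if missing:
        raise RuntimeError(f"Tesseract language data not installed: {missing}")

    # Render pages straight to PIL images (no lossy JPEG round trip).
    # dpi=300 is an assumption; pdf2image defaults to 200.
    pages = convert_from_path(pdf_path, dpi=dpi, poppler_path=poppler_path)

    config = build_config()
    return [
        pytesseract.image_to_string(page, lang=lang, config=config)
        for page in pages
    ]
```

With the paths from the code above, this could be called as `ocr_pdf_french(filePath, poppler_path=r'C:\poppler-0.68.0_x86\poppler-0.68.0\bin')`. Passing the rendered pages directly also makes it easy to compare the result with and without the OpenCV preprocessing.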

What can I do to get good results?
Thanks a lot!