My task is to identify and extract text with strikethrough symbols from an image. I want to select only the words that have this symbol and place each instance in a list.
Code I have tried:
    from PIL import Image
    import pytesseract

    # Open the image file
    img_path = 'path/to/image.png'
    img = Image.open(img_path)

    # Use tesseract to do OCR on the image
    text = pytesseract.image_to_string(img)
    text
The issue is that the output includes every word, with no indication of which ones are struck through. If the string contained a marker for a struck-through word or phrase, such as '-', I could process it further; however, plain pytesseract does not detect the strikethrough in this image.
A better approach is needed. The desired output is:
['Once upon a time', 'Jack', 'village']
I had some partial success extracting the words by looking at the per-word confidence scores, though the strikethrough also makes the recognized text inaccurate. This could be improved by using each word's bounding box and something like OpenCV to clean up the strikethrough.
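    from PIL import Image
    import pytesseract

    # Open the image file
    img_path = 'path/KrDdO.png'
    img = Image.open(img_path)

    # Use tesseract to do OCR on the image, returning per-word data
    text = pytesseract.image_to_data(img, output_type='dict')

    # Print only words with a low (but valid) confidence;
    # entries that are not words have a confidence of -1
    for word, conf in zip(text['text'], text['conf']):
        if 0 < conf < 93:
            print(word, conf)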
    Onceupon-atime, 72
    Jaek 91
    viage 31
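The bounding-box idea can be sketched without relying on confidence at all. The following is only a rough illustration, not a tested solution for this image: it uses plain Python lists instead of OpenCV, and it assumes the page has already been binarized into rows of 0/1 values (1 = ink). The function names `has_strikethrough` and `struck_words`, the 0.8 coverage threshold, and the 30%-70% vertical band are all assumptions made for the example. The `left`, `top`, `width`, and `height` keys are the ones returned by `pytesseract.image_to_data`.

```python
def has_strikethrough(binary_rows, min_coverage=0.8, band=(0.3, 0.7)):
    """Return True if a row in the vertical middle of a word crop is mostly ink.

    binary_rows: 2D sequence for one word crop, 1 = ink, 0 = background.
    min_coverage and band are illustrative guesses, not tuned values.
    """
    height = len(binary_rows)
    if height == 0:
        return False
    # Only look at the middle band, so underlines and overlines are ignored.
    start = int(height * band[0])
    stop = max(start + 1, int(height * band[1]))
    for row in binary_rows[start:stop]:
        if row and sum(row) / len(row) >= min_coverage:
            return True
    return False


def struck_words(data, binary_image, **kwargs):
    """Collect words whose bounding box contains a strikethrough-like row.

    data: dict shaped like pytesseract.image_to_data(..., output_type='dict').
    binary_image: 2D sequence of 0/1 values for the whole page.
    """
    found = []
    boxes = zip(data['text'], data['left'], data['top'],
                data['width'], data['height'])
    for text, left, top, width, height in boxes:
        if not text.strip():
            continue  # skip block, paragraph, and line entries with no text
        crop = [row[left:left + width]
                for row in binary_image[top:top + height]]
        if has_strikethrough(crop, **kwargs):
            found.append(text)
    return found
```

With a real image, `binary_image` could come from thresholding the grayscale page, and `data` from the `image_to_data` call above. The words returned would still be the misread versions (such as `Jaek`), so the line would have to be erased inside each flagged box and the crop run through OCR again to recover the correct text. A single-row check like this is also sensitive to skewed scans and thin or broken lines.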