How to properly extract Japanese text from PDF files

I need to extract the text from PDF files.

The problem is that some pages of the files are scanned images, so their text can't be retrieved using PyPDF or PDFMiner. The extracted text for those pages is empty.

Could anyone please give me a hint on how to proceed?

>Solution :

I don't think there's a quick solution for handling Unicode text, especially Japanese.

One approach we could take:

  • Iterate over the pages and determine whether each page is scanned or not. This can be done with PyMuPDF; take a look at this answer.
  • If the page is not scanned, we can extract its text as usual.
  • If the page is scanned, we can convert it to a .png image using pdf2image, then use pytesseract to extract the text. Below is sample code showing how to read data from an image.
  • You might need to do some extra post-processing to get the words right.
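import cv2
import pytesseract
from pytesseract import Output

# Load the page image (OpenCV reads it as BGR) and convert it to RGB for Tesseract.
img = cv2.imread('invoice-sample.jpg')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# lang='jpn' assumes the Japanese language data for Tesseract is installed.
d = pytesseract.image_to_data(img, lang='jpn', output_type=Output.DICT)
print(d.keys())  # parallel lists such as 'text', 'conf', 'left', 'top'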
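The "extra data work" mentioned in the last bullet could look roughly like the sketch below. It takes the dictionary returned by `image_to_data` (the `d` from the sample above) and groups the recognized tokens back into lines. The function name `words_to_lines` and the `min_conf` threshold are hypothetical names chosen for illustration, and joining tokens without spaces is an assumption that suits Japanese but not mixed-language pages.

```python
def words_to_lines(data, min_conf=0.0):
    """Group OCR tokens from image_to_data's dict output into text lines."""
    lines = {}
    for i, word in enumerate(data["text"]):
        # Skip empty tokens and tokens below the confidence threshold.
        # conf may be a str, int, or float depending on the pytesseract version.
        if not word.strip() or float(data["conf"][i]) < min_conf:
            continue
        key = (
            data["page_num"][i],
            data["block_num"][i],
            data["par_num"][i],
            data["line_num"][i],
        )
        lines.setdefault(key, []).append(word)
    # Japanese is written without spaces, so tokens are joined directly.
    return ["".join(tokens) for _, tokens in sorted(lines.items())]
```

Calling `words_to_lines(d, min_conf=50)` would then give one string per detected line; the right threshold depends on scan quality and needs tuning on real pages.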

For more on Tesseract, see this article.
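Putting the steps above together, a minimal end-to-end sketch might look like this. It assumes PyMuPDF, pdf2image (which needs Poppler), and pytesseract with the Japanese language data are installed. The helper names `is_scanned` and `extract_pdf_text` are hypothetical, and treating a page with an empty text layer as scanned is a simplifying assumption, not necessarily the method from the linked answer.

```python
def is_scanned(text_layer, min_chars=1):
    """Heuristic: a page whose text layer is (almost) empty is treated as scanned."""
    return len(text_layer.strip()) < min_chars


def extract_pdf_text(path, lang="jpn", dpi=300):
    """Return one string per page, falling back to OCR for scanned pages."""
    # Third-party imports are kept local so is_scanned() works without them.
    import fitz  # PyMuPDF
    import pytesseract
    from pdf2image import convert_from_path

    texts = []
    with fitz.open(path) as doc:
        for index, page in enumerate(doc):
            text = page.get_text()
            if is_scanned(text):
                # pdf2image page numbers are 1-based.
                image = convert_from_path(
                    path, dpi=dpi, first_page=index + 1, last_page=index + 1
                )[0]
                text = pytesseract.image_to_string(image, lang=lang)
            texts.append(text)
    return texts
```

A call such as `extract_pdf_text("report.pdf")` (the file name is a placeholder) would return the text page by page. Pages that mix a thin text layer with scanned images may need a higher `min_chars` value or a different check.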
