How can I get the regular English language package for Tesseract on Alpine Linux?

I am building a Docker image based on Alpine that depends on Tesseract for OCR. The Tesseract site lists two flavors of English: eng (modern English) and enm (Middle English). However, I am having trouble getting the eng version installed on Alpine.

My Dockerfile has the following:

FROM eclipse-temurin:17-jre-alpine as tesseract-master

RUN apk update && apk add tesseract-ocr
RUN apk update && apk add tesseract-ocr-data-eng

This fails to find the eng language package. During the build, the repository is listed, and it clearly does not contain a tesseract-ocr-data-eng package.

I am able to install the enm package, but I expect it to cause recognition problems, since it is trained for Middle English.

Has anyone had success installing the eng package on Alpine?

>Solution :

If you look at the contents of one of those language packages, for example tesseract-ocr-data-enm, you will quickly see that it contains only one file:

  • /usr/share/tessdata/enm.traineddata

Source: https://pkgs.alpinelinux.org/contents?name=tesseract-ocr-data-enm&branch=v3.17&arch=aarch64

Now, working backwards, you can search for the package that contains the file /usr/share/tessdata/eng.traineddata. Unsurprisingly, it is the default package: tesseract-ocr.

Source: https://pkgs.alpinelinux.org/contents?file=eng.traineddata&branch=v3.17&arch=aarch64

So, your Dockerfile should simply be:

FROM eclipse-temurin:17-jre-alpine as tesseract-master

RUN apk add --no-cache \
      tesseract-ocr
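
If you want the build itself to confirm that the English model is present, a check along these lines could be added. This is only a sketch: it assumes the base image uses an Alpine branch packaged like v3.17 (the branch in the links above), where eng.traineddata ships inside tesseract-ocr at the path shown. On other branches the package layout may differ.

```dockerfile
FROM eclipse-temurin:17-jre-alpine AS tesseract-master

RUN apk add --no-cache \
      tesseract-ocr

# Fail the build early if the English model is not where the
# v3.17 package listing says it should be (assumed path),
# then print the languages Tesseract can see, for the build log.
RUN test -f /usr/share/tessdata/eng.traineddata \
 && tesseract --list-langs
```

If the `test -f` step fails, the build stops at that layer, which is usually easier to diagnose than an OCR error at runtime.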
