Enabling Tesseract For Ghostscript 9.53

Ghostscript 9.53 contains preliminary support for OCR devices.

It relies on the open-source Tesseract and Leptonica libraries to achieve this. Because of the size of that code, we do not currently ship Tesseract or Leptonica in the standard release build. If you wish to try out the support, you will need to build your own version of Ghostscript with it included. This page gives step-by-step instructions for doing so.

First, fetch the Tesseract source.

In general, Ghostscript uses a slightly modified version of the Tesseract source, kept on the ‘artifex’ branch in the following git repository:

https://git.ghostscript.com/?p=thirdparty-tesseract.git;a=shortlog;h=refs/heads/artifex

For the Ghostscript 9.53 release, you can download a snapshot of this source here.

Download that, and unpack it into a directory called ‘tesseract’ within the ghostpdl sources.

Next, fetch the Leptonica source.

In general, Ghostscript uses a slightly modified version of the Leptonica source, kept on the ‘artifex’ branch in the following git repository:

https://git.ghostscript.com/?p=thirdparty-leptonica.git;a=shortlog;h=refs/heads/artifex

For the Ghostscript 9.53 release, you can download a snapshot of this source here.

Download that, and unpack it into a directory called ‘leptonica’ within the ghostpdl sources.
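
As a rough illustration of the two unpacking steps above, the following Python sketch extracts a downloaded snapshot and places its contents in the expected directory. The archive file names and the ghostpdl path in the usage comment are hypothetical placeholders. The sketch also assumes that a snapshot may unpack to a single top-level directory, which may or may not match the actual archives.

```python
import shutil
import tempfile
from pathlib import Path


def unpack_snapshot(archive, ghostpdl_root, name):
    """Unpack `archive` so that its contents end up in ghostpdl_root/name."""
    target = Path(ghostpdl_root) / name
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    with tempfile.TemporaryDirectory() as tmp:
        # The archive format is detected from the file extension.
        shutil.unpack_archive(str(archive), tmp)
        entries = list(Path(tmp).iterdir())
        if len(entries) == 1 and entries[0].is_dir():
            # Assumption: one wrapper directory; it becomes the target.
            shutil.move(str(entries[0]), str(target))
        else:
            # Otherwise move every unpacked entry into a fresh target.
            target.mkdir()
            for entry in entries:
                shutil.move(str(entry), str(target / entry.name))
    return target


# Hypothetical usage; the archive names are placeholders:
# unpack_snapshot("tesseract-snapshot.tar.gz", "ghostpdl", "tesseract")
# unpack_snapshot("leptonica-snapshot.tar.gz", "ghostpdl", "leptonica")
```

The function refuses to overwrite an existing directory, so a stale copy from an earlier attempt has to be removed by hand first.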

Fetch some traineddata.

Tesseract relies on encapsulated knowledge so it can recognise particular languages and/or scripts. This knowledge comes in the form of ‘traineddata’ files. In order for Tesseract to work, it must have access to the appropriate ‘traineddata’ file for the selected language(s).
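
Before running anything, it can help to check that a data file is present for every language you intend to use. The Python sketch below assumes the usual Tesseract naming convention of one file per language code (for example eng.traineddata). The directory name in the usage comment is a hypothetical placeholder. It checks only that the files exist, not that they suit a particular engine.

```python
from pathlib import Path


def missing_traineddata(languages, tessdata_dir):
    """Return the language codes that have no <code>.traineddata file."""
    directory = Path(tessdata_dir)
    return [
        lang for lang in languages
        if not (directory / f"{lang}.traineddata").is_file()
    ]


# Hypothetical usage; "tessdata" is a placeholder directory:
# missing = missing_traineddata(["eng", "deu"], "tessdata")
# if missing:
#     print("No traineddata found for:", ", ".join(missing))
```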

To complicate matters further, Tesseract can be built with different engines. These engines work in different ways, and hence need different information in the ‘traineddata’ file. It is therefore important to match the traineddata file you have with the build of Tesseract that you are using. Currently, by default, Ghostscript uses the ‘LSTM’ engine (also known as the ‘modern’ engine). The alternative is the ‘legacy’ engine. You can select which engine is used with the -dOCREngine= flag when you call Ghostscript. Details can be found in the Ghostscript documentation, and we will not cover them further here.

Traineddata files are created by training Tesseract on a range of inputs. This is an involved and painstaking process that we will not cover here.

Fortunately, various sources of ready-made traineddata files exist on the net:

https://github.com/tesseract-ocr/tessdata_best - ‘best’ data available for the LSTM engine.