<aside> 🌐 https://ghostscript.com/r/Ghostscript-OCR-Interface

</aside>

Our first experiments with OCR in Ghostscript have centred on the Tesseract engine. We have always intended to support other engines, and to this end we have hidden all the Tesseract specifics within the ‘tessocr’ layer (implementation in base/tessocr.cpp, interface in tessocr.h).

This interface is far from set in stone. It will almost certainly change in future versions of Ghostscript, but it is simple enough that anyone wanting to integrate a new OCR engine should find it fairly straightforward.

We are amenable to modifying this interface if it makes life easier for other OCR providers.

The image data

Ghostscript renders the page to memory and passes the resulting image to the OCR layer, described by the following parameters:

Engine and Language

Ghostscript allows two pieces of configuration to be provided on the command line: OCREngine and OCRLanguage.

Within Tesseract the OCREngine value allows us to select which of the Tesseract “engines” we use. Similarly, OCRLanguage gives us a string that says which ‘traineddata’ files to feed to those engines.

Other OCR providers could use those same arguments to pass in other arbitrary data, perhaps as a comma-separated list of options if that is more appropriate.
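
As a rough illustration of that suggestion, the sketch below splits a comma-separated option string, such as one that might arrive via OCRLanguage, into key/value pairs. The type ocr_option, the function parse_ocr_options, the size limits and the example option names are all hypothetical; none of them are part of Ghostscript or the tessocr layer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_OPTS 8   /* hypothetical limit on the number of options */
#define MAX_LEN  64  /* hypothetical limit on key/value length */

/* One parsed option: "key=value", or a bare "key" with an empty value.
 * Hypothetical type, not part of Ghostscript. */
typedef struct {
    char key[MAX_LEN];
    char value[MAX_LEN];
} ocr_option;

/* Copy at most MAX_LEN-1 bytes of src into dst and NUL-terminate. */
static void copy_span(char *dst, const char *src, size_t len)
{
    if (len >= MAX_LEN)
        len = MAX_LEN - 1;   /* truncate over-long items */
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Split arg on commas into at most max_opts options.
 * Empty items (e.g. "a,,b") are skipped. Returns the number stored. */
static int parse_ocr_options(const char *arg, ocr_option *opts, int max_opts)
{
    int n = 0;
    const char *p = arg;

    assert(opts != NULL && max_opts >= 0);

    while (p != NULL && *p != '\0' && n < max_opts) {
        const char *end = strchr(p, ',');
        size_t len = end ? (size_t)(end - p) : strlen(p);
        const char *eq = memchr(p, '=', len);  /* '=' within this item only */

        if (len > 0) {
            if (eq != NULL) {
                size_t klen = (size_t)(eq - p);
                copy_span(opts[n].key, p, klen);
                copy_span(opts[n].value, eq + 1, len - klen - 1);
            } else {
                copy_span(opts[n].key, p, len);
                opts[n].value[0] = '\0';
            }
            n++;
        }
        p = end ? end + 1 : NULL;  /* advance past the comma, or stop */
    }
    return n;
}
```

Given the string "eng,model=fast,,dpi=300", this yields three options: eng with an empty value, model=fast and dpi=300, with the empty item skipped. A provider could equally choose a different separator or a stricter grammar; the point is only that a single string argument is enough to carry structured settings.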

Initialisation and Shutdown