Ghostscript OCR Interface

<aside> 🌐 https://ghostscript.com/r/Ghostscript-OCR-Interface

</aside>

Our first experiments with using OCR within Ghostscript have centred around using the Tesseract engine. We have always intended to support other engines and to this end we have hidden all the Tesseract specifics within the ‘tessocr’ layer (implementation in base/tessocr.cpp and interface in tessocr.h).

This interface is not set in stone. It will almost certainly change in future versions of Ghostscript, but it is simple enough that integrating a new OCR engine should be fairly straightforward.

We are amenable to modifying this interface if it makes life easier for other OCR providers.

The image data

Ghostscript renders the page to memory and passes the resulting image to the OCR layer, described by the following parameters:

data: A const void * pointer to the pixel data for the top left pixel.
w: The number of pixels across the image.
h: The number of pixels down the image.
bpp: The number of bits per pixel (8, 24 or 32, for grey, RGB and CMYK respectively).
raster: The number of bytes to add to the address of a pixel to reach the same pixel on the line below.
xres: x resolution (in dots per inch)
yres: y resolution (in dots per inch)
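
As an illustration of how these parameters fit together, here is a minimal sketch in C of locating a pixel in the buffer. The struct ocr_image and the helper ocr_pixel are hypothetical names invented for this example; they are not part of tessocr.h, and the integer type chosen for the resolutions is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bundle of the parameters listed above.
 * This type is NOT taken from tessocr.h; it exists only for this sketch. */
typedef struct {
    const void *data; /* pixel data for the top left pixel */
    int w;            /* pixels across */
    int h;            /* pixels down */
    int bpp;          /* bits per pixel: 8, 24 or 32 */
    int raster;       /* bytes from a pixel to the same pixel one line down */
    int xres;         /* dots per inch, horizontally (type assumed) */
    int yres;         /* dots per inch, vertically (type assumed) */
} ocr_image;

/* Return a pointer to the first byte of pixel (x, y),
 * or NULL if the coordinates fall outside the image. */
static const uint8_t *ocr_pixel(const ocr_image *img, int x, int y)
{
    if (x < 0 || y < 0 || x >= img->w || y >= img->h)
        return NULL;
    /* Step down y lines using raster, then across x pixels of bpp/8 bytes. */
    return (const uint8_t *)img->data
           + (ptrdiff_t)y * img->raster
           + (ptrdiff_t)x * (img->bpp / 8);
}
```

Note that the line stride comes from raster rather than from w * bpp / 8, since lines may be padded.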

Engine and Language

Ghostscript allows two pieces of configuration to be provided on the command line:

-dOCREngine=<integer>
-sOCRLanguage=<string>

Within Tesseract the OCREngine value allows us to select which of the Tesseract “engines” we use. Similarly, OCRLanguage gives us a string that says which ‘traineddata’ files to feed to those engines.

Other OCR providers could use those same arguments to get other arbitrary data in; perhaps a comma-separated list of options if that is more appropriate.
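
If a provider did choose a comma-separated list, splitting it could look roughly like the sketch below. The helper ocr_next_option is a hypothetical name, and the option strings shown in the usage note are invented for illustration; Ghostscript does not define any such format.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Ghostscript: walk a comma-separated
 * option string without modifying it.
 *
 * On each call, returns a pointer to the start of the next non-empty item
 * and stores its length in *len, advancing *cursor past it.
 * Returns NULL once the string is exhausted. */
static const char *ocr_next_option(const char **cursor, size_t *len)
{
    const char *p = *cursor;

    while (*p == ',')           /* skip separators and empty items */
        p++;
    if (*p == '\0') {
        *cursor = p;
        return NULL;
    }
    *len = strcspn(p, ",");     /* item runs up to the next comma or the end */
    *cursor = p + *len;
    return p;
}
```

A caller would initialise a cursor to the string received via OCRLanguage (for example a made-up value such as "eng,dpi=300") and call ocr_next_option in a loop until it returns NULL. The items are not NUL-terminated, so the returned length must be used when comparing or copying them.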

Initialisation and Shutdown