Ghostscript with Tesseract

Headlines

  • Tesseract is a free OCR library, offering some of the best results going.
  • It has two different engines within it: the 'legacy' engine, and a modern 'LSTM' (neural-net based) engine.
  • The legacy engine is trained on specific fonts, and can guess at what font something is. It's also good at identifying specific character positions, and does not rely on/gain from "context" to spot words.
  • The LSTM engine is faster (I think), uses smaller data sets, copes better with fonts it has not been trained on, and gains extra benefits from "context".
  • It uses 'traineddata' files for each language (or multiple languages that use the same script) - these are specific to the engine.
  • We can specify the engine (-dOCREngine=) and language files (-sOCRLanguage="eng") at runtime.
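For example, to pick the LSTM engine and English data explicitly at runtime (a sketch; the OCREngine values are assumed to follow Tesseract's engine-mode numbering, where 0 is legacy-only and 1 is LSTM-only, and in.pdf is a placeholder input file):

debugbin/gswin32c.exe -sDEVICE=ocr -dOCREngine=1 -sOCRLanguage="eng" -o out.txt -r200 in.pdf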

LSTM

  • There are different sets of data out there. For LSTM we have "best" and "fast". "best" ones are ~25Meg per language; "fast" ones are ~2Meg per language. A full set of "best" data for all the languages is 1.2Gig. (See the example after this list for switching between the two.)
  • I envisage an OEM having either "eng" (just english), or "latin" (all the languages that use latin script - 80Meg) built in, and maybe having others available to it as extensions (perhaps as a USB key that people can plug into their printer).
  • We have 5 devices within gs that work with OCR:
    • ocr: simple text extraction
    • hocr: "hOCR" format (XML-based text extraction with positions for each char).
    • pdfocr8: outputs PDFs as greyscale images, with overlaid invisible OCR text for cut/paste/searching
    • pdfocr24: outputs PDFs as RGB images, with overlaid invisible OCR text for cut/paste/searching
    • pdfocr32: outputs PDFs as CMYK images, with overlaid invisible OCR text for cut/paste/searching
  • Adding tesseract with inbuilt "fast" English support adds about 5.6Meg to the ARM binary size. (4.5 Meg library, 1.1 Meg compressed "eng" data).
  • OCR speeds depend on resolution and density of text. zlib.3.pdf (a typical 2-page text document) rendered at 200 dpi takes about 28 seconds on my Pi 3B+, and 7.5 seconds on my desktop PC.
  • This engine is a good choice for when we are processing entire pages at a time.
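If you keep more than one data set on disc, you can switch between them at runtime via TESSDATA_PREFIX (described in the build section below). A sketch, assuming hypothetical side-by-side directories tessdata_fast and tessdata_best, each holding an eng.traineddata:

TESSDATA_PREFIX=tessdata_fast debugbin/gswin32c.exe -sDEVICE=ocr -o out.txt -r200 in.pdf
TESSDATA_PREFIX=tessdata_best debugbin/gswin32c.exe -sDEVICE=ocr -o out.txt -r200 in.pdf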

Legacy

  • We have an experimental pdfwrite integration where, every time pdfwrite finds a char that it doesn't know about, it renders it and feeds the rendering to the OCR engine.
  • The legacy engine does better with this, presumably because it does not attempt to make use of context.
  • The English traineddata file for this engine is 22 Meg.

Building with tesseract/leptonica

All the gs changes are already on master.

All you need to do is pull in the two libraries and make sure they are on the artifex branch:

git clone MYNAME@ghostscript.com:/home/robin/repos/tesseract.git
git clone MYNAME@ghostscript.com:/home/robin/repos/leptonica.git
cd tesseract
git checkout artifex
cd ../leptonica
git checkout artifex
cd ..

Next, you need training data for the languages you want. Currently, 'eng' is used by default, but others can be selected using -sOCRLanguage="eng,ara" etc.

wget -O tesseract/eng.traineddata https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata
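If you want the whole Latin-script set mentioned in the Headlines rather than just English, a script-level file is available too (assuming the script/ subdirectory layout of the tessdata_fast repo); it should then be selectable with -sOCRLanguage="Latin":

wget -O tesseract/Latin.traineddata https://github.com/tesseract-ocr/tessdata_fast/raw/master/script/Latin.traineddata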

LSTM engine data for loads of other languages/scripts can be found here:

https://github.com/tesseract-ocr/tessdata_best

or

https://github.com/tesseract-ocr/tessdata_fast

Legacy data can be found here:

https://github.com/tesseract-ocr/tessdata

Personally, I have the legacy data downloaded as eng-legacy.traineddata, so I can choose between 'eng' and 'eng-legacy' at runtime.
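For example (assuming -dOCREngine=0 selects the legacy engine, per Tesseract's engine-mode numbering, and in.pdf as a placeholder input):

wget -O tesseract/eng-legacy.traineddata https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
debugbin/gswin32c.exe -sDEVICE=ocr -dOCREngine=0 -sOCRLanguage="eng-legacy" -o out.txt -r200 in.pdf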

Copy any language data you want built in into Resource/Tesseract/. (I put the LSTM data in ROM and load the legacy data from disc if needed, but YMMV).
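For example, to build the fast English data in:

cp tesseract/eng.traineddata Resource/Tesseract/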

Then build:

./autogen.sh
make

If you built the training data in with COMPILE_INITS (i.e. copied it into Resource/Tesseract) then you're sorted. If not, you need to set TESSDATA_PREFIX to point to where the data lives. For example, if you have the data in a "tesseract" dir, you'd do:

TESSDATA_PREFIX=tesseract debugbin/gswin32c.exe ...

By default, it assumes 'eng' for the language. You can override this using -sOCRLanguage="whatever". For example, for Arabic, you'd use:

debugbin/gswin32c.exe -sOCRLanguage="ara"

and for both English and Arabic, you'd use:

debugbin/gswin32c.exe -sOCRLanguage="eng,ara"

To get simple text extraction:

debugbin/gswin32c.exe -sDEVICE=ocr -o out.txt -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

To get HTML with hocr markup:

debugbin/gswin32c.exe -sDEVICE=hocr -o out.html -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

To get a PDF containing a greyscale rendering with transparent text overlay:

debugbin/gswin32c.exe -sDEVICE=pdfocr8 -o out.pdf -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

To get a PDF containing an RGB rendering with transparent text overlay:

debugbin/gswin32c.exe -sDEVICE=pdfocr24 -o out.pdf -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

To get a PDF containing a CMYK rendering with transparent text overlay:

debugbin/gswin32c.exe -sDEVICE=pdfocr32 -o out.pdf -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

The same parameters that can be used to control pdfimage8/24/32 can be used to control pdfocr8/24/32; see the example below.

200 dpi (fax resolution) seems a good resolution for OCR work.
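For example, assuming pdfocr8 inherits pdfimage8's -dDownScaleFactor parameter, you can render internally at 600 dpi but downscale by a factor of 3, so the images embedded in the output PDF come out at 200 dpi:

debugbin/gswin32c.exe -sDEVICE=pdfocr8 -r600 -dDownScaleFactor=3 -o out.pdf -dLastPage=1 ../MyTests/pdf_reference17.pdf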

Still to do

  • Passing changes upstream - in progress.
  • Look into NEON SIMD - done.
  • Look into minimising the leptonica build (remove unwanted read/write code) - partially done.
  • Look into minimising the tesseract memory use (avoid duplicating Pix) - done.
  • Look into maybe using floats instead of doubles so more can be done in NEON.
  • Look into maybe working with the downscaler, so we can render images at (say) 300 or 600 dpi, but only have to pass a 200 dpi image to tesseract for OCR?
  • Continue pdfwrite investigations.

-- Robin Watts - 2020-05-01
