<aside>
https://ghostscript.com/r/Ghostscript-with-Tesseract
</aside>
Headlines
- Tesseract is a free OCR library, offering some of the best results going.
- It has 2 different engines within it: the "legacy" engine, and a modern "LSTM" (neural net based) engine.
- The legacy engine is trained on specific fonts, and can guess at what font something is. It's also good at identifying specific character positions, and does not rely on/gain from "context" to spot words.
- The LSTM engine is faster (I think), uses smaller data sets, copes better with fonts it has not been trained on, and gains extra benefits from "context".
- It uses "traineddata" files for each language (or for multiple languages that use the same script). These files are specific to the engine.
- We can specify the engine (-dOCREngine=) and language files (-sOCRLanguage="eng") at runtime.
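
As a rough illustration of the runtime options above, here is a minimal Python sketch that assembles a Ghostscript command line without running it. The helper name `build_gs_ocr_command`, the file names, and the `gs` executable name are assumptions made for this example (Windows builds use a different executable name). The engine number is passed through unchanged, since these notes do not say which integer selects which engine.

```python
from typing import List, Optional


def build_gs_ocr_command(
    input_path: str,
    output_path: str,
    device: str = "ocr",
    language: str = "eng",
    engine: Optional[int] = None,
    resolution: Optional[int] = None,
) -> List[str]:
    """Build a Ghostscript argument list for an OCR run (does not execute it)."""
    cmd = [
        "gs",                             # assumed executable name
        "-dNOPAUSE",                      # do not pause between pages
        "-dBATCH",                        # exit once the input is processed
        f"-sDEVICE={device}",             # one of the OCR-capable devices
        f"-sOutputFile={output_path}",
        f"-sOCRLanguage={language}",      # traineddata file(s) to use
    ]
    if engine is not None:
        # Forwarded as-is; check the Ghostscript docs for the valid values.
        cmd.append(f"-dOCREngine={engine}")
    if resolution is not None:
        cmd.append(f"-r{resolution}")     # rendering resolution in dpi
    cmd.append(input_path)
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_gs_ocr_command("in.pdf", "out.txt", resolution=200)))
```

The list can be handed to `subprocess.run` as it stands. Building it as a list rather than a single string avoids shell quoting problems with file names.
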
LSTM
- There are different sets of data out there. For LSTM we have "best" and "fast". The "best" ones are ~25 MB per language; the "fast" ones are ~2 MB per language. A full set of "best" data for all the languages is 1.2 GB.
- I envisage an OEM having either "eng" (just English) or "latin" (all the languages that use Latin script, 80 MB) built in, and maybe having others available as extensions (perhaps on a USB key that people can plug into their printer).
- We have 5 devices within gs that work with OCR:
  - ocr: simple text extraction.
  - hocr: "hOCR" format (XML based text extraction with positions for each char).
  - pdfocr8: outputs PDFs as greyscale images, with overlaid invisible OCR text for cut/paste/searching.
  - pdfocr24: outputs PDFs as RGB images, with overlaid invisible OCR text for cut/paste/searching.
  - pdfocr32: outputs PDFs as CMYK images, with overlaid invisible OCR text for cut/paste/searching.
- Adding Tesseract with inbuilt "fast" English support adds about 5.6 MB to the ARM binary size (4.5 MB library, 1.1 MB compressed "eng" data).
- OCR speeds depend on resolution and density of text. Processing zlib.3.pdf (a typical 2-page text document) at 200 dpi takes about 28 seconds on my Pi 3B+, and 7.5 seconds on my desktop PC.
- This engine is a good choice for when we are processing entire pages at a time.
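
The five OCR devices listed above can be summarised as a small lookup table. The sketch below is only an illustration of that list: the `OcrDevice` type, the `OCR_DEVICES` table, and the `pdfocr_device_for` helper are hypothetical names invented here and are not part of Ghostscript.

```python
from typing import Dict, NamedTuple


class OcrDevice(NamedTuple):
    output: str   # kind of file the device writes
    colour: str   # colour model of the page image; "" when no image is written


# Device names as listed in these notes; descriptions paraphrase the list.
OCR_DEVICES: Dict[str, OcrDevice] = {
    "ocr": OcrDevice("plain text", ""),
    "hocr": OcrDevice("hOCR", ""),
    "pdfocr8": OcrDevice("PDF", "grey"),
    "pdfocr24": OcrDevice("PDF", "rgb"),
    "pdfocr32": OcrDevice("PDF", "cmyk"),
}


def pdfocr_device_for(colour: str) -> str:
    """Return the PDF OCR device whose page images use the given colour model."""
    wanted = colour.strip().lower()
    if wanted in ("gray", "greyscale", "grayscale"):
        wanted = "grey"  # accept common spellings of greyscale
    for name, dev in OCR_DEVICES.items():
        if dev.output == "PDF" and dev.colour == wanted:
            return name
    raise ValueError(f"no PDF OCR device for colour model {colour!r}")


if __name__ == "__main__":
    for model in ("grey", "RGB", "cmyk"):
        print(model, "->", pdfocr_device_for(model))
```

The returned name is what would be passed to Ghostscript as -sDEVICE=.
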