Ghostscript with Tesseract

Headlines

  • Tesseract is a free OCR library, offering some of the best results going.
  • It has two different engines within it: the 'legacy' engine, and a modern 'LSTM' (neural-net based) engine.
  • The legacy engine is trained on specific fonts, and can guess at what font something is. It's also good at identifying specific character positions, and does not rely on/gain from "context" to spot words.
  • The LSTM engine is faster (I think), uses smaller data sets, copes better with fonts it has not been trained on, and gains extra benefits from "context".
  • It uses 'traineddata' files for each language (or multiple languages that use the same script) - these are specific to the engine.
  • We can specify the engine (-dOCREngine=) and language files (-sOCRLanguage="eng") at runtime.
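For example, to pick the LSTM engine and English data explicitly at runtime (a sketch; the OCREngine values are assumed to follow Tesseract's engine-mode numbering, where 0 is legacy-only and 1 is LSTM-only, and in.pdf is a placeholder input file):

debugbin/gswin32c.exe -sDEVICE=ocr -dOCREngine=1 -sOCRLanguage="eng" -o out.txt -r200 in.pdf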

LSTM

  • There are different sets of data out there. For LSTM we have "best" and "fast". "best" ones are ~25Meg per language; "fast" ones are ~2Meg per language. A full set of "best" data for all the languages is 1.2Gig. (See the example after this list for switching between the two.)
  • I envisage an OEM having either "eng" (just english), or "latin" (all the languages that use latin script - 80Meg) built in, and maybe having others available to it as extensions (perhaps as a USB key that people can plug into their printer).
  • We have 5 devices within gs that work with OCR:
    • ocr: simple text extraction
    • hocr: "hOCR" format (XML-based text extraction with positions for each char).
    • pdfocr8: outputs PDFs as greyscale images, with overlaid invisible OCR text for cut/paste/searching
    • pdfocr24: outputs PDFs as RGB images, with overlaid invisible OCR text for cut/paste/searching
    • pdfocr32: outputs PDFs as CMYK images, with overlaid invisible OCR text for cut/paste/searching
  • Adding tesseract with inbuilt "fast" English support adds about 5.6Meg to the ARM binary size. (4.5 Meg library, 1.1 Meg compressed "eng" data).
  • OCR speeds depend on resolution and density of text. zlib.3.pdf (a typical 2-page text document) rendered at 200 dpi takes about 28 seconds on my Pi 3B+, and 7.5 seconds on my desktop PC.
  • This engine is a good choice for when we are processing entire pages at a time.
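If you keep more than one data set on disc, you can switch between them at runtime via TESSDATA_PREFIX (described in the build section below). A sketch, assuming hypothetical side-by-side directories tessdata_fast and tessdata_best, each holding an eng.traineddata:

TESSDATA_PREFIX=tessdata_fast debugbin/gswin32c.exe -sDEVICE=ocr -o out.txt -r200 in.pdf
TESSDATA_PREFIX=tessdata_best debugbin/gswin32c.exe -sDEVICE=ocr -o out.txt -r200 in.pdf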

Legacy

  • We have an experimental pdfwrite integration where, every time pdfwrite finds a char that it doesn't know about, it renders it and feeds the rendering to the OCR engine.
  • The legacy engine does better with this, presumably because it does not attempt to make use of context.
  • The English traineddata file for this engine is 22 Meg.

Building with tesseract/leptonica

All the gs changes are already on master.

All you need to do is pull in the two libraries and make sure they are on the artifex branch:

git clone MYNAME@ghostscript.com:/home/robin/repos/tesseract.git
git clone MYNAME@ghostscript.com:/home/robin/repos/leptonica.git
cd tesseract
git checkout artifex
cd ../leptonica
git checkout artifex
cd ..

Next, you need training data for the languages you want. Currently, 'eng' is used by default, but others can be selected using -sOCRLanguage="eng,ara" etc.

wget -O tesseract/eng.traineddata https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata
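If you want the whole Latin-script set mentioned in the Headlines rather than just English, a script-level file is available too (assuming the script/ subdirectory layout of the tessdata_fast repo); it should then be selectable with -sOCRLanguage="Latin":

wget -O tesseract/Latin.traineddata https://github.com/tesseract-ocr/tessdata_fast/raw/master/script/Latin.traineddata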

LSTM engine data for loads of other languages/scripts can be found here:

https://github.com/tesseract-ocr/tessdata_best

or

https://github.com/tesseract-ocr/tessdata_fast

Legacy data can be found here:

https://github.com/tesseract-ocr/tessdata

Personally, I have the legacy data downloaded as eng-legacy.traineddata, so I can choose between 'eng' and 'eng-legacy' at runtime.
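For example (assuming -dOCREngine=0 selects the legacy engine, per Tesseract's engine-mode numbering, and in.pdf as a placeholder input):

wget -O tesseract/eng-legacy.traineddata https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
debugbin/gswin32c.exe -sDEVICE=ocr -dOCREngine=0 -sOCRLanguage="eng-legacy" -o out.txt -r200 in.pdf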

Copy any language data you want built in into Resource/Tesseract/. (I put the LSTM data in ROM and load the legacy data from disc if needed, but YMMV).
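For example, to build the fast English data in:

cp tesseract/eng.traineddata Resource/Tesseract/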

Then build:

./autogen.sh
make

If you built the training data in with COMPILE_INITS (i.e. copied it into Resource/Tesseract) then you're sorted. If not, you need to set TESSDATA_PREFIX to point to where the data lives. For example, if you have the data in a "tesseract" dir, you'd do:

TESSDATA_PREFIX=tesseract debugbin/gswin32c.exe ...

By default, it assumes 'eng' for the language. You can override this using -sOCRLanguage="whatever". For example, for Arabic, you'd use:

debugbin/gswin32c.exe -sOCRLanguage="ara"

and for both English and Arabic, you'd use:

debugbin/gswin32c.exe -sOCRLanguage="eng,ara"

To get simple text extraction:

debugbin/gswin32c.exe -sDEVICE=ocr -o out.txt -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

To get HTML with hocr markup:

debugbin/gswin32c.exe -sDEVICE=hocr -o out.html -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

To get a PDF containing a greyscale rendering with transparent text overlay:

debugbin/gswin32c.exe -sDEVICE=pdfocr8 -o out.pdf -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

To get a PDF containing an RGB rendering with transparent text overlay:

debugbin/gswin32c.exe -sDEVICE=pdfocr24 -o out.pdf -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

To get a PDF containing a CMYK rendering with transparent text overlay:

debugbin/gswin32c.exe -sDEVICE=pdfocr32 -o out.pdf -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

The same parameters that can be used to control pdfimage8/24/32 can be used to control pdfocr8/24/32; see the example below.

200 dpi (fax resolution) seems a good resolution for OCR work.
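For example, assuming pdfocr8 inherits pdfimage8's -dDownScaleFactor parameter, you can render internally at 600 dpi but downscale by a factor of 3, so the images embedded in the output PDF come out at 200 dpi:

debugbin/gswin32c.exe -sDEVICE=pdfocr8 -r600 -dDownScaleFactor=3 -o out.pdf -dLastPage=1 ../MyTests/pdf_reference17.pdf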

Still to do

  • Passing changes upstream - in progress.
  • Look into NEON SIMD - done.
  • Look into minimising the leptonica build (remove unwanted read/write code) - partially done.
  • Look into minimising the tesseract memory use (avoid duplicating Pix) - done.
  • Look into maybe using floats instead of doubles so more can be done in NEON.
  • Look into maybe working with the downscaler, so we can render images at (say) 300 or 600 dpi, but only have to pass a 200 dpi image to tesseract for OCR?
  • Continue pdfwrite investigations.

-- Robin Watts - 2020-05-01
