Difference: GhostscriptWithTesseract (5 vs. 6)

Revision 6 2020-08-26 - RobinWatts

Line: 1 to 1
 
Ghostscript with Tesseract.

Headlines

  • Tesseract is a free OCR library, offering some of the best results going.
Changed:
<
<
  • It uses 'traineddata' files for each language (or multiple languages that use the same script).
  • There are 2 sets of data out there "best" and "fast". "best" ones are ~25Meg per language. "fast" ones are ~2Meg per language. A full set of "best" data for all the languages is 1.2Gig.
>
>
  • It has 2 different engines within it. The 'legacy' engine, and a modern 'LSTM' (Neural Net based) engine.
  • The legacy engine is trained on specific fonts, and can guess at what font something is. It's also good at identifying specific character positions, and does not rely on/gain from "context" to spot words.
  • The LSTM engine is faster (I think), uses smaller data sets, copes better with fonts it has not been trained on, and gains extra benefits from "context".
  • It uses 'traineddata' files for each language (or multiple languages that use the same script) - these are specific to the engine.
  • We can specify the engine (-dOCREngine=) and language files (-sOCRLanguage="eng") at runtime.

LSTM

  • There are different sets of data out there. For LSTM we have "best" and "fast". "best" ones are ~25Meg per language. "fast" ones are ~2Meg per language. A full set of "best" data for all the languages is 1.2Gig.
 
  • I envisage an OEM having either "eng" (just English), or "latin" (all the languages that use Latin script - 80Meg) built in, and maybe having others available to it as extensions (perhaps as a USB key that people can plug into their printer).
  • We have 5 devices within gs that work with OCR:
    • ocr: simple text extraction
Line: 16 to 23
 
    • pdfocr32: outputs PDFs as CMYK images, with overlaid invisible OCR text for cut/paste/searching
  • Adding tesseract with inbuilt "fast" English support adds about 5.6Meg to the ARM binary size. (4.5 Meg library, 1.1 Meg compressed "eng" data).
  • OCR speeds depend on resolution and density of text. zlib.3.pdf (a typical 2-page text document) rendered at 200 dpi takes about 28 seconds on my Pi 3B+, and 7.5 seconds on my desktop PC.
Added:
>
>
  • This engine is a good choice when we are processing entire pages at a time.
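
The runtime options and devices described above can be tried from the command line. What follows is a minimal sketch, not a definitive recipe: it assumes a gs binary built with Tesseract support is on the PATH, uses the standard -sDEVICE, -r and -o switches, and borrows the 200 dpi figure from the timing note. The function name gs_ocr and the GS override variable are hypothetical helpers introduced for illustration. -dOCREngine= is left out because its values are not listed here.

```shell
#!/bin/sh
# Hypothetical helper: run one of the OCR devices over an input file.
# Usage: gs_ocr DEVICE LANGUAGE INPUT OUTPUT
# GS may be overridden (e.g. GS=echo) to print the arguments instead of running gs.
gs_ocr() {
    device=$1
    lang=$2
    input=$3
    output=$4
    "${GS:-gs}" -sDEVICE="$device" -r200 -sOCRLanguage="$lang" -o "$output" "$input"
}

# Examples (not run here):
#   gs_ocr ocr eng zlib.3.pdf zlib.3.txt           # simple text extraction
#   gs_ocr pdfocr32 eng zlib.3.pdf searchable.pdf  # image PDF with invisible OCR text
```

Keeping the device and language as parameters makes it easy to compare output from different traineddata files on the same page.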

Legacy

  • We have an experimental pdfwrite integration where every time pdfwrite finds a char that it doesn't know about, it renders it, and we feed that to the OCR.
  • The legacy engine does better with this; we assume this is because it does not attempt to make use of context.
  • The English traineddata file for this engine is 22 Meg.
 

Building with tesseract/leptonica

Line: 39 to 53
 wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata -O tesseract/eng.traineddata
Changed:
<
<
There are loads of other languages here:
>
>
LSTM engine data for loads of other languages/scripts can be found here:
  https://github.com/tesseract-ocr/tessdata_best
Line: 47 to 61
  https://github.com/tesseract-ocr/tessdata_fast
Changed:
<
<
Copy any language data you want built in into Resource/Tesseract/.
>
>
Legacy data can be found here:

https://github.com/tesseract-ocr/tessdata

Personally, I have the legacy data downloaded as eng-legacy.traineddata, so I can choose between 'eng' and 'eng-legacy' at runtime.

Copy any language data you want built in into Resource/Tesseract/. (I put the LSTM data in ROM and load the legacy data from disc if needed, but YMMV).
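
One way to script the download step is sketched below. It assumes the other data repositories serve files under the same raw/master/ URL pattern as the tessdata_fast link above, which is an assumption rather than something stated here. The helper names traineddata_url and fetch_traineddata, and the WGET override, are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for fetching traineddata files.

# Print the download URL for LANG in REPO (tessdata_fast, tessdata_best or tessdata).
# Assumes every repository follows the raw/master/ layout of the tessdata_fast link.
traineddata_url() {
    printf 'https://github.com/tesseract-ocr/%s/raw/master/%s.traineddata\n' "$1" "$2"
}

# Download LANG from REPO and save it as DEST.
# WGET may be overridden (e.g. WGET=echo) for a dry run.
fetch_traineddata() {
    "${WGET:-wget}" "$(traineddata_url "$1" "$2")" -O "$3"
}

# Examples (not run here):
#   mkdir -p Resource/Tesseract
#   fetch_traineddata tessdata_fast eng Resource/Tesseract/eng.traineddata
#   fetch_traineddata tessdata eng eng-legacy.traineddata   # kept on disc, not built in
```

Saving the legacy file under a different name is what lets 'eng' and 'eng-legacy' be selected separately at runtime.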

  Then build:
Line: 113 to 133
  Passing changes upstream - in progress.
Changed:
<
<
Look into NEON simd - done - still waiting for it to be accepted upstream.
>
>
Look into NEON simd - done.
 
Changed:
<
<
Look into minimising the leptonica build (remove unwanted read/write code).
>
>
Look into minimising the leptonica build (remove unwanted read/write code) - partially done.
 
Changed:
<
<
Look into minimising the tesseract memory use (avoid duplicating Pix).
>
>
Look into minimising the tesseract memory use (avoid duplicating Pix) - done.
  Look into maybe using floats instead of doubles so more can be done in NEON.

Look into maybe working with the downscaler, so we can render images at (say) 300 or 600dpi, but only have to pass a 200dpi image to tesseract for OCR?

Added:
>
>
Continue pdfwrite investigations.
 -- Robin Watts - 2020-05-01
