Ghostscript with Tesseract.

Headlines

  • Tesseract is a free OCR library, offering some of the best results going.
  • It uses 'traineddata' files for each language (or multiple languages that use the same script).
  • There are 2 sets of data available: "best" and "fast". The "best" files are ~25Meg per language; the "fast" ones are ~2Meg per language. A full set of "best" data for all the languages is 1.2Gig.
  • I envisage an OEM having either "eng" (just English) or "latin" (all the languages that use Latin script, 80Meg) built in, and maybe having others available as extensions (perhaps on a USB key that people can plug into their printer).
  • We have 5 devices within gs that work with OCR:
    • ocr: simple text extraction.
    • hocr: "HOCR" format (XML-based text extraction with positions for each char).
    • pdfocr8: outputs PDFs as greyscale images, with overlaid invisible OCR text for cut/paste/searching.
    • pdfocr24: outputs PDFs as RGB images, with overlaid invisible OCR text for cut/paste/searching.
    • pdfocr32: outputs PDFs as CMYK images, with overlaid invisible OCR text for cut/paste/searching.
  • Adding tesseract with inbuilt "fast" English support adds about 5.6Meg to the ARM binary size. (4.5 Meg library, 1.1 Meg compressed "eng" data).
  • OCR speeds depend on resolution and density of text. zlib.3.pdf (a typical 2-page text document) rendered at 200 dpi takes about 28 seconds on my Pi 3B+, and 7.5 seconds on my desktop PC.

Building with tesseract/leptonica

First you'll need to pull in my ocr branch:

cd ghostpdl
git remote add robin MYNAME@ghostscript.com:/home/robin/repos/ghostpdl.git
git fetch robin ocr
git checkout robin/ocr

Then pull in the 2 libraries and make sure they are on the artifex branch:

git clone MYNAME@ghostscript.com:/home/robin/repos/tesseract.git
git clone MYNAME@ghostscript.com:/home/robin/repos/leptonica.git
cd tesseract
git checkout artifex
cd ../leptonica
git checkout artifex
cd ..

Next, you need training data for the languages you want - currently, only 'eng' is enabled.

wget -O tesseract/eng.traineddata https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata

There are loads of other languages here:

https://github.com/tesseract-ocr/tessdata_best

or

https://github.com/tesseract-ocr/tessdata_fast

Copy any language data you want built in into Resource/Tesseract/.
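
To fetch data for several languages in one go, something like the following could be used. This is only a sketch: the function names and arguments are made up for illustration, the "fast" URL shape is the one used above, and the "best" URL is assumed to follow the same pattern.

```shell
# Sketch: fetch trained data for several languages into one directory.
# traineddata_url and fetch_traineddata are hypothetical helper names.

# Print the download URL for one language.
# The tessdata_best URL is ASSUMED to mirror the tessdata_fast one.
traineddata_url() {
    data_set=$1   # "fast" or "best"
    lang=$2       # e.g. "eng", "ara"
    printf 'https://github.com/tesseract-ocr/tessdata_%s/raw/master/%s.traineddata\n' \
        "$data_set" "$lang"
}

# Download each requested language into the destination directory,
# creating the directory first if it does not exist.
fetch_traineddata() {
    data_set=$1
    dest=$2
    shift 2
    mkdir -p "$dest" || return 1
    for lang in "$@"; do
        wget -O "$dest/$lang.traineddata" \
            "$(traineddata_url "$data_set" "$lang")" || return 1
    done
}

# Example (not run automatically):
#   fetch_traineddata fast Resource/Tesseract eng   # data to be built in
#   fetch_traineddata best tesseract eng ara        # data found via TESSDATA_PREFIX
```

Keeping the URL construction separate from the download makes it easy to check which files would be fetched before anything touches the network.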

Then build:

./autogen.sh
make

If you built the language data in with COMPILE_INITS (i.e. copied it into Resource/Tesseract before building) then you're sorted. If not, then you need to set TESSDATA_PREFIX to point to the directory where the data lives. For example, if you have the data in a "tesseract" dir, you'd do:

TESSDATA_PREFIX=tesseract debugbin/gswin32c.exe ...

By default, it assumes 'eng' for the language. You can override this using -sOCRLanguage="whatever". For example, for Arabic, you'd use:

debugbin/gswin32c.exe -sOCRLanguage="ara"

and for both English and Arabic, you'd use:

debugbin/gswin32c.exe -sOCRLanguage="eng,ara"
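
When the data is not built in, it can be worth checking that every requested language has a matching .traineddata file before starting a run. A minimal sketch, assuming the data sits in one directory as <lang>.traineddata; both helper names are made up for illustration, and the comma-separated form is the one shown above.

```shell
# Sketch: build the -sOCRLanguage argument and check the data is present.
# ocr_language_arg and missing_traineddata are hypothetical helper names.

# Join language codes with commas: eng ara -> -sOCRLanguage=eng,ara
ocr_language_arg() {
    joined=$(IFS=,; printf '%s' "$*")
    printf '%s%s\n' '-sOCRLanguage=' "$joined"
}

# Print each language whose <lang>.traineddata file is missing from the
# data directory. Returns non-zero if anything is missing.
missing_traineddata() {
    data_dir=$1
    shift
    status=0
    for lang in "$@"; do
        if [ ! -f "$data_dir/$lang.traineddata" ]; then
            printf '%s\n' "$lang"
            status=1
        fi
    done
    return "$status"
}

# Example (not run automatically):
#   missing_traineddata tesseract eng ara &&
#       TESSDATA_PREFIX=tesseract debugbin/gswin32c.exe "$(ocr_language_arg eng ara)" ...
```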

To get simple text extraction:

debugbin/gswin32c.exe -sDEVICE=ocr -o out.txt -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

To get HTML with hocr markup:

debugbin/gswin32c.exe -sDEVICE=hocr -o out.html -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

To get a PDF containing a greyscale rendering with transparent text overlay:

debugbin/gswin32c.exe -sDEVICE=pdfocr8 -o out.pdf -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

To get a PDF containing an rgb rendering with transparent text overlay:

debugbin/gswin32c.exe -sDEVICE=pdfocr24 -o out.pdf -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

To get a PDF containing a cmyk rendering with transparent text overlay:

debugbin/gswin32c.exe -sDEVICE=pdfocr32 -o out.pdf -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf
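
The five invocations above differ only in the device name and the output extension, so they can be generated in a loop. A dry-run sketch that just prints the commands; the GS variable, the helper names and the out-<device> file naming are assumptions made for illustration.

```shell
# Sketch: print one command line per OCR device (dry run, nothing is executed).
# ocr_output_ext and print_ocr_commands are hypothetical helper names.

# Map each device to the output extension used in the examples above.
ocr_output_ext() {
    case $1 in
        ocr)                        echo txt ;;
        hocr)                       echo html ;;
        pdfocr8|pdfocr24|pdfocr32)  echo pdf ;;
        *) echo "unknown OCR device: $1" >&2; return 1 ;;
    esac
}

# Path to the Ghostscript binary; override by setting GS beforehand.
GS=${GS:-debugbin/gswin32c.exe}

# Print the command for every device, for the given input file.
print_ocr_commands() {
    input=$1
    for device in ocr hocr pdfocr8 pdfocr24 pdfocr32; do
        ext=$(ocr_output_ext "$device") || return 1
        printf '%s -sDEVICE=%s -o out-%s.%s -r200 -dLastPage=1 %s\n' \
            "$GS" "$device" "$device" "$ext" "$input"
    done
}

print_ocr_commands ../MyTests/pdf_reference17.pdf
```

Giving each device its own output name avoids the three pdfocr runs overwriting the same out.pdf.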

The parameters that control pdfimage8/24/32 can also be used to control pdfocr8/24/32.

200dpi (fax resolution) seems a good resolution for OCR work.
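
As one example of a shared parameter, the pdfimage devices accept -dDownScaleFactor, so a page can be rendered at a multiple of 200dpi and scaled back down. The sketch below only computes the flags; the helper name is made up, and whether the pdfocr devices on this branch honour the parameter, and whether the OCR step sees the downscaled image, is an assumption that has not been checked here.

```shell
# Sketch: flags for rendering at N times the target resolution and
# downscaling by N. downscale_args is a hypothetical helper name.
# ASSUMPTION: pdfocr8/24/32 accept -dDownScaleFactor like pdfimage8/24/32.
downscale_args() {
    target_dpi=$1   # resolution wanted after downscaling, e.g. 200
    factor=$2       # integer downscale factor, e.g. 3
    printf '%s%d %s%d\n' '-r' "$((target_dpi * factor))" \
        '-dDownScaleFactor=' "$factor"
}

# Example: 200dpi after downscaling by 3 means rendering at 600dpi.
downscale_args 200 3
```

The printed flags would take the place of -r200 in the pdfocr command lines above.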

Still to do

Passing changes upstream - in progress.

Look into NEON SIMD - done, but still waiting for it to be accepted upstream.

Look into minimising the leptonica build (remove unwanted read/write code).

Look into minimising the tesseract memory use (avoid duplicating Pix).

Look into maybe using floats instead of doubles so more can be done in NEON.

Look into maybe working with the downscaler, so we can render images at (say) 300 or 600dpi, but only have to pass a 200dpi image to tesseract for OCR?

-- Robin Watts - 2020-05-01

Topic revision: r3 - 2020-05-15 - RobinWatts
 
Copyright 2014 Artifex Software Inc