Ghostscript with Tesseract.

Building with tesseract/leptonica

First you'll need to pull in my ocr branch:

cd ghostpdl
git remote add robin MYNAME@ghostscript.com:/home/robin/repos/ghostpdl.git
git fetch robin ocr
git checkout robin/ocr

Then clone the two libraries and check out the artifex branch in each:

git clone MYNAME@ghostscript.com:/home/robin/repos/tesseract.git
git clone MYNAME@ghostscript.com:/home/robin/repos/leptonica.git
cd tesseract
git checkout artifex
cd ../leptonica
git checkout artifex
cd ..
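
Before going further, it can be worth confirming that both clones really are on the artifex branch. The following is a minimal sketch, not part of the build; on_branch and check_libs are hypothetical helper names, and it assumes a git new enough to support `git -C`.

```shell
# Hypothetical helper: succeed if the repository in $1 has branch $2 checked out.
on_branch() {
    [ "$(git -C "$1" symbolic-ref --short HEAD 2>/dev/null)" = "$2" ]
}

# Hypothetical helper: report the state of both library clones.
# $1 = directory holding the clones (defaults to the current directory).
check_libs() {
    for lib in tesseract leptonica; do
        if on_branch "${1:-.}/$lib" artifex; then
            echo "$lib: ok"
        else
            echo "$lib: not on artifex"
        fi
    done
}
```

Run check_libs from the ghostpdl directory after the checkouts above.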

Next, you need training data for each language you want. Currently, only 'eng' is enabled.

wget -O tesseract/eng.traineddata https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata

Training data for many other languages is available here:

https://github.com/tesseract-ocr/tessdata_best

or

https://github.com/tesseract-ocr/tessdata_fast
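
If more languages are enabled later, fetching their data could be scripted along these lines. This is only a sketch: tessdata_url and fetch_tessdata are hypothetical helpers, and it assumes every language file in both repositories follows the same URL pattern as the 'eng' file above.

```shell
# Hypothetical helper: print the download URL for a language's training data.
# $1 = language code (e.g. eng)
# $2 = repository, tessdata_best (default) or tessdata_fast
# Assumption: all files follow the raw/master/<lang>.traineddata pattern.
tessdata_url() {
    printf 'https://github.com/tesseract-ocr/%s/raw/master/%s.traineddata\n' \
        "${2:-tessdata_best}" "$1"
}

# Hypothetical helper: download one language file into the tesseract directory.
fetch_tessdata() {
    wget -O "tesseract/$1.traineddata" "$(tessdata_url "$1" "$2")"
}
```

For example, `fetch_tessdata deu tessdata_fast` would fetch the German data from the fast set.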

Then build:

./autogen.sh
make

Running

To get simple text extraction:

TESSDATA_PREFIX=tesseract debugbin/gswin32c.exe -sDEVICE=ocr -o out.txt -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf
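
The command relies on TESSDATA_PREFIX naming the directory that holds eng.traineddata, as downloaded earlier. A quick pre-flight check might look like the sketch below; check_tessdata is a hypothetical helper, not something Ghostscript or Tesseract provides.

```shell
# Hypothetical helper: confirm the training data file is where it is expected.
# $1 = data directory (defaults to $TESSDATA_PREFIX, then to "tesseract")
# $2 = language code (defaults to eng)
check_tessdata() {
    dir=${1:-${TESSDATA_PREFIX:-tesseract}}
    lang=${2:-eng}
    if [ -f "$dir/$lang.traineddata" ]; then
        echo "found $dir/$lang.traineddata"
    else
        echo "missing $dir/$lang.traineddata" >&2
        return 1
    fi
}
```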

To get HTML with hocr markup:

TESSDATA_PREFIX=tesseract debugbin/gswin32c.exe -sDEVICE=hocr -o out.html -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

To get a PDF containing a greyscale rendering with transparent text overlay:

TESSDATA_PREFIX=tesseract debugbin/gswin32c.exe -sDEVICE=pdfocr8 -o out.pdf -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

To get a PDF containing an RGB rendering with transparent text overlay:

TESSDATA_PREFIX=tesseract debugbin/gswin32c.exe -sDEVICE=pdfocr24 -o out.pdf -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

To get a PDF containing a CMYK rendering with transparent text overlay:

TESSDATA_PREFIX=tesseract debugbin/gswin32c.exe -sDEVICE=pdfocr32 -o out.pdf -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

The parameters that control pdfimage8/24/32 also control pdfocr8/24/32.
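
The five invocations above differ only in the device name and the output file, so a small wrapper can capture the pattern. This is a sketch only: ocr_output_name and run_ocr are hypothetical helpers, and the binary path, resolution and page limit are simply copied from the commands above.

```shell
# Hypothetical helper: pick the output file name used above for each device.
ocr_output_name() {
    case "$1" in
        ocr)  echo out.txt ;;
        hocr) echo out.html ;;
        pdfocr8|pdfocr24|pdfocr32) echo out.pdf ;;
        *) echo "unknown device: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: run one of the OCR devices on the first page of a file.
# $1 = device name, $2 = input file
run_ocr() {
    out=$(ocr_output_name "$1") || return 1
    TESSDATA_PREFIX=tesseract debugbin/gswin32c.exe -sDEVICE="$1" \
        -o "$out" -r200 -dLastPage=1 "$2"
}
```

For example, `run_ocr hocr ../MyTests/pdf_reference17.pdf` is equivalent to the hocr command shown earlier.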

Still to do

The Windows build detects whether leptonica/tesseract are present and only builds with them if they are. The configure system for Linux builds needs to do the same.

The devices need a parameter to select which language(s) to use.

Tesseract relies on a config.h header that is currently static, but should be configured.

Tesseract loads data from TESSDATA_PREFIX. Try to find a way to make this more gs friendly.

Try to polish our tesseract hacks (avoiding duplicating memory/writing temp files/reading them back in) so they can be passed back upstream.

Look into NEON SIMD.

Look into minimising the leptonica build (remove unwanted read/write code).

-- Robin Watts - 2020-05-01

Topic revision: r1 - 2020-05-01 - RobinWatts
 