Enabling Tesseract For Ghostscript 9.53

Ghostscript 9.53 contains preliminary support for OCR devices.

It relies on the open-source Tesseract and Leptonica libraries to achieve this. Because of the size of the code, we do not currently ship Tesseract or Leptonica in the standard release build. If you wish to try this support out, you will need to build your own version of Ghostscript with it included. This page gives step-by-step instructions for doing so.

First, fetch the Tesseract source.

In general, Ghostscript uses a slightly modified version of the Tesseract source, kept on the 'artifex' branch in the following git repository:

https://git.ghostscript.com/?p=thirdparty-tesseract.git;a=shortlog;h=refs/heads/artifex

For the Ghostscript 9.53 release, you can download a snapshot of this source here.

Download that, and unpack it into a directory called 'tesseract' within the ghostpdl sources.

Next, fetch the Leptonica source.

In general, Ghostscript uses a slightly modified version of the Leptonica source, kept on the 'artifex' branch in the following git repository:

https://git.ghostscript.com/?p=thirdparty-leptonica.git;a=shortlog;h=refs/heads/artifex

For the Ghostscript 9.53 release, you can download a snapshot of this source here.

Download that, and unpack it into a directory called 'leptonica' within the ghostpdl sources.
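The two unpacking steps above can be sketched as a small shell helper. This is only an illustration: the archive names are placeholders, and it assumes each snapshot contains a single top-level directory, which is stripped so that the sources land directly in 'tesseract' or 'leptonica'.

```sh
# Unpack a source snapshot into a named directory inside the ghostpdl tree.
# Assumption: the archive has exactly one top-level directory, which is
# removed with --strip-components=1 (supported by GNU tar and bsdtar).
unpack_snapshot() {
    archive=$1   # path to the downloaded snapshot (placeholder name)
    dest=$2      # e.g. ghostpdl/tesseract or ghostpdl/leptonica
    mkdir -p "$dest" && tar -xf "$archive" -C "$dest" --strip-components=1
}

# Example usage (file names are hypothetical):
#   unpack_snapshot tesseract-snapshot.tar.gz ghostpdl/tesseract
#   unpack_snapshot leptonica-snapshot.tar.gz ghostpdl/leptonica
```

If a snapshot turns out not to have a top-level directory, drop the --strip-components option.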

Fetch some traineddata.

Tesseract relies on encapsulated knowledge so it can recognise particular languages and/or scripts. This knowledge comes in the form of 'traineddata' files. In order for Tesseract to work, it must have access to the appropriate 'traineddata' file for the selected language(s).

To complicate matters further, Tesseract can be built with different engines. These engines work in different ways, and hence need different information in the 'traineddata' file. It is therefore important to match your traineddata file to the engine you are using. Currently, by default, Ghostscript uses the "LSTM" engine (aka the 'modern' engine). The alternative is the 'legacy' engine. You can switch engines with the -dOCREngine= flag when you call Ghostscript. Details can be found in the Ghostscript documentation, and we will not cover them further here.

Traineddata files are created by training Tesseract on a range of inputs. This is an involved and painstaking process that we will not cover here.

Fortunately, various sources on the net offer ready-made traineddata files.

By default, the Ghostscript OCR devices have OCRLanguage set to 'eng', so the system needs 'eng.traineddata' in order to run.
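As a rough illustration of how the language setting maps to a file name, the sketch below prints the traineddata file needed for a given OCRLanguage value. The helper name is made up for this example, and 'fra' (French) is used purely as a sample language code.

```sh
# Print the traineddata file name needed for a given OCRLanguage value.
# With no argument it falls back to 'eng', matching the default above.
# (traineddata_name is a hypothetical helper, not part of Ghostscript.)
traineddata_name() {
    printf '%s.traineddata\n' "${1:-eng}"
}

traineddata_name        # prints: eng.traineddata
traineddata_name fra    # prints: fra.traineddata

# Selecting another language on the command line would then look like:
#   bin/gs -sDEVICE=pdfocr8 -sOCRLanguage=fra -o out.pdf input.pdf
```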

Now, you have a choice. You can either build your traineddata file(s) into the Ghostscript executable, or you can make them available on disc.

To build them into the executable, simply create a 'Tesseract' directory within the 'Resource' directory on disc (note the capitalisation!) and store your traineddata file(s) there.
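A minimal sketch of this step, run before rebuilding, might look like the following. It assumes the 'Resource' directory meant here is the one at the top of the ghostpdl source tree; the helper name and the paths in the usage comment are placeholders.

```sh
# Copy one or more traineddata files into Resource/Tesseract so that they
# are picked up when Ghostscript is rebuilt.
# Assumption: $1 is the top of the ghostpdl source tree.
stage_traineddata() {
    gs_root=$1
    shift
    mkdir -p "$gs_root/Resource/Tesseract" || return 1
    cp "$@" "$gs_root/Resource/Tesseract/"
}

# Example usage (paths are hypothetical):
#   stage_traineddata ~/src/ghostpdl ~/Downloads/eng.traineddata
```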

If you would rather make them available on disc, you can either put them in the current directory when Ghostscript is run, or set the environment variable 'TESSDATA_PREFIX' to point to the directory in which they live.
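Before running Ghostscript, it can be handy to check that a traineddata file is actually present in one of these two places. The helper below is a hypothetical sketch: it simply lists every match and makes no claim about which location Ghostscript consults first.

```sh
# List where <lang>.traineddata can be found on disc: in the directory
# named by TESSDATA_PREFIX (if set) and in the current directory.
# Returns non-zero if the file is in neither place.
# (find_traineddata is an illustrative helper, not a Ghostscript tool.)
find_traineddata() {
    file="${1:-eng}.traineddata"
    status=1
    for dir in "${TESSDATA_PREFIX:-}" .; do
        [ -n "$dir" ] || continue
        if [ -f "$dir/$file" ]; then
            echo "$dir/$file"
            status=0
        fi
    done
    return $status
}

# Example usage (the directory is a placeholder):
#   TESSDATA_PREFIX=/opt/tessdata; export TESSDATA_PREFIX
#   find_traineddata eng || echo "eng.traineddata not found"
```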

Rebuild Ghostscript.

Do a full rebuild of Ghostscript.

On Windows, use the 'Rebuild' option from the MSVC solution.

On Unix, rerun the configure step if working from a release (or rerun autogen.sh if working from git). Then make as usual.
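The choice between the two Unix variants can be sketched as below. The helper only prints the commands rather than running them, and it guesses that a tree is a git checkout from the presence of a '.git' entry, which is an assumption of this example.

```sh
# Print the Unix rebuild commands for a ghostpdl source tree.
# Assumption: a '.git' entry marks a git checkout; anything else is
# treated as an unpacked release.
# (rebuild_steps is a hypothetical helper and executes nothing itself.)
rebuild_steps() {
    src=${1:-.}
    if [ -e "$src/.git" ]; then
        echo "./autogen.sh"
    else
        echo "./configure"
    fi
    echo "make"
}

# Example usage (path is a placeholder):
#   rebuild_steps ~/src/ghostpdl
```

Run the printed commands, in order, from the top of the source tree.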

This should leave you with a working copy of Ghostscript that supports Tesseract.

Run a test

On Windows, run:

bin\gswin32c.exe -sDEVICE=pdfocr8 -o out.pdf -r600 -dDownScaleFactor=3 zlib/zlib.3.pdf

On Unix, run:

bin/gs -sDEVICE=pdfocr8 -o out.pdf -r600 -dDownScaleFactor=3 zlib/zlib.3.pdf

This should produce an out.pdf containing the contents of zlib/zlib.3.pdf, rendered and OCRed.
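A quick sanity check on the result might look like the sketch below. It only confirms that the file is non-empty and starts with a PDF header; it says nothing about the quality of the OCR, and the helper name is invented for this example.

```sh
# Succeed if the given file is non-empty and begins with the PDF header.
# (looks_like_pdf is a hypothetical helper, not part of Ghostscript.)
looks_like_pdf() {
    [ -s "$1" ] && [ "$(head -c 5 "$1")" = "%PDF-" ]
}

# Example usage:
#   looks_like_pdf out.pdf && echo "out.pdf was written"
#
# To eyeball the recognised text, Ghostscript's txtwrite device can be
# pointed at the result (assuming the OCR text was embedded as expected):
#   bin/gs -q -sDEVICE=txtwrite -o - out.pdf
```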

Give us feedback

Please let us know how this works for you. The future of these devices will depend upon what feedback we get. Please let us know what they do well for you, what they do badly, what they don't do, but really should, etc.

-- Robin Watts - 2020-09-17

Topic revision: r2 - 2020-09-17 - RobinWatts
 