Difference: GhostscriptWithTesseract (1 vs. 7)

Revision 7 - 2020-11-26 - RobinWatts

Line: 1 to 1
 
META TOPICPARENT name="WebHome"

Ghostscript with Tesseract.

Line: 47 to 47
 cd ..
Changed:
<
<
Next, you need training data for the languages you want. 'eng' is used by default, but you can select others with -sOCRLanguage="eng,ara" etc.
>
>
Next, you need training data for the languages you want. 'eng' is used by default, but you can select others with -sOCRLanguage="eng+ara" etc.
 
wget -O tesseract/eng.traineddata https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata
Line: 92 to 92
 and for both English and Arabic, you'd use:
Changed:
<
<
debugbin/gswin32c.exe -sOCRLanguage="eng,ara"
>
>
debugbin/gswin32c.exe -sOCRLanguage="eng+ara"
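
As a rough illustration of the '+' separator, the sketch below joins language codes into a value for -sOCRLanguage. join_ocr_langs is a hypothetical helper name used only for this example; it is not something Ghostscript provides.

```shell
#!/bin/sh
# Join language codes with '+' to form the value for -sOCRLanguage.
# join_ocr_langs is a hypothetical helper, not part of Ghostscript.
join_ocr_langs() {
    result=""
    for lang in "$@"; do
        if [ -z "$result" ]; then
            result="$lang"
        else
            result="$result+$lang"
        fi
    done
    printf '%s\n' "$result"
}

# Print the option for English plus Arabic.
printf '%s\n' "-sOCRLanguage=\"$(join_ocr_langs eng ara)\""
```

The printed option can then be appended to the gs command line in place of a hand-typed value.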
 

To get simple text extraction:

Line: 131 to 131
 

Still to do

Changed:
<
<
Passing changes upstream - in progress.
>
>
Passing changes upstream - done.
  Look into NEON simd - done.

Revision 6 - 2020-08-26 - RobinWatts

Line: 1 to 1
 
META TOPICPARENT name="WebHome"

Ghostscript with Tesseract.

Headlines

  • Tesseract is a free OCR library, offering some of the best results going.
Changed:
<
<
  • It uses 'traineddata' files for each language (or multiple languages that use the same script).
  • There are 2 sets of data out there "best" and "fast". "best" ones are ~25Meg per language. "fast" ones are ~2Meg per language. A full set of "best" data for all the languages is 1.2Gig.
>
>
  • It has 2 different engines within it. The 'legacy' engine, and a modern 'LSTM' (Neural Net based) engine.
  • The legacy engine is trained on specific fonts, and can guess at what font something is. It's also good at identifying specific character positions, and does not rely on/gain from "context" to spot words.
  • The LSTM engine is faster (I think), uses smaller data sets, copes better with fonts it has not been trained on, and gains extra benefits from "context".
  • It uses 'traineddata' files for each language (or multiple languages that use the same script) - these are specific to the engine.
  • We can specify the engine (-dOCREngine=) and language files (-sOCRLanguage="eng") at runtime.

LSTM

  • There are different sets of data out there. For LSTM we have "best" and "fast". "best" ones are ~25Meg per language. "fast" ones are ~2Meg per language. A full set of "best" data for all the languages is 1.2Gig.
 
  • I envisage an OEM having either "eng" (just English), or "latin" (all the languages that use Latin script - 80Meg) built in, and maybe having others available to it as extensions (perhaps as a USB key that people can plug into their printer).
  • We have 5 devices within gs that work with ocr:
    • ocr: simple text extraction
Line: 16 to 23
 
    • pdfocr32: outputs PDFs as cmyk images, with overlaid invisible OCR text for cut/paste/searching
  • Adding tesseract with inbuilt "fast" English support adds about 5.6Meg to the ARM binary size. (4.5 Meg library, 1.1 Meg compressed "eng" data).
  • OCR speeds depend on resolution and density of text. zlib.3.pdf (a typical 2 page text document) at 200 dpi page of text takes about 28 seconds on my pi 3b+, and 7.5 seconds on my desktop PC.
Added:
>
>
  • This engine is a good choice for when we are processing entire pages at a time.

Legacy

  • We have an experimental pdfwrite integration where every time pdfwrite finds a char that it doesn't know about, it renders it, and we feed that to the OCR.
  • The legacy engine does better with this, we assume because the engine does not attempt to make use of context.
  • The English traineddata file for this engine is 22 Meg.
 

Building with tesseract/leptonica

Line: 39 to 53
 wget -O tesseract/eng.traineddata https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata
Changed:
<
<
There are loads of other languages here:
>
>
LSTM engine data for loads of other languages/scripts can be found here:
  https://github.com/tesseract-ocr/tessdata_best
Line: 47 to 61
  https://github.com/tesseract-ocr/tessdata_fast
Changed:
<
<
Copy any language data you want built in into Resource/Tesseract/.
>
>
Legacy data can be found here:

https://github.com/tesseract-ocr/tessdata

Personally, I have the legacy data downloaded as eng-legacy.traineddata, so I can choose between 'eng' and 'eng-legacy' at runtime.
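
A minimal sketch of that naming scheme follows. It only prints the wget command rather than running it. The helper name legacy_fetch_cmd is hypothetical, and the "tesseract" directory and raw/master URL layout are assumptions carried over from the other commands on this page (the branch name in particular may have changed since).

```shell
#!/bin/sh
# Hypothetical helper: print the wget command that would fetch the legacy
# data for a language under a "<lang>-legacy" name, so it cannot overwrite
# the LSTM file for the same language.
# Assumptions: the "tesseract" data directory and the raw/master URL layout
# mirror the other commands on this page.
legacy_fetch_cmd() {
    lang="$1"
    dir="${2:-tesseract}"
    printf 'wget -O %s/%s-legacy.traineddata https://github.com/tesseract-ocr/tessdata/raw/master/%s.traineddata\n' \
        "$dir" "$lang" "$lang"
}

# Show the command for English.
legacy_fetch_cmd eng
```

With the file in place, passing -sOCRLanguage="eng-legacy" instead of -sOCRLanguage="eng" would select the legacy data at runtime.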

Copy any language data you want built in into Resource/Tesseract/. (I put the LSTM data in ROM and load the legacy data from disc if needed, but YMMV).

  Then build:
Line: 113 to 133
  Passing changes upstream - in progress.
Changed:
<
<
Look into NEON simd - done - still waiting for it to be accepted upstream.
>
>
Look into NEON simd - done.
 
Changed:
<
<
Look into minimising the leptonica build (remove unwanted read/write code).
>
>
Look into minimising the leptonica build (remove unwanted read/write code) - partially done.
 
Changed:
<
<
Look into minimising the tesseract memory use (avoid duplicating Pix).
>
>
Look into minimising the tesseract memory use (avoid duplicating Pix) - done.
  Look into maybe using floats instead of doubles so more can be done in neon.

Look into maybe working with the downscaler, so we can render images at (say) 300 or 600dpi, but only have to pass a 200dpi image to tesseract for OCR?

Added:
>
>
Continue pdfwrite investigations.
 -- Robin Watts - 2020-05-01

Comments

Revision 5 - 2020-06-18 - RobinWatts

Line: 1 to 1
 
META TOPICPARENT name="WebHome"

Ghostscript with Tesseract.

Line: 33 to 33
 cd ..
Changed:
<
<
Next, you need training data for the languages you want - currently, only 'eng' is enabled.
>
>
Next, you need training data for the languages you want. 'eng' is used by default, but you can select others with -sOCRLanguage="eng,ara" etc.
 
wget -O tesseract/eng.traineddata https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata

Revision 4 - 2020-06-17 - RobinWatts

Line: 1 to 1
 
META TOPICPARENT name="WebHome"

Ghostscript with Tesseract.

Line: 19 to 19
 

Building with tesseract/leptonica

Changed:
<
<
First you'll need to pull in my ocr branch:
>
>
All the gs changes are already on master.
 
Changed:
<
<
cd ghostpdl
git remote add robin MYNAME@ghostscript.com:/home/robin/repos/ghostpdl.git
git fetch robin ocr
git checkout robin/ocr

Then pull in the 2 libraries and make sure they are on the artifex branch:

>
>
All you need to do is to pull in the 2 libraries and make sure they are on the artifex branch:
 
git clone MYNAME@ghostscript.com:/home/robin/repos/tesseract.git

Revision 3 - 2020-05-15 - RobinWatts

Line: 1 to 1
 
META TOPICPARENT name="WebHome"

Ghostscript with Tesseract.

Line: 15 to 15
 
    • pdfocr24: outputs PDFs as rgb images, with overlaid invisible OCR text for cut/paste/searching
    • pdfocr32: outputs PDFs as cmyk images, with overlaid invisible OCR text for cut/paste/searching
  • Adding tesseract with inbuilt "fast" English support adds about 5.6Meg to the ARM binary size. (4.5 Meg library, 1.1 Meg compressed "eng" data).
Changed:
<
<
  • OCR speeds depend on resolution and density of text. zlib.3.pdf (a typical 2 page text document) at 200 dpi page of text takes about 28 seconds on my pi 3b+, and 15 seconds on my desktop PC.
>
>
  • OCR speeds depend on resolution and density of text. zlib.3.pdf (a typical 2 page text document) at 200 dpi page of text takes about 28 seconds on my pi 3b+, and 7.5 seconds on my desktop PC.
 

Building with tesseract/leptonica

Revision 2 - 2020-05-15 - RobinWatts

Line: 1 to 1
 
META TOPICPARENT name="WebHome"

Ghostscript with Tesseract.

Added:
>
>

Headlines

  • Tesseract is a free OCR library, offering some of the best results going.
  • It uses 'traineddata' files for each language (or multiple languages that use the same script).
  • There are 2 sets of data out there "best" and "fast". "best" ones are ~25Meg per language. "fast" ones are ~2Meg per language. A full set of "best" data for all the languages is 1.2Gig.
  • I envisage an OEM having either "eng" (just English), or "latin" (all the languages that use Latin script - 80Meg) built in, and maybe having others available to it as extensions (perhaps as a USB key that people can plug into their printer).
  • We have 5 devices within gs that work with ocr:
    • ocr: simple text extraction
    • hocr: "HOCR" format (XML based text extraction with positions for each char).
    • pdfocr8: outputs PDFs as greyscale images, with overlaid invisible OCR text for cut/paste/searching
    • pdfocr24: outputs PDFs as rgb images, with overlaid invisible OCR text for cut/paste/searching
    • pdfocr32: outputs PDFs as cmyk images, with overlaid invisible OCR text for cut/paste/searching
  • Adding tesseract with inbuilt "fast" English support adds about 5.6Meg to the ARM binary size. (4.5 Meg library, 1.1 Meg compressed "eng" data).
  • OCR speeds depend on resolution and density of text. zlib.3.pdf (a typical 2 page text document) at 200 dpi page of text takes about 28 seconds on my pi 3b+, and 15 seconds on my desktop PC.
 

Building with tesseract/leptonica

First you'll need to pull in my ocr branch:

Line: 28 to 43
 Next, you need training data for the languages you want - currently, only 'eng' is enabled.
Changed:
<
<
wget -O tesseract/eng.traineddata https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata
>
>
wget -O tesseract/eng.traineddata https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata
 

There are loads of other languages here:

Line: 39 to 54
  https://github.com/tesseract-ocr/tessdata_fast
Added:
>
>
Copy any language data you want built in into Resource/Tesseract/.
 Then build:
Line: 46 to 63
 make
Changed:
<
<

Running

>
>

If you built the text data in with COMPILE_INITS (i.e. copied it into Resource/Tesseract) then you're sorted. If not, then you need to set TESSDATA_PREFIX to point to where the data lives. For example, if you have the data in a "tesseract" dir, you'd do:

TESSDATA_PREFIX=tesseract debugbin/gswin32c.exe ...
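
When the data is loaded from disc, it can help to confirm the files are there before starting a long run. The sketch below is one possible pre-flight check, assuming each language code maps to a <code>.traineddata file in the data directory and that multiple codes are joined with '+' (the separator used in the latest revision of this page). check_tessdata is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical pre-flight check: confirm that every language named in an
# -sOCRLanguage value (codes joined with '+') has a <code>.traineddata
# file in the given directory. Prints each missing file and returns
# non-zero if any are missing.
check_tessdata() {
    dir="$1"
    langs="$2"
    missing=0
    old_ifs="$IFS"
    IFS='+'
    for lang in $langs; do
        if [ ! -f "$dir/$lang.traineddata" ]; then
            printf 'missing: %s/%s.traineddata\n' "$dir" "$lang"
            missing=1
        fi
    done
    IFS="$old_ifs"
    return "$missing"
}

# Possible use, only launching gs when the data is present:
#   check_tessdata tesseract "eng+ara" && TESSDATA_PREFIX=tesseract debugbin/gswin32c.exe ...
```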

By default, it assumes 'eng' for the language. You can override this using -sOCRLanguage="whatever". For example, for Arabic, you'd use:

debugbin/gswin32c.exe -sOCRLanguage="ara"

and for both English and Arabic, you'd use:

debugbin/gswin32c.exe -sOCRLanguage="eng,ara"
  To get simple text extraction:
Changed:
<
<
TESSDATA_PREFIX=tesseract debugbin/gswin32c.exe -sDEVICE=ocr -o out.txt -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf
>
>
debugbin/gswin32c.exe -sDEVICE=ocr -o out.txt -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf
 

To get HTML with hocr markup:

Changed:
<
<
TESSDATA_PREFIX=tesseract debugbin/gswin32c.exe -sDEVICE=hocr -o out.html -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf
>
>
debugbin/gswin32c.exe -sDEVICE=hocr -o out.html -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf
 

To get a PDF containing a greyscale rendering with transparent text overlay:

Changed:
<
<
TESSDATA_PREFIX=tesseract debugbin/gswin32c.exe -sDEVICE=pdfocr8 -o out.pdf -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf
>
>
debugbin/gswin32c.exe -sDEVICE=pdfocr8 -o out.pdf -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf
 

To get a PDF containing an rgb rendering with transparent text overlay:

Changed:
<
<
TESSDATA_PREFIX=tesseract debugbin/gswin32c.exe -sDEVICE=pdfocr24 -o out.pdf -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf
>
>
debugbin/gswin32c.exe -sDEVICE=pdfocr24 -o out.pdf -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf
 

To get a PDF containing a cmyk rendering with transparent text overlay:

Changed:
<
<
TESSDATA_PREFIX=tesseract debugbin/gswin32c.exe -sDEVICE=pdfocr32 -o out.pdf -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf
>
>
debugbin/gswin32c.exe -sDEVICE=pdfocr32 -o out.pdf -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf
 

The same parameters that control pdfimage8/24/32 also control pdfocr8/24/32.
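
The five invocations above differ only in the device name and the output extension. As a sketch, the loop below prints (rather than runs) one command per device. ocr_output_ext is a hypothetical helper, and the out-<device> file naming is an assumption made here so that the three PDF outputs do not overwrite each other.

```shell
#!/bin/sh
# Map each OCR device named on this page to the output extension used in
# the example commands. ocr_output_ext is a hypothetical helper.
ocr_output_ext() {
    case "$1" in
        ocr)                        echo txt ;;
        hocr)                       echo html ;;
        pdfocr8|pdfocr24|pdfocr32)  echo pdf ;;
        *)  echo "unknown OCR device: $1" >&2; return 1 ;;
    esac
}

# Print one command per device. GS and INPUT default to the paths used in
# the examples on this page; override them for your own build and test file.
GS="${GS:-debugbin/gswin32c.exe}"
INPUT="${INPUT:-../MyTests/pdf_reference17.pdf}"
for dev in ocr hocr pdfocr8 pdfocr24 pdfocr32; do
    ext=$(ocr_output_ext "$dev") || exit 1
    echo "$GS -sDEVICE=$dev -o out-$dev.$ext -r200 -dLastPage=1 $INPUT"
done
```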

Changed:
<
<

Still to do

>
>
200dpi (fax resolution) seems a good resolution for OCR work.
 
Changed:
<
<
The windows build spots leptonica/tesseract being there and only builds with them if they exist. I need to do the same kinda thing in the configure system for Linux builds.
>
>

Still to do

 
Changed:
<
<
We need to offer the chance to pass in a param to the devices to set what language(s) to use.
>
>
Passing changes upstream - in progress.
 
Changed:
<
<
Tesseract relies on a config.h header that is currently static, but should be configured.
>
>
Look into NEON simd - done - still waiting for it to be accepted upstream.
 
Changed:
<
<
Tesseract loads data from TESSDATA_PREFIX. Try to find a way to make this more gs friendly.
>
>
Look into minimising the leptonica build (remove unwanted read/write code).
 
Changed:
<
<
Try to polish our tesseract hacks (avoiding duplicating memory/writing temp files/reading them back in) so they can be passed back upstream.
>
>
Look into minimising the tesseract memory use (avoid duplicating Pix).
 
Changed:
<
<
Look into NEON simd.
>
>
Look into maybe using floats instead of doubles so more can be done in neon.
 
Changed:
<
<
Look into minimising the leptonica build (remove unwanted read/write code).
>
>
Look into maybe working with the downscaler, so we can render images at (say) 300 or 600dpi, but only have to pass a 200dpi image to tesseract for OCR?
  -- Robin Watts - 2020-05-01

Revision 1 - 2020-05-01 - RobinWatts

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="WebHome"

Ghostscript with Tesseract.

Building with tesseract/leptonica

First you'll need to pull in my ocr branch:

cd ghostpdl
git remote add robin MYNAME@ghostscript.com:/home/robin/repos/ghostpdl.git
git fetch robin ocr
git checkout robin/ocr

Then pull in the 2 libraries and make sure they are on the artifex branch:

git clone MYNAME@ghostscript.com:/home/robin/repos/tesseract.git
git clone MYNAME@ghostscript.com:/home/robin/repos/leptonica.git
cd tesseract
git checkout artifex
cd ../leptonica
git checkout artifex
cd ..

Next, you need training data for the languages you want - currently, only 'eng' is enabled.

wget -O tesseract/eng.traineddata https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata

There are loads of other languages here:

https://github.com/tesseract-ocr/tessdata_best

or

https://github.com/tesseract-ocr/tessdata_fast

Then build:

./autogen.sh
make

Running

To get simple text extraction:

TESSDATA_PREFIX=tesseract debugbin/gswin32c.exe -sDEVICE=ocr -o out.txt -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

To get HTML with hocr markup:

TESSDATA_PREFIX=tesseract debugbin/gswin32c.exe -sDEVICE=hocr -o out.html -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

To get a PDF containing a greyscale rendering with transparent text overlay:

TESSDATA_PREFIX=tesseract debugbin/gswin32c.exe -sDEVICE=pdfocr8 -o out.pdf -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

To get a PDF containing an rgb rendering with transparent text overlay:

TESSDATA_PREFIX=tesseract debugbin/gswin32c.exe -sDEVICE=pdfocr24 -o out.pdf -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

To get a PDF containing a cmyk rendering with transparent text overlay:

TESSDATA_PREFIX=tesseract debugbin/gswin32c.exe -sDEVICE=pdfocr32 -o out.pdf -r200 -dLastPage=1 ../MyTests/pdf_reference17.pdf

The same parameters that control pdfimage8/24/32 also control pdfocr8/24/32.

Still to do

The windows build spots leptonica/tesseract being there and only builds with them if they exist. I need to do the same kinda thing in the configure system for Linux builds.

We need to offer the chance to pass in a param to the devices to set what language(s) to use.

Tesseract relies on a config.h header that is currently static, but should be configured.

Tesseract loads data from TESSDATA_PREFIX. Try to find a way to make this more gs friendly.

Try to polish our tesseract hacks (avoiding duplicating memory/writing temp files/reading them back in) so they can be passed back upstream.

Look into NEON simd.

Look into minimising the leptonica build (remove unwanted read/write code).
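
For the first item in this list (only building with the libraries when they are present), the kind of check involved might look like the sketch below. This is not the actual Windows or configure logic; have_ocr_libs is a hypothetical name, and the directory names are assumed to be the defaults produced by the git clone commands above.

```shell
#!/bin/sh
# Hypothetical check: succeed only if both library checkouts are present
# under the given source root (default: current directory).
have_ocr_libs() {
    root="${1:-.}"
    [ -d "$root/tesseract" ] && [ -d "$root/leptonica" ]
}

# Report what a build script could decide based on the check.
if have_ocr_libs .; then
    echo "OCR devices: enabled"
else
    echo "OCR devices: disabled (tesseract/leptonica not found)"
fi
```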

-- Robin Watts - 2020-05-01
