GhostscriptOCRInterface

Our first experiments with using OCR within Ghostscript have centred around using the Tesseract engine. We have always intended to support other engines and to this end we have hidden all the Tesseract specifics within the 'tessocr' layer (implementation in base/tessocr.cpp and interface in tessocr.h).

This interface is far from cast in stone. It will unquestionably mutate slightly in future versions of Ghostscript, but it's simple enough that if people want to try integrating new OCR engines it should be fairly straightforward.

We are amenable to modifying this interface if it makes life easier for other OCR providers.

The image data

Ghostscript will render the page to memory and pass that image across, described by the following parameters:

  • data: A const void * pointer to the pixel data for the top left pixel.
  • w: The number of pixels across the image.
  • h: The number of pixels down the image.
  • bpp: The number of bits per pixel (8, 24 or 32, for grey, RGB and CMYK respectively).
  • raster: The number of bytes to be added to the address of storage for a pixel to get to the storage for the same pixel on the line below.
  • xres: x resolution (in dots per inch)
  • yres: y resolution (in dots per inch)
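
As a minimal sketch of how these parameters fit together, the helper below finds the storage for the pixel at column x, row y. The name pixel_at is hypothetical and is not part of tessocr.h; it assumes bpp is one of the three values listed above, so every pixel starts on a byte boundary.

```c
#include <stddef.h>

/* Hypothetical helper (not part of tessocr.h): return a pointer to the
 * first byte of the pixel at column x, row y.
 * Assumes bpp is 8, 24 or 32, so each pixel occupies bpp/8 whole bytes. */
static const unsigned char *
pixel_at(const void *data, int raster, int bpp, int x, int y)
{
    /* 'raster' is the byte distance from one line to the line below. */
    const unsigned char *row = (const unsigned char *)data + (ptrdiff_t)y * raster;

    return row + (ptrdiff_t)x * (bpp / 8);
}
```

Note that raster may be larger than w * bpp / 8 because of padding at the end of each line, which is why the row offset is computed from raster rather than from w.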

Engine and Language

Ghostscript allows two pieces of configuration to be provided on the command line:

  • -dOCREngine=<integer>
  • -sOCRLanguage=<string>

Within Tesseract the OCREngine value allows us to select which of the Tesseract "engines" we use. Similarly, OCRLanguage gives us a string that says which 'traineddata' files to feed to those engines.

Other OCR providers could use those same arguments to get other arbitrary data in; perhaps a comma-separated list of options if that is more appropriate.
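
If a provider did choose to pack several options into the OCRLanguage string, splitting it could look something like the sketch below. The helper get_option and the example option strings are purely hypothetical; nothing in Ghostscript defines such a format.

```c
#include <string.h>

/* Hypothetical helper: copy the index'th comma-separated field of 'opts'
 * into 'out' (always null-terminated on success).
 * Returns 0 on success, -1 if there is no such field or it does not fit. */
static int
get_option(const char *opts, int index, char *out, size_t outlen)
{
    const char *start = opts;
    const char *end;
    size_t len;

    while (index-- > 0) {
        start = strchr(start, ',');
        if (start == NULL)
            return -1;      /* fewer fields than requested */
        start++;            /* step over the comma */
    }
    end = strchr(start, ',');
    len = end ? (size_t)(end - start) : strlen(start);
    if (len + 1 > outlen)
        return -1;          /* field (plus terminator) does not fit */
    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}
```

A provider might then call get_option(language, 0, buf, sizeof buf) inside ocr_init_api to pick out its first option.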

Initialisation and Shutdown

Regardless of the device used, the first call will be to:

int ocr_init_api(gs_memory_t *mem, const char *language, int engine, void **state);

  • gs_memory_t *mem: This 'mem' pointer should be used to allocate any memory that may be required.
  • const char *language: A null-terminated string, containing the value of OCRLanguage on startup ('eng' by default, but maybe we should move the default below the interface in future).
  • int engine: An integer, containing the value of OCREngine on startup (0 by default).
  • void **state: A pointer to somewhere to store a void * pointer.

This function is called to initialise an OCR instance. It is expected that the OCR provider will allocate some state, and store a pointer to it in *state. On success, return 0, on failure return a negative number.

Bookending this, the final call will be to:

void ocr_fin_api(gs_memory_t *mem, void *state);

  • gs_memory_t *mem: This 'mem' pointer should be used to free any memory that has been allocated during the run.
  • void *state: The state pointer returned in *state from the ocr_init_api call.

This function is called to close down an OCR instance. The OCR provider should free any memory/release any resources it may be using.
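
A skeleton for a new provider might start out like the sketch below. The my_ocr_state structure is hypothetical, the gs_memory_t typedef is only a stand-in so that the fragment compiles outside the Ghostscript tree, and malloc/free stand in for allocations that a real provider would make through the supplied 'mem' pointer.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in so the sketch compiles on its own; inside Ghostscript the real
 * gs_memory_t comes from the Ghostscript headers. */
typedef struct gs_memory_s gs_memory_t;

/* Hypothetical provider state. */
typedef struct {
    gs_memory_t *mem;     /* remembered for later allocations */
    char language[64];    /* copy of OCRLanguage */
    int engine;           /* copy of OCREngine */
} my_ocr_state;

int ocr_init_api(gs_memory_t *mem, const char *language, int engine, void **state)
{
    my_ocr_state *st;

    if (state == NULL || language == NULL ||
        strlen(language) >= sizeof(st->language))
        return -1;
    /* Assumption: malloc stands in for an allocation made through 'mem'. */
    st = (my_ocr_state *)malloc(sizeof(*st));
    if (st == NULL)
        return -1;
    st->mem = mem;
    strcpy(st->language, language);   /* length checked above */
    st->engine = engine;
    *state = st;
    return 0;
}

void ocr_fin_api(gs_memory_t *mem, void *state)
{
    (void)mem;   /* a real provider would free through 'mem' */
    free(state);
}
```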

Operation with the pdfocr{8,24,32} devices.

These devices render pages to images (grey, RGB and CMYK respectively) and then call the OCR provider to perform an OCR step. Both the images and the text recovered from this are wrapped into pages within an output PDF file.

Once Ghostscript has rendered a page, it will then pass it to the OCR provider to recover the text, by calling:

int ocr_recognise(void *state, int w, int h, void *data,
                  int xres, int yres,
                  int (*callback)(void *, const char *, const int *, const int  *, const int *, int),
                  void *secret);

  • void *state: The value returned in *state from the ocr_init_api call.
  • int w, int h, void *data, int xres, int yres: As described above.
  • raster is unspecified, but is assumed to be (w+3)&~3 (i.e. w rounded up to the next multiple of 4).
  • bpp is unspecified, but is assumed to be 8.
  • callback: A callback function that should be called back with details of the text found.
  • void *secret: An opaque pointer that should be parroted back to the caller in the callback function.
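
To make the raster assumption concrete, here is the same arithmetic as a tiny function. The name assumed_raster is hypothetical; only the expression (w+3)&~3 comes from the description above.

```c
/* Hypothetical helper: the raster ocr_recognise assumes for an 8bpp image
 * that is w pixels wide, i.e. w rounded up to the next multiple of 4. */
static int
assumed_raster(int w)
{
    return (w + 3) & ~3;
}
```

For example, a 101 pixel wide page would be read with a raster of 104 bytes, leaving 3 bytes of padding at the end of each line, while a 100 pixel wide page would have no padding at all.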

The callback function is of the form:

int callback(void *secret, const char *text, const int *line_bbox, const int *word_bbox, const int *char_bbox, int pointsize);

  • void *secret: The secret value passed into ocr_recognise - to enable the caller to find its own state.
  • const char *text: A null-terminated series of utf-8 bytes representing a group of unicode characters (typically from a single glyph).
  • const int *line_bbox: 4 ints representing left/top/bottom/right of the box of the line containing the current word/glyph.
  • const int *word_bbox: 4 ints representing left/top/bottom/right of the box of the word containing the current glyph.
  • const int *char_bbox: 4 ints representing left/top/bottom/right of the box containing the current glyph.
  • int pointsize: The point size of the text.

(Tesseract's LSTM engine is very poor at returning the char_bbox, so, currently, we do not rely upon this data. We would love to get accurate data here to work with.)

The callback will return a non-negative number (typically zero) upon success, or a negative number on failure. On failure, the OCR provider should abandon the OCR pass.
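
The sketch below shows both sides of this contract under some assumptions: glyph_result is a hypothetical record of what an engine found, report_glyphs is a hypothetical provider-side loop that stops as soon as the callback fails, and collect_text is a hypothetical caller-side callback that uses 'secret' to find its own buffer. None of these names exist in Ghostscript.

```c
#include <string.h>

typedef int (*ocr_callback)(void *, const char *, const int *,
                            const int *, const int *, int);

/* Hypothetical record of one glyph found by some OCR engine. */
typedef struct {
    const char *utf8;
    int line_bbox[4], word_bbox[4], char_bbox[4];
    int pointsize;
} glyph_result;

/* Provider side: report each glyph, abandoning the pass on failure. */
static int
report_glyphs(const glyph_result *g, int n, ocr_callback callback, void *secret)
{
    int i, code;

    for (i = 0; i < n; i++) {
        code = callback(secret, g[i].utf8, g[i].line_bbox,
                        g[i].word_bbox, g[i].char_bbox, g[i].pointsize);
        if (code < 0)
            return code;   /* the caller asked us to stop */
    }
    return 0;
}

/* Caller side: hypothetical state reached through 'secret'. */
typedef struct { char text[32]; } collector;

static int
collect_text(void *secret, const char *text, const int *line_bbox,
             const int *word_bbox, const int *char_bbox, int pointsize)
{
    collector *c = (collector *)secret;

    (void)line_bbox; (void)word_bbox; (void)char_bbox; (void)pointsize;
    if (strlen(c->text) + strlen(text) >= sizeof(c->text))
        return -1;         /* out of room: signal failure */
    strcat(c->text, text);
    return 0;
}
```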

Operation with the pdfwrite device.

The pdfwrite device does not render the entire page. Instead, it collects together smaller bitmaps containing (typically) a word (or maybe a line fragment, or a word fragment, or, worst case, a single glyph). It then passes these bitmaps in to be OCRd. It then remembers which unicode value came back with which glyph so it can avoid repeatedly OCRing for the same glyph.

This uses the ocr_init_api and ocr_fin_api functions as described above, but will call the following function for each bitmap fragment:

int ocr_bitmap_to_unicodes(void *state,
    const void *data, int data_x,
    int w, int h, int raster, int xres, int yres,
    int *unicode, int *char_count);

  • void *state: The value returned in *state from the ocr_init_api function.
  • const void *data, int data_x, int w, int h, int raster, int xres, int yres: The bitmap fragment to be OCRd.
  • int *unicode: A buffer for unicode values (preallocated by the caller, *char_count entries of space).
  • int *char_count: On entry, the maximum number of chars to return. On exit, the number of chars returned.

Note that this takes a bitmap; each pixel is represented by a bit, unlike the other functions which have pixels represented by 1 (or more) bytes. The bitmap fills from most significant bit downwards in each byte. Consequently, data_x is used to define how many bits should be skipped on the left-hand edge of the first byte.
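
As a sketch of that bit layout, the helper below reads a single pixel from such a bitmap. The name bitmap_pixel is hypothetical; the only assumptions are the ones stated above (most significant bit first, data_x bits skipped at the left of each row).

```c
#include <stddef.h>

/* Hypothetical helper: return the bit (0 or 1) stored for the pixel at
 * column x, row y of a 1 bit per pixel bitmap. */
static int
bitmap_pixel(const void *data, int data_x, int raster, int x, int y)
{
    const unsigned char *row = (const unsigned char *)data + (ptrdiff_t)y * raster;
    int bit = data_x + x;   /* bit index from the start of the row */

    /* Bit 0 of the row is the most significant bit of the first byte. */
    return (row[bit >> 3] >> (7 - (bit & 7))) & 1;
}
```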

In the Tesseract implementation, we have to copy these bitmaps into "bytemaps" before feeding them into the engine. The code for this can be seen in ocr_set_bitmap and may be useful for other implementers. Please be aware that Tesseract (and Leptonica) require a 'strange' ordering of bytes, whereby pixels are filled in 32-bit ints from the lowest upwards. This means that the ocr_set_bitmap routine contains some "^ 3" that will probably not be required for any other integration!
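
For an engine that wants plain bytes, the expansion could be sketched as below. This is not the ocr_set_bitmap code: the name bitmap_to_bytemap is hypothetical, the Tesseract-specific "^ 3" index adjustment is deliberately left out, and the polarity (a set bit becomes 0x00, a clear bit becomes 0xff) is an assumption that a real integration would need to check.

```c
#include <stddef.h>

/* Hypothetical helper: expand a 1 bit per pixel bitmap into one byte per
 * pixel. 'out' must hold at least h * out_raster bytes, out_raster >= w.
 * Assumed polarity: set bit -> 0x00, clear bit -> 0xff. */
static void
bitmap_to_bytemap(const void *data, int data_x, int raster,
                  int w, int h, unsigned char *out, int out_raster)
{
    int x, y;

    for (y = 0; y < h; y++) {
        const unsigned char *in = (const unsigned char *)data + (ptrdiff_t)y * raster;
        unsigned char *o = out + (ptrdiff_t)y * out_raster;

        for (x = 0; x < w; x++) {
            int bit = data_x + x;   /* skip data_x bits at the left edge */
            int set = (in[bit >> 3] >> (7 - (bit & 7))) & 1;

            o[x] = set ? 0x00 : 0xff;
        }
    }
}
```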

The OCR provider should fill in unicode[i] for 0 ≤ i < n, where n is the number of unicode characters found (the value returned in *char_count). I believe we assume these will come left to right.

Return 0 for success, negative for error.
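
The in/out handling of *char_count could be sketched as follows. The helper return_unicodes is hypothetical, and silently truncating when the engine finds more characters than the caller has room for is an assumption; the interface description does not say what should happen in that case.

```c
#include <stddef.h>

/* Hypothetical helper: copy the code points an engine found into the
 * caller's buffer, respecting the capacity passed in *char_count, and
 * write back the number actually returned. */
static int
return_unicodes(const int *found, int nfound, int *unicode, int *char_count)
{
    int i, max;

    if (unicode == NULL || char_count == NULL || *char_count < 0)
        return -1;
    max = *char_count;      /* on entry: space available */
    if (nfound > max)
        nfound = max;       /* assumption: truncate rather than fail */
    for (i = 0; i < nfound; i++)
        unicode[i] = found[i];
    *char_count = nfound;   /* on exit: number returned */
    return 0;
}
```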

In future, we may modify this so that the allocation for the unicode block moves into the interface.

Operation with the ocr device.

The ocr device is a very simple one that renders pages to 8bpp greyscale, OCRs them, and outputs a utf-8 stream of the characters detected.

This uses the ocr_init_api and ocr_fin_api functions as described above, but will call the following function for each page:

int ocr_image_to_ocr(void *state,
                     int w, int h, int bpp, int raster,
                     int xres, int yres, void *data, int restore,
                     char **out);

  • void *state: The value returned in *state from the ocr_init_api function.
  • void *data, int w, int h, int bpp, int raster, int xres, int yres: The image fragment to be OCRd.
  • int restore: A hacky value to control whether the image data must remain unchanged.
  • char **out: A place to return the null-terminated results (a stream of utf-8 encoded characters).

The restore flag is a nasty hack. Ghostscript generates images in the format described above. Tesseract requires a strange byte order packing of bytes into ints. To avoid having to allocate a new block to make this alternate packing, we permute the data within the block passed from Ghostscript. If the restore flag is true, then the caller needs the data back in the original format after the call completes - so we must permute it back again (i.e. 'restore' it). If the flag is false, then the data can be left in its permuted form. Currently, this is always called with restore == false.
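
The idea can be sketched as below. The exact permutation is an assumption here: the sketch reverses the bytes within each group of 4 (so the byte at offset i moves to offset i ^ 3), inferred from the "^ 3" mentioned for ocr_set_bitmap, and the real code in tessocr.cpp may differ. Both function names are hypothetical. The useful property is that the permutation is its own inverse, so restoring is just a second application.

```c
#include <stddef.h>

/* Assumed permutation: reverse the byte order within each complete group
 * of 4 bytes, in place. Applying it twice gives back the original data. */
static void
permute_words(unsigned char *data, size_t nbytes)
{
    size_t i;

    for (i = 0; i + 3 < nbytes; i += 4) {
        unsigned char t;

        t = data[i];     data[i]     = data[i + 3]; data[i + 3] = t;
        t = data[i + 1]; data[i + 1] = data[i + 2]; data[i + 2] = t;
    }
}

/* Hypothetical outline of a call that honours the restore flag. */
static void
ocr_in_place(unsigned char *data, size_t nbytes, int restore)
{
    permute_words(data, nbytes);
    /* ... the engine would read the permuted data here ... */
    if (restore)
        permute_words(data, nbytes);   /* hand the block back unchanged */
}
```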

The caller is expected to free the block given by *out on return. It will have been allocated using the standard Ghostscript allocators using the gs_memory_t *mem supplied at init time. Allocating blocks on one side of an API and freeing them on the other is nasty, and so we may change this in future.
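
The ownership hand-over might be sketched like this. The helper return_text is hypothetical, and malloc stands in for the Ghostscript allocator that a real provider would reach through the gs_memory_t pointer saved at init time; the caller must release the block with the matching free routine.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical provider-side helper: hand back a null-terminated copy of
 * 'text' in *out. Assumption: malloc stands in for an allocation made
 * through the gs_memory_t pointer supplied at init time. */
static int
return_text(const char *text, char **out)
{
    size_t len = strlen(text) + 1;   /* include the terminator */
    char *copy = (char *)malloc(len);

    if (copy == NULL) {
        *out = NULL;
        return -1;
    }
    memcpy(copy, text, len);
    *out = copy;                     /* ownership passes to the caller */
    return 0;
}
```

On the calling side, the block is used and then released: check that the return value is not negative, consume the string in *out, and free it with the deallocator that matches the provider's allocator.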

Return 0 for success, negative for error.

Operation with the hocr device.

The hocr device is a very simple one that renders pages to 8bpp greyscale, OCRs them, and outputs an hocr format stream.

This uses the ocr_init_api and ocr_fin_api functions as described above, but will call the following function for each page:

int ocr_image_to_hocr(void *state,
                      int w, int h, int bpp, int raster,
                      int xres, int yres, void *data, int restore,
                      int pagecount, char **out);

  • void *state: The value returned in *state from the ocr_init_api function.
  • void *data, int w, int h, int bpp, int raster, int xres, int yres: The image fragment to be OCRd.
  • int restore: A hacky value to control whether the image data must remain unchanged.
  • int pagecount: The number of pages OCRd in this document so far.
  • char **out: A place to return the null-terminated results (a stream of hocr format data).

The restore flag is a nasty hack. Ghostscript generates images in the format described above. Tesseract requires a strange byte order packing of bytes into ints. To avoid having to allocate a new block to make this alternate packing, we permute the data within the block passed from Ghostscript. If the restore flag is true, then the caller needs the data back in the original format after the call completes - so we must permute it back again (i.e. 'restore' it). If the flag is false, then the data can be left in its permuted form. Currently, this is always called with restore == false.

The caller is expected to free the block given by *out on return. It will have been allocated using the standard Ghostscript allocators using the gs_memory_t *mem supplied at init time. Allocating blocks on one side of an API and freeing them on the other is nasty, and so we may change this in future.

Return 0 for success, negative for error.

-- Robin Watts - 2021-03-09

Topic revision: r3 - 2021-03-10 - RobinWatts