Auto-generated C++ and Python APIs for mupdf.

Status

As of 2021-2-4:

Customer page:

C++

  • We generate C++ wrapper functions for most fz_ and pdf_ functions. These wrappers convert fz_ exceptions into C++ exceptions, and use auto-generated per-thread fz_context's.
  • We generate C++ class wrappers for most fz_ and pdf_ structs.
  • We auto-detect fz_*() and pdf_*() fns suitable for wrapping as constructors, methods or static methods.
  • Some generated classes have auto-generated support for iteration.
  • We add various custom methods/constructors.
  • Wrapper class constructors and methods provide access to 1270 fz_*() and pdf_*() fns, out of a total of 1513 wrapped fz_*() and pdf_*() functions. Most of the omitted functions don't take struct args, e.g. fz_strlcpy().
  • The C++ API is built by mupdf:scripts/mupdfwrap.py. It requires clang-6 or clang-7, and python-clang.
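
As a rough illustration of the iteration support, the following Python sketch walks a tree of outline-like entries depth-first and reports the depth of each entry. The Node class and iter_with_depth() are hypothetical stand-ins, not the generated API; the next/down links mirror the fields of the C fz_outline struct, and the generated iterators are only assumed to work along similar lines.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for an outline entry with fz_outline-style links."""
    title: str
    next: Optional['Node'] = None  # following sibling, if any
    down: Optional['Node'] = None  # first child, if any


def iter_with_depth(node: Optional[Node], depth: int = 0) -> Iterator[Tuple[int, Node]]:
    """Yield (depth, node) pairs in depth-first order."""
    while node is not None:
        yield depth, node
        if node.down is not None:
            # Children are one level deeper than their parent.
            yield from iter_with_depth(node.down, depth + 1)
        node = node.next
```

A caller might then indent each title by its depth, e.g. `for depth, node in iter_with_depth(root): print(' ' * 4 * depth + node.title)`.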

Python

  • Python API is generated by running SWIG on the C++ API's header files.
  • Python API is enough to allow implementation of mutool in Python - see mupdf:scripts/mutool.py and mupdf:scripts/mutool_draw.py.
  • Building the Python API requires swig-3 or swig-4.

General

  • We work on nuc1 and peeved and jules-laptop.
  • We require:
    • python-clang (version 6 or 7)
    • python3-dev (version 3.6 or later)
    • swig (version 3 or 4)

Comments

  • We use clang to extract doxygen-style comments, and propagate them into generated header files.
  • If swig is version 4+, we tell it to propagate comments into the generated mupdf.py.

Here are Doxygen html representations of the mupdf C API and the generated mupdf C++ API:

And pydoc html representation of the generated mupdf.py API:

mutool.py

mupdf:scripts/mutool*.py are an incomplete Python re-implementation of the mutool application.

Files

Auto-generated C++ headers and implementation files, plus test outputs (.html files have syntax-colouring):

Information about fz_*() and pdf_*() fns that are not in the class-based API:

These were generated by the mupdfwrap.py programme, which also runs g++ and SWIG to generate a Python module that gives a Python API:

The generated Python module is tested by the (rather hacky) test_mupdfcpp_swig() function in mupdfwrap.py. For convenience, this function and its output can be viewed in https://ghostscript.com/~julian/mupdf/platform/python.

Integration with mupdf git.

    mupdf/
        build/
            shared-release/
                libmupdf.so [generated file]
                libmupdfcpp.so [generated file, implements C++ API]
                mupdf.py [generated file, implements Python API]
                _mupdf.so [generated file, implements Python API internals]
            shared-debug/
                libmupdf.so
                libmupdfcpp.so [implements C++ API]
                mupdf.py [implements Python API]
                _mupdf.so [implements Python API internals]
        platform/
            c++/
                implementation/
                    *.cpp [generated files]
                include/
                    mupdf/
                        *.h [generated files]
            python/
                mupdfcpp_swig.cpp [generated by SWIG]
                mupdf_swig.i [generated by mupdfwrap.py, input to SWIG]
        scripts/
            mupdfwrap.py
            jlib.py
            mutool.py
            mutool_draw.py

See:

To build:

    cd mupdf/
    ./scripts/mupdfwrap.py -b all -t
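
Once the build has finished, the generated Python module sits in the build directory shown in the tree above. The sketch below assumes that scripts locate mupdf.py via PYTHONPATH and that the dynamic linker locates libmupdf.so and libmupdfcpp.so via LD_LIBRARY_PATH (a Linux convention). mupdf_env() is a hypothetical helper, not part of mupdfwrap.py.

```python
import os


def mupdf_env(mupdf_root, build='shared-release', base_env=None):
    """Return a copy of the environment with the build directory prepended
    to PYTHONPATH and LD_LIBRARY_PATH (assumed to be what is needed)."""
    env = dict(os.environ if base_env is None else base_env)
    build_dir = os.path.join(os.path.abspath(mupdf_root), 'build', build)
    for name in ('PYTHONPATH', 'LD_LIBRARY_PATH'):
        old = env.get(name)
        # Keep any existing entries, but search the build directory first.
        env[name] = build_dir if not old else build_dir + os.pathsep + old
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run() when launching a script that does `import mupdf`.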

Comparison with PyMuPDF

  • Am writing equivalent code to some example programmes in https://github.com/pymupdf/PyMuPDF-Utilities.
  • Method names are usually different, because PyMuPDF uses its own names instead of basing names on the underlying MuPDF API.
  • Have made various additions/fixes to mupdfwrap.py (for details see: https://git.ghostscript.com/?p=user/julian/mupdf.git;a=summary)
    • Added Document::lookup_metadata() method overload that returns std::string.
    • Added global const std::vector<std::string> metadata_keys.
    • Changed Outline iteration to include depth information.
    • Fixed ref-counting in Page::load_links().
    • Fixed Page::search_page() to return std::vector.
    • Added Python wrapper for PdfDocument::page_write() out-params.
    • Switched to an improved scheme for wrapping functions/methods with out-params. Instead of using SWIG's typemaps, which are very clumsy in the context of mupdfwrap.py and seem designed more for hand-written .i files, we now use simple auto-generated C functions to package out-params into a struct, which auto-generated Python code then unpacks into a tuple.
    • Provide two wrappers for mupdf.Buffer.buffer_extract(): one returns raw C (size, data) values, the other returns a Python bytes. The former can be used to construct a mupdf.Stream (it doesn't seem possible to convert a Python bytes back into (size, data)). [This allows us to mimic PyMuPDF-Utilities/demo/pdf-converter.py.]
  • PyMuPDF has more information about links - fitz.LINK_GOTO, fitz.LINK_GOTOR, fitz.LINK_LAUNCH, fitz.LINK_URI.
  • PyMuPDF has an abstraction for writing image files, which calls fz_save_pixmap_as_png() or fz_save_pixmap_as_pnm() etc., depending on the filename.
  • PyMuPDF can copy a TOC into a PdfDocument.
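
The out-params scheme described in the list above can be pictured with a small self-contained sketch. Everything here is hypothetical: DivideOutparams, divide_outparams() and divide() are made-up names, and the "C" function is written in Python so that the sketch runs on its own. In the real scheme that function would be auto-generated C code wrapping an fz_ or pdf_ function.

```python
import ctypes


class DivideOutparams(ctypes.Structure):
    """Hypothetical struct that collects the out-params of one function."""
    _fields_ = [
        ('quotient', ctypes.c_int),
        ('remainder', ctypes.c_int),
    ]


def divide_outparams(a, b, outparams):
    """Stand-in for the generated C helper: it writes results into the
    struct instead of taking one pointer argument per out-param."""
    outparams.quotient = a // b
    outparams.remainder = a % b
    return 0  # ordinary return value of the wrapped function


def divide(a, b):
    """Stand-in for the generated Python wrapper: it unpacks the struct
    into a tuple of (return value, out-param 1, out-param 2)."""
    outparams = DivideOutparams()
    ret = divide_outparams(a, b, outparams)
    return ret, outparams.quotient, outparams.remainder
```

The point of this design is that SWIG only has to wrap a plain struct and a function taking it, which it can do without custom typemaps; the tuple is built in ordinary Python.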

PyMuPDF: https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/demo/demo.py

Equivalent code using mupdfwrap:

#! /usr/bin/env python3

import mupdf

import os
import sys


assert len(sys.argv) == 7
filename, page_num, zoom, rotate, output, needle = sys.argv[1:]
page_num = int(page_num)
zoom = int(zoom)
rotate = int(rotate)

document = mupdf.Document(filename)

print('')
print(f'Document {filename} has {document.count_pages()} pages.')
print('')
print(f'Metadata Information:')
print(f'mupdf.metadata_keys={mupdf.metadata_keys}')
for key in mupdf.metadata_keys:
    value = document.lookup_metadata(key)
    print(f'    {key}: {value!r}')
print('')

outline = mupdf.Outline(document)
for o in outline:
    print(f'    {" "*4*o.m_depth}{o.m_depth}: {o.m_outline.title()}')

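# load_page() takes a zero-based page number, so the last valid value is count_pages() - 1.
if not 0 <= page_num < document.count_pages():
    raise SystemExit(f'page_num={page_num} is out of range - {filename} has {document.count_pages()} pages')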

page = document.load_page(page_num)
links = page.load_links()
if links:
    print(f'Links on page {page_num}:')
    for link in links:
        if link.m_internal:
            print(f'    extern={mupdf.is_external_link(link.uri())}: {link.uri()}')
else:
    print(f'No links on page {page_num}')

trans = mupdf.Matrix.scale(zoom / 100.0, zoom / 100.0).pre_rotate(rotate)

pixmap = page.new_pixmap_from_page(trans, mupdf.Colorspace(mupdf.Colorspace.Fixed_RGB), alpha=False)

def save_pixmap(path):
    suffix = os.path.splitext(path)[1]
    if 0: pass
    elif suffix == '.pam':   pixmap.save_pixmap_as_pam(path)
    elif suffix == '.pbm':   pixmap.save_pixmap_as_pbm(path)
    elif suffix == '.pcl':   pixmap.save_pixmap_as_pcl(path, append=0, options=mupdf.PclOptions())
    elif suffix == '.pclm':  pixmap.save_pixmap_as_pclm(path, append=0, options=mupdf.PclmOptions())
    elif suffix == '.pdfocr':pixmap.save_pixmap_as_pdfocr(path, append=0, options=mupdf.PdfocrOptions())
    elif suffix == '.pkm':   pixmap.save_pixmap_as_pkm(path)
    elif suffix == '.png':   pixmap.save_pixmap_as_png(path)
    elif suffix == '.pnm':   pixmap.save_pixmap_as_pnm(path)
    elif suffix == '.ppm':   pixmap.save_pixmap_as_ppm(path)
    elif suffix == '.ps':    pixmap.save_pixmap_as_ps(path, append=0)
    elif suffix == '.psd':   pixmap.save_pixmap_as_psd(path)
    elif suffix == '.pwg':   pixmap.save_pixmap_as_pwg(path, append=0, pwg=mupdf.PwgOptions())
    else:
        raise Exception(f'Unrecognised output format: {path}')
save_pixmap(output)
hit_quads = page.search_page(needle, max=16)
print(f'search text {needle!r} found {len(hit_quads)} on the page')
for hit_quad in hit_quads:
    pixmap.invert_pixmap_rect(hit_quad.rect_from_quad().irect_from_rect())
save_pixmap(f'dl-{output}')

print('finished')
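
The if/elif chain in save_pixmap() above is the mupdfwrap counterpart of PyMuPDF's filename-based abstraction. As a minimal sketch of the same idea, the method can be looked up from the suffix. save_pixmap_by_suffix() is a hypothetical helper; it covers only the formats whose save method takes just a path in the script above, and lower-casing the suffix is an added assumption.

```python
import os

# Formats whose save_pixmap_as_*() method takes only a path in the script above.
SIMPLE_SUFFIXES = ('.pam', '.pbm', '.pkm', '.png', '.pnm', '.ppm', '.psd')


def save_pixmap_by_suffix(pixmap, path):
    """Call pixmap.save_pixmap_as_<format>(path), choosing <format> from
    the suffix of path."""
    suffix = os.path.splitext(path)[1].lower()
    if suffix not in SIMPLE_SUFFIXES:
        raise ValueError(f'Unrecognised output format: {path}')
    # E.g. '.png' selects pixmap.save_pixmap_as_png.
    method = getattr(pixmap, f'save_pixmap_as_{suffix[1:]}')
    method(path)
```

Formats that need extra arguments (such as .pcl or .pwg with their options objects) would still need explicit handling as in save_pixmap().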


-- Julian Smith - 2020-03-04

Topic revision: r23 - 2021-02-05 - JulianSmith
 