
C++ and Python APIs for MuPDF

Overview

The C++ MuPDF API

  • Auto-generated from the MuPDF C API.
  • Everything is in C++ namespace mupdf.
  • Provides C++ functions that wrap most fz_ and pdf_ functions.
  • Provides C++ classes that wrap most fz_ and pdf_ structs.
  • Class methods provide access to most of the underlying C API functions (except for functions that don't take struct args such as fz_strlcpy()).
  • fz_ exceptions are converted into C++ exceptions.
  • Functions and methods do not take fz_context arguments. (Automatically-generated per-thread contexts are used internally.)
  • Wrapper classes automatically handle reference counting of the underlying structs (with internal calls to fz_keep_*() and fz_drop_*()).
  • Provides a small number of extensions beyond the basic C API:
    • Some generated classes have extra support for iteration.
    • Some custom class methods and constructors.
    • Functions for generating a text representation of some simple 'POD' structs. For example for fz_rect we provide these functions:
              std::ostream& operator<< (std::ostream& out, const fz_rect& rhs);
              std::ostream& operator<< (std::ostream& out, const Rect& rhs);
              std::string to_string_fz_rect(const fz_rect& s);
              std::string to_string(const fz_rect& s);
              std::string Rect::to_string() const;
These each generate text such as: (x0=90.51 y0=160.65 x1=501.39 y1=215.6)

The Python MuPDF API

  • A Python module called mupdf.
  • Generated from the C++ MuPDF API's header files.
  • Allows implementation of mutool in Python - see mupdf:scripts/mutool.py and mupdf:scripts/mutool_draw.py.
  • Text representation for simple 'POD' structs.
          rect = mupdf.Rect(...)
          print(rect) # Will output text such as: (x0=90.51 y0=160.65 x1=501.39 y1=215.6)
This works for Python class wrappers for classes where the C++ API defines a to_string() method as described above; these Python classes will have a __str__() method.

API Stability

The C++ and Python MuPDF APIs are currently a beta release and liable to change.

Installing the Python mupdf module using pip

  • As of 2021-3-30, on Unix systems one can install the Python mupdf module using Python's standard package tool, pip:
    • pip install mupdf
  • This requires the SWIG tool to be installed. The install builds from source so will take a few minutes.
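Because the install builds from source and needs SWIG, it can help to check for SWIG before invoking pip. The following is a minimal sketch and not part of MuPDF: the helper names swig_available(), pip_install_command() and install_mupdf() are hypothetical, and it assumes SWIG is installed under the executable name swig.

```python
import shutil
import subprocess
import sys

def swig_available():
    '''
    Returns True if an executable called 'swig' is on PATH. The executable
    name is an assumption; some systems may use a different name.
    '''
    return shutil.which('swig') is not None

def pip_install_command(package='mupdf'):
    '''
    Returns the pip command line as a list, using the current interpreter.
    '''
    return [sys.executable, '-m', 'pip', 'install', package]

def install_mupdf():
    '''
    Runs pip only if SWIG was found; raises RuntimeError otherwise.
    '''
    if not swig_available():
        raise RuntimeError('SWIG not found on PATH; install it first.')
    subprocess.run(pip_install_command(), check=True)
```

Calling install_mupdf() then fails early with a clear message instead of part-way through the source build.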

Using the Python API

Minimal Python code that uses the mupdf module:

    import mupdf
    document = mupdf.Document('foo.pdf')
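
A small next step is to visit every page. The sketch below uses only count_pages() and load_page(), both of which appear in the Stext example on this page; the helper name iter_pages() is hypothetical and not part of the mupdf module.

```python
def iter_pages(document):
    '''
    Yields each page of <document> in order. <document> is expected to be a
    mupdf.Document, but any object with count_pages() and load_page()
    methods will work.
    '''
    for p in range(document.count_pages()):
        yield document.load_page(p)

# Possible use, assuming the mupdf module is installed and foo.pdf exists:
#
#   import mupdf
#   document = mupdf.Document('foo.pdf')
#   for page in iter_pages(document):
#       ...
```

Writing this as a generator means each page is loaded only when the caller asks for it.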

A simple example Python test script (run by scripts/mupdfwrap.py -t) is:

  • scripts/mupdfwrap_test.py

More detailed usage of the Python API can be found in:

  • scripts/mutool.py
  • scripts/mutool_draw.py

Here is some example code that shows all available information about a document's Stext blocks, lines and characters:

#!/usr/bin/env python3

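import mupdf

def log(text):
    '''
    Outputs a line of text. Defined here so that the example is
    self-contained.
    '''
    print(text)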

def show_stext(document):
    '''
    Shows all available information about Stext blocks, lines and characters.
    '''
    for p in range(document.count_pages()):
        page = document.load_page(p)
        stextpage = mupdf.StextPage(page, mupdf.StextOptions())
        for block in stextpage:
            block_ = block.m_internal
            log(f'block: type={block_.type} bbox={block_.bbox}')
            for line in block:
                line_ = line.m_internal
                log(f'    line: wmode={line_.wmode}'
                        + f' dir={line_.dir}'
                        + f' bbox={line_.bbox}'
                        )
                for char in line:
                    char_ = char.m_internal
                    log(f'        char: {chr(char_.c)!r} c={char_.c:4} color={char_.color}'
                            + f' origin={char_.origin}'
                            + f' quad={char_.quad}'
                            + f' size={char_.size:6.2f}'
                            + f' font=('
                                + f'is_mono={char_.font.flags.is_mono}'
                                + f' is_bold={char_.font.flags.is_bold}'
                                + f' is_italic={char_.font.flags.is_italic}'
                                + f' ft_substitute={char_.font.flags.ft_substitute}'
                                + f' ft_stretch={char_.font.flags.ft_stretch}'
                                + f' fake_bold={char_.font.flags.fake_bold}'
                                + f' fake_italic={char_.font.flags.fake_italic}'
                                + f' has_opentype={char_.font.flags.has_opentype}'
                                + f' invalid_bbox={char_.font.flags.invalid_bbox}'
                                + f' name={char_.font.name}'
                                + f')'
                            )

document = mupdf.Document('foo.pdf')
show_stext(document)

Changes in 2021 Q1

  • Changes that apply to both C++ and Python bindings:
    • Improved access to metadata - added Document::lookup_metadata() overload that returns a std::string. Also provided extern const std::vector<std::string> metadata_keys; containing a list of the supported keys.
    • Iterating over an Outline now returns OutlineIterator objects so that depth information is also available.
    • Fixed a reference-counting bug in iterators.
    • Page::search_page() now returns a std::vector.
    • PdfDocument now has a default constructor which uses pdf_create_document().
    • Include wrappers for functions that return fz_outline*, e.g. Outline Document::load_outline();.
    • Removed potentially slow call of getenv("MUPDF_trace") in every C++ wrapper function.
    • Removed special-case naming of wrappers for fz_run_page() - they are now called mupdf::run_page() and mupdf::Page::run_page(), not mupdf::run() etc.
    • Added text representation of POD structs.
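
The metadata change listed above might be used from Python roughly as follows. This is a sketch under assumptions: it assumes the Python wrapper exposes the new overload as document.lookup_metadata(key) returning a string, and that the key list is available as mupdf.metadata_keys. Neither Python spelling is confirmed by this page, and the helper collect_metadata() is hypothetical.

```python
def collect_metadata(document, keys):
    '''
    Returns a dict mapping each key in <keys> to the value returned by
    document.lookup_metadata(key).

    Assumption: lookup_metadata() accepts a single key and returns a string.
    '''
    return {key: document.lookup_metadata(key) for key in keys}

# Possible use (assumes mupdf.metadata_keys exists in the Python module):
#
#   import mupdf
#   document = mupdf.Document('foo.pdf')
#   for key, value in collect_metadata(document, mupdf.metadata_keys).items():
#       print(f'{key}: {value!r}')
```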

  • Changes that apply only to Python:
    • Improved handling of out-parameters:
      • If a function or method has out-parameters we now systematically return a Python tuple containing any return value followed by the out-parameters.
      • Don't treat FILE* or pointer-to-const as an out-parameter.
    • Added methods for getting the content of a mupdf.Buffer as a Python bytes instance.
    • Added Python access to nested unions in fz_stext_block wrapper class mupdf.StextBlock.
    • Allow the MuPDF Python bindings to be installed with pip.
      • This uses a source distribution of mupdf that has been uploaded to pypi.org in the normal way.
      • Installation involves compiling the C, C++ and Python bindings so will take a few minutes. It requires SWIG to be installed.
      • Pre-built wheels are not currently provided.
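
To illustrate the out-parameter convention described above without relying on any particular MuPDF function, here is a pure-Python sketch. The function parse_int() is a hypothetical stand-in for a wrapped C function of the form int parse_int(const char* text, int* o_value); it is not part of the mupdf module.

```python
def parse_int(text):
    '''
    Hypothetical stand-in for a wrapped C function:

        int parse_int(const char* text, int* o_value);

    In C the return value is a success flag and o_value is an out-parameter.
    Following the convention above, the Python version returns a tuple
    containing the return value followed by the out-parameter.
    '''
    try:
        return 1, int(text)
    except ValueError:
        return 0, 0

# Callers unpack the tuple: return value first, then out-parameters.
ok, value = parse_int('42')
```

The practical consequence is that callers unpack a tuple instead of passing in a pointer to be filled.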

Details

Building the C++ and Python MuPDF APIs directly

Requirements:

  • Linux or OpenBSD.
  • clang-python version 6 or 7. [For example Debian python-clang, OpenBSD py3-llvm.]
  • python3-dev version 3.6 or later.
  • SWIG version 3 or 4.

Build MuPDF shared library, C++ and Python MuPDF APIs, and run basic tests:

    git clone --recursive git://git.ghostscript.com/mupdf.git
    cd mupdf
    ./scripts/mupdfwrap.py -b all -t

As above but do a debug build:

    ./scripts/mupdfwrap.py -d build/shared-debug -b all -t

For more information:

  • Run ./scripts/mupdfwrap.py -h.
  • Read the doc-string at the beginning of scripts/mupdfwrap.py.

To use a direct build, run Python code with these environment variables set:

    PYTHONPATH=build/shared-release LD_LIBRARY_PATH=build/shared-release

(This enables Python to find the mupdf module, and enables the system dynamic linker to find the shared libraries that implement the underlying C, C++ and Python MuPDF APIs.)
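
The same environment can be set up from a Python launcher script. The sketch below is not part of MuPDF: the helper names direct_build_env() and run_with_direct_build() are hypothetical. It replaces any existing PYTHONPATH and LD_LIBRARY_PATH values, and the default relative path only works when the current directory is the top of the mupdf checkout.

```python
import os
import subprocess
import sys

def direct_build_env(build_dir='build/shared-release', base_env=None):
    '''
    Returns a copy of <base_env> (default: os.environ) with PYTHONPATH and
    LD_LIBRARY_PATH set to <build_dir>, replacing any existing values.
    '''
    env = dict(os.environ if base_env is None else base_env)
    env['PYTHONPATH'] = build_dir
    env['LD_LIBRARY_PATH'] = build_dir
    return env

def run_with_direct_build(script, build_dir='build/shared-release'):
    '''
    Runs the Python file <script> in a child process that uses the
    environment returned by direct_build_env().
    '''
    return subprocess.run(
            [sys.executable, script],
            env=direct_build_env(build_dir),
            check=True,
            )
```

Setting the variables only for the child process leaves the parent's environment untouched.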

Building auto-generated documentation

Build HTML documentation for the C, C++ and Python APIs (using Doxygen and pydoc):

    ./scripts/mupdfwrap.py --doc all

This will generate these documentation roots:

  • include/html/index.html [C API]
  • platform/c++/include/html/index.html [C++ API]
  • build/shared-release/mupdf.html [Python API]

The content is ultimately all generated from the MuPDF C header file comments.

How the build works

Building of MuPDF shared library:

  • Runs make internally.

Generation of the C++ MuPDF API:

  • Uses clang-python to parse MuPDF's C API.
  • Generates C++ code that wraps the basic C interface.
  • Generates C++ classes for each fz_ struct, and uses various heuristics to define constructors, methods and static methods that call fz_*() functions.
  • C header file comments are copied into the generated C++ header files.

Generation of the Python MuPDF API:

  • Based on the C++ MuPDF API.
  • Uses SWIG to parse the C++ headers and generate C++ and Python code.
  • Defines some custom-written Python functions and methods.
  • If SWIG is version 4+, C++ comments are converted into Python doc-comments.

Generated files

    mupdf/
        build/
            shared-release/    [Files needed at runtime]
                libmupdf.so    [implements C MuPDF API]
                libmupdfcpp.so [implements C++ MuPDF API]
                mupdf.py       [implements Python MuPDF API]
                _mupdf.so      [implements Python MuPDF API internals]
            shared-debug/
                [as shared-release but debug build]
        platform/
            c++/
                include/
                    mupdf/ [C++ MuPDF API header files]
                        classes.h
                        exceptions.h
                        functions.h
                        internal.h
                implementation/
                    *.cpp [MuPDF C++ implementation files]
            python/
                [SWIG build files]
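
If importing mupdf from a direct build fails, one quick check is whether the runtime files listed above are present. This is a minimal sketch and not part of MuPDF: the names RUNTIME_FILES and missing_runtime_files() are hypothetical, and the file names are taken from the tree above.

```python
from pathlib import Path

# Runtime files listed under build/shared-release/ in the tree above.
RUNTIME_FILES = ('libmupdf.so', 'libmupdfcpp.so', 'mupdf.py', '_mupdf.so')

def missing_runtime_files(build_dir):
    '''
    Returns a list of the expected runtime file names that are not present
    as regular files in <build_dir>.
    '''
    build_dir = Path(build_dir)
    return [name for name in RUNTIME_FILES if not (build_dir / name).is_file()]

# Possible use from the top of a mupdf checkout:
#
#   print(missing_runtime_files('build/shared-release'))
```

An empty list means all four files were found; it does not show that they were built correctly.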

Artifex Licensing

Artifex offers a dual licensing model for MuPDF: either a commercial license or the GNU Affero General Public License (AGPL).

While Open Source software may be free to use, that does not mean it is free of obligation. To determine whether your intended use of MuPDF is suitable for the AGPL, please read the full text of the AGPL license agreement on the FSF web site.

With a commercial license from Artifex, you maintain full ownership and control over your products and can distribute them to customers as you wish. You are not obligated to share your proprietary source code, which saves you from having to conform to the requirements and restrictions of the AGPL. For more information, please see our licensing page, or contact our sales team.


Please send any questions, comments or suggestions about this page to: julian.smith@artifex.com

Topic revision: r11 - 2021-04-17 - JulianSmith
 
Copyright 2014 Artifex Software Inc