Auto-generated C++ and Python APIs for mupdf.
Status
As of 2021-2-4:
Customer page:
C++
- We generate C++ wrapper functions for most fz_ and pdf_ functions. These wrappers convert fz_ exceptions into C++ exceptions, and use auto-generated per-thread fz_context instances.
- We generate C++ class wrappers for most fz_ and pdf_ structs.
- We auto-detect fz_*() and pdf_*() fns suitable for wrapping as constructors, methods or static methods.
- Some generated classes have auto-generated support for iteration.
- We add various custom methods/constructors.
- Wrapper class constructors and methods provide access to 1270 fz_*() and pdf_*() fns, out of a total of 1513 wrapped fz_*() and pdf_*() functions. Most of the omitted functions don't take struct args, e.g. fz_strlcpy().
- The C++ API is built by mupdf:scripts/mupdfwrap.py. It requires clang-6 or clang-7, and python-clang.
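The iteration support mentioned in the list above can be pictured as a depth-first walk over a linked structure. The sketch below is only an illustration in Python, not the generated C++ code: it assumes entries linked by `next` (sibling) and `down` (child) pointers, as in an outline, and the names `Node` and `walk` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for an outline entry with sibling/child links."""
    title: str
    next: Optional["Node"] = None  # Next sibling at the same level.
    down: Optional["Node"] = None  # First child, one level deeper.


def walk(node: Optional[Node], depth: int = 0) -> Iterator[Tuple[int, Node]]:
    """Yield (depth, node) pairs in depth-first document order."""
    while node is not None:
        yield depth, node
        # Visit this entry's children before moving to its next sibling.
        yield from walk(node.down, depth + 1)
        node = node.next


# Chapter 1 contains Section 1.1 and is followed by Chapter 2.
section = Node("Section 1.1")
chapter2 = Node("Chapter 2")
chapter1 = Node("Chapter 1", next=chapter2, down=section)

for depth, node in walk(chapter1):
    print(f'{" " * 4 * depth}{depth}: {node.title}')
```

Yielding the depth alongside each entry lets the caller indent or filter by level without tracking the nesting itself.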
Python
- Python API is generated by running SWIG on the C++ API's header files.
- Python API is enough to allow implementation of mutool in Python - see mupdf:scripts/mutool.py and mupdf:scripts/mutool_draw.py.
- Building the Python API requires swig-3 or swig-4.
General
- We work on nuc1 and peeved and jules-laptop.
- We require:
  - python-clang (version 6 or 7)
  - python3-dev (version 3.6 or later)
  - swig (version 3 or 4)
Comments
- We use clang to extract doxygen-style comments, and propagate them into generated header files.
- If swig is version 4+, we tell it to propagate comments into the generated mupdf.py.
Here are Doxygen html representations of the mupdf C API and the generated mupdf C++ API:
And pydoc html representation of the generated mupdf.py API:
mutool.py
mupdf:scripts/mutool*.py are an incomplete Python re-implementation of the mutool application.
Files
Auto-generated C++ headers and implementation files, plus test outputs (.html files have syntax-colouring):
Information about fz_*() and pdf_*() fns that are not in the class-based API:
These were generated by the mupdfwrap.py programme, which also runs g++ and SWIG to generate a Python module that gives a Python API:
The generated Python module is tested by the (rather hacky) test_mupdfcpp_swig() function in mupdfwrap.py. For convenience, this function and its output can be viewed at https://ghostscript.com/~julian/mupdf/platform/python.
Integration with mupdf git.
mupdf/
    build/
        shared-release/
            libmupdf.so         [generated file]
            libmupdfcpp.so      [generated file, implements C++ API]
            mupdf.py            [generated file, implements Python API]
            _mupdf.so           [generated file, implements Python API internals]
        shared-debug/
            libmupdf.so
            libmupdfcpp.so      [implements C++ API]
            mupdf.py            [implements Python API]
            _mupdf.so           [implements Python API internals]
    platform/
        c++/
            implementation/
                *.cpp           [generated files]
            include/
                mupdf/
                    *.h         [generated files]
        python/
            mupdfcpp_swig.cpp   [generated by SWIG]
            mupdf_swig.i        [generated by mupdfwrap.py, input to SWIG]
    scripts/
        mupdfwrap.py
        jlib.py
        mutool.py
        mutool_draw.py
See:
To build:
cd mupdf/
./scripts/mupdfwrap.py -b all -t
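Once the build has finished, a script needs to find the generated mupdf.py. The sketch below shows one possible way to do that from Python, assuming the build placed its output in one of the build directories listed in the tree above. The helper name `add_build_to_sys_path` is hypothetical, and whether the dynamic linker can also locate libmupdf.so and libmupdfcpp.so (for example via LD_LIBRARY_PATH) is a further assumption that this sketch does not handle.

```python
import os
import sys


def add_build_to_sys_path(mupdf_root, build='shared-release'):
    """Put <mupdf_root>/build/<build> at the front of sys.path.

    Hypothetical helper: the directory layout is taken from the tree
    listed above; nothing here checks that the build actually exists.
    """
    directory = os.path.join(os.path.abspath(mupdf_root), 'build', build)
    if directory in sys.path:
        sys.path.remove(directory)
    sys.path.insert(0, directory)
    return directory


build_dir = add_build_to_sys_path('mupdf')
print(f'Looking for mupdf.py in: {build_dir}')
# A subsequent `import mupdf` would then search build_dir first.
```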
Comparison with PyMuPDF
- Am writing equivalent code to some example programmes in https://github.com/pymupdf/PyMuPDF-Utilities.
- Method names are usually different, because PyMuPDF uses its own names instead of basing names on the underlying MuPDF API.
- Have made various additions/fixes to mupdfwrap.py (for details see: https://git.ghostscript.com/?p=user/julian/mupdf.git;a=summary):
  - Added Document::lookup_metadata() method overload that returns std::string.
  - Added global const std::vector<std::string> metadata_keys.
  - Changed Outline iteration to include depth information.
  - Fixed ref-counting in Page::load_links().
  - Fixed Page::search_page() to return std::vector.
  - Added Python wrapper for PdfDocument::page_write() out-params.
  - Using improved scheme for wrapping functions/methods with out-params: instead of trying to use SWIG's typemaps, which are very clumsy in the context of mupdfwrap.py and seemingly more designed for custom-written .i files, we now use simple auto-generated C functions to package up out-params into a struct, then extract into a tuple in auto-generated Python.
  - Provide two wrappers for mupdf.Buffer.buffer_extract(): return raw C (size, data) values or return a Python bytes. The former can be used to construct a mupdf.Stream (it doesn't seem possible to convert a Python bytes back into (size, data)). [This allows us to mimic PyMuPDF-Utilities/demo/pdf-converter.py.]
- PyMuPDF has more information about links - fitz.LINK_GOTO, LINK_GOTOR, fitz.LINK_LAUNCH, fitz.LINK_URI.
- PyMuPDF has an abstraction for writing image files which calls fz_save_pixmap_as_png() or fz_save_pixmap_as_pnm() etc, depending on the filename.
- PyMuPDF can copy a TOC into a PdfDocument.
- Looks like PyMuPDF has fairly elaborate support for redactions. Makes use of pdf_redact_page() and then writes on top of the redaction?
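The out-param scheme described in the list above (a helper packages out-params into a struct, and a Python wrapper unpacks them into a tuple) can be sketched in pure Python as follows. This is an illustration of the idea only, not the generated code: the helper here is a Python stand-in for what would be an auto-generated C function, and `divide`, `divide_outparams_fn` and `DivideOutparams` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class DivideOutparams:
    """Hypothetical struct that packages the out-params of one function."""
    remainder: int = 0


def divide_outparams_fn(a: int, b: int, outparams: DivideOutparams) -> int:
    """Stand-in for an auto-generated C helper.

    Models a C function `int divide(int a, int b, int* remainder)`: the
    out-param is written into the struct and the quotient is returned.
    """
    outparams.remainder = a % b
    return a // b


def divide(a: int, b: int) -> Tuple[int, int]:
    """Stand-in for the auto-generated Python wrapper: returns a tuple."""
    outparams = DivideOutparams()
    quotient = divide_outparams_fn(a, b, outparams)
    # The plain return value comes first, followed by each out-param.
    return quotient, outparams.remainder


print(divide(7, 2))
```

Callers see an ordinary Python function returning a tuple, so no SWIG typemap is needed for the pointer argument.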
PyMuPDF:
https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/demo/demo.py
Equivalent code using mupdfwrap:
#! /usr/bin/env python3

import mupdf
import os
import sys

# Expect exactly six arguments after the script name.
assert len(sys.argv) == 7
filename, page_num, zoom, rotate, output, needle = sys.argv[1:]
page_num = int(page_num)
zoom = int(zoom)
rotate = int(rotate)

document = mupdf.Document(filename)

print('')
print(f'Document {filename} has {document.count_pages()} pages.')
print('')
print(f'Metadata Information:')
print(f'mupdf.metadata_keys={mupdf.metadata_keys}')
for key in mupdf.metadata_keys:
    value = document.lookup_metadata(key)
    print(f' {key}: {value!r}')
print('')

# Print the outline, indented by depth.
outline = mupdf.Outline(document)
for o in outline:
    print(f' {" "*4*o.m_depth}{o.m_depth}: {o.m_outline.title()}')
# load_page() takes a 0-based page number, so the last valid value is count_pages() - 1.
if page_num >= document.count_pages():
    raise SystemExit(f'page_num={page_num} is out of range - {filename} has {document.count_pages()} pages')
page = document.load_page(page_num)

# Print links.
links = page.load_links()
if links:
    print(f'Links on page {page_num}:')
    for link in links:
        if link.m_internal:
            print(f' extern={mupdf.is_external_link(link.uri())}: {link.uri()}')
else:
    print(f'No links on page {page_num}')

# Render the page with the requested zoom (in percent) and rotation.
trans = mupdf.Matrix.scale(zoom / 100.0, zoom / 100.0).pre_rotate(rotate)
pixmap = page.new_pixmap_from_page(trans, mupdf.Colorspace(mupdf.Colorspace.Fixed_RGB), alpha=False)

def save_pixmap(path):
    # Choose the save method from the filename suffix.
    suffix = os.path.splitext(path)[1]
    if 0: pass
    elif suffix == '.pam': pixmap.save_pixmap_as_pam(path)
    elif suffix == '.pbm': pixmap.save_pixmap_as_pbm(path)
    elif suffix == '.pcl': pixmap.save_pixmap_as_pcl(path, append=0, options=mupdf.PclOptions())
    elif suffix == '.pclm': pixmap.save_pixmap_as_pclm(path, append=0, options=mupdf.PclmOptions())
    elif suffix == '.pdfocr': pixmap.save_pixmap_as_pdfocr(path, append=0, options=mupdf.PdfocrOptions())
    elif suffix == '.pkm': pixmap.save_pixmap_as_pkm(path)
    elif suffix == '.png': pixmap.save_pixmap_as_png(path)
    elif suffix == '.pnm': pixmap.save_pixmap_as_pnm(path)
    elif suffix == '.ppm': pixmap.save_pixmap_as_ppm(path)
    elif suffix == '.ps': pixmap.save_pixmap_as_ps(path, append=0)
    elif suffix == '.psd': pixmap.save_pixmap_as_psd(path)
    elif suffix == '.pwg': pixmap.save_pixmap_as_pwg(path, append=0, pwg=mupdf.PwgOptions())
    else:
        raise Exception(f'Unrecognised output format: {path}')

save_pixmap(output)

# Search for the needle and invert each hit in the rendered pixmap.
hit_quads = page.search_page(needle, max=16)
print(f'search text {needle!r} found {len(hit_quads)} on the page')
for hit_quad in hit_quads:
    pixmap.invert_pixmap_rect(hit_quad.rect_from_quad().irect_from_rect())

save_pixmap(f'dl-{output}')

print('finished')
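The suffix handling in save_pixmap() above could also be written as a lookup table, which is closer in spirit to a single "save by filename" abstraction. The sketch below is one possible shape, not part of the script: it covers only the formats whose save methods take just a path, and `SAVE_METHODS`, `save_method_name` and `RecordingPixmap` are hypothetical names. `RecordingPixmap` stands in for a real pixmap so that the sketch runs without the mupdf module.

```python
import os

# Suffixes whose save methods take only a path in the script above.
SAVE_METHODS = {
    '.pam': 'save_pixmap_as_pam',
    '.pbm': 'save_pixmap_as_pbm',
    '.pkm': 'save_pixmap_as_pkm',
    '.png': 'save_pixmap_as_png',
    '.pnm': 'save_pixmap_as_pnm',
    '.ppm': 'save_pixmap_as_ppm',
    '.psd': 'save_pixmap_as_psd',
}


def save_method_name(path):
    """Return the name of the save method matching the suffix of path."""
    suffix = os.path.splitext(path)[1]
    if suffix not in SAVE_METHODS:
        raise ValueError(f'Unrecognised output format: {path}')
    return SAVE_METHODS[suffix]


def save_pixmap(pixmap, path):
    """Look up the save method by name on the pixmap and call it."""
    getattr(pixmap, save_method_name(path))(path)


class RecordingPixmap:
    """Hypothetical stand-in that records calls instead of writing files."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that do not exist, i.e. the save methods.
        def record(path):
            self.calls.append((name, path))
        return record


pixmap = RecordingPixmap()
save_pixmap(pixmap, 'page.png')
print(pixmap.calls)
```

Formats that need extra arguments (such as the append flag or an options object) would need a richer table entry, for example a callable per suffix.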
--
Julian Smith - 2020-03-04