Converting pdf to docx.


The general approach is to run mutool.py draw -F stext to get an XML representation of the lines and blocks of text in a .pdf document, then apply simple heuristics to group that text into 'runs' that share the same font. Each run can then readily be turned into markup that can be embedded inside a .docx file.
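
As a rough illustration of the idea (not the actual ptodoc code), the sketch below groups text into same-font runs and emits a WordprocessingML paragraph. It assumes the stext XML nests block > line > font > char, with the character held in a c attribute. The sample input is hand-written and trimmed to those attributes. The function names stext_to_runs and runs_to_docx_paragraph are hypothetical, and treating a line break inside a block as a soft wrap is an assumed heuristic.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hand-written sample in the assumed stext layout; real output carries
# more attributes (bounding boxes, per-character positions and so on).
SAMPLE_STEXT = """<document>
<page id="page1" width="612" height="792">
<block bbox="72 72 300 100">
<line bbox="72 72 300 86">
<font name="Times-Roman" size="12">
<char c="H"/><char c="i"/><char c=" "/>
</font>
<font name="Times-Bold" size="12">
<char c="a"/><char c="l"/><char c="l"/>
</font>
</line>
<line bbox="72 86 300 100">
<font name="Times-Bold" size="12">
<char c="y"/><char c="o"/><char c="u"/>
</font>
</line>
</block>
</page>
</document>"""


def stext_to_runs(stext_xml):
    """Return one list of (font_name, font_size, text) runs per <block>."""
    root = ET.fromstring(stext_xml)
    paragraphs = []
    for block in root.iter("block"):
        runs = []  # each entry is a mutable [name, size, text]
        for line in block.iter("line"):
            # Assumed heuristic: a new line inside a block is a soft wrap,
            # so join it to the previous text with a single space.
            if runs and not runs[-1][2].endswith(" "):
                runs[-1][2] += " "
            for font in line.iter("font"):
                name = font.get("name")
                size = float(font.get("size"))
                text = "".join(ch.get("c", "") for ch in font.iter("char"))
                if not text:
                    continue
                if runs and runs[-1][0] == name and runs[-1][1] == size:
                    runs[-1][2] += text  # same font: extend the current run
                else:
                    runs.append([name, size, text])
        paragraphs.append([tuple(run) for run in runs])
    return paragraphs


def runs_to_docx_paragraph(runs):
    """Build a <w:p> fragment with one <w:r> per run."""
    parts = ["<w:p>"]
    for name, size, text in runs:
        font = quoteattr(name)          # includes the surrounding quotes
        half_points = round(size * 2)   # w:sz is measured in half-points
        parts.append(
            f"<w:r><w:rPr><w:rFonts w:ascii={font} w:hAnsi={font}/>"
            f'<w:sz w:val="{half_points}"/></w:rPr>'
            f'<w:t xml:space="preserve">{escape(text)}</w:t></w:r>'
        )
    parts.append("</w:p>")
    return "".join(parts)


if __name__ == "__main__":
    for paragraph in stext_to_runs(SAMPLE_STEXT):
        print(runs_to_docx_paragraph(paragraph))
```

The resulting fragment is only valid inside a word/document.xml whose root element declares the w namespace prefix. Packaging that file into a .docx archive is not shown here.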

mupdf:source/tools/ptodoc.c implements this in C.

mupdf:scripts/ptodoc.py does this in Python, and also runs some tests, including building and running source/tools/ptodoc.c.

-- Julian Smith - 2020-06-29


Topic revision: r1 - 2020-06-29 - JulianSmith