Converting PDF to .docx.

Changelog

  • 2021-2-16: Extract can be built into mupdf and ghostpdl.
  • 2020-10-23: The extract library is now a mupdf git submodule.
  • 2020-10-8: Can now copy images into .docx files. We don't currently try to position them similarly to the PDF.
  • 2020-10-5: Improved rotated text.
  • 2020-10-2: We can now put rotated text into a rotated box.

Example input/output files:

Usage

If mutool / gs were built with extract support:

  • mutool convert -o foo.docx foo.pdf
    • (This uses fz_new_docx_writer() internally.)
  • gs -sDEVICE=docxwrite -o foo.docx foo.pdf

Older style usage:

  • Use either gs or mutool to output XML containing low-level spans (sequences of glyphs in a single font) and raw image data.
    • gs -sDEVICE=txtwrite -dTextFormat=4 -o foo.xml foo.pdf
    • mutool draw -F xmltext -o foo.xml foo.pdf
  • Use extract-exe to read this XML and apply various heuristics to extract text as paragraphs, and images, and create an output .docx file.
    • extract-exe -i foo.xml -o foo.docx
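
The older two-step route can be scripted. The sketch below only assembles and runs the command lines listed above; the function names and the use_gs switch are illustrative, and it assumes gs or mutool, and extract-exe, are on PATH.

```python
import subprocess


def older_style_commands(pdf, xml, docx, use_gs=False):
    """Return the two argv lists for the intermediate-XML route."""
    if use_gs:
        to_xml = ["gs", "-sDEVICE=txtwrite", "-dTextFormat=4", "-o", xml, pdf]
    else:
        to_xml = ["mutool", "draw", "-F", "xmltext", "-o", xml, pdf]
    to_docx = ["extract-exe", "-i", xml, "-o", docx]
    return [to_xml, to_docx]


def convert(pdf, xml, docx, use_gs=False):
    """Run both steps in order, raising if either tool fails."""
    for argv in older_style_commands(pdf, xml, docx, use_gs):
        subprocess.run(argv, check=True)
```

Calling convert("foo.pdf", "foo.xml", "foo.docx") would run the mutool variant followed by extract-exe.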

How it works

Extracting text works like this:

  • Split the low-level spans where glyphs are not adjacent.
  • Join spans into lines.
  • Join lines into paragraphs.
  • Create a .docx file containing these paragraphs. Rotated text is placed into rotated text boxes. Images are placed approximately in the right place, near paragraphs that were on the same page in the original PDF.
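
As a rough illustration of the first three steps, here is a much-simplified sketch. It is not the code used by extract: the Glyph type, the tolerances and the assumption of unrotated text with y increasing down the page are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge
    width: float  # advance width
    y: float      # baseline (assumed to increase down the page)


def make_glyphs(text, x, y, width=5.0):
    """Helper for the example: lay out text with a fixed advance width."""
    return [Glyph(c, x + i * width, width, y) for i, c in enumerate(text)]


def split_span(glyphs, tol=0.5):
    """Split a span wherever a glyph does not start where the previous one ended."""
    if not glyphs:
        return []
    pieces, current = [], [glyphs[0]]
    for prev, g in zip(glyphs, glyphs[1:]):
        if abs(g.x - (prev.x + prev.width)) <= tol and g.y == prev.y:
            current.append(g)
        else:
            pieces.append(current)
            current = [g]
    pieces.append(current)
    return pieces


def join_into_lines(spans):
    """Group spans that share a baseline and order them left to right."""
    by_baseline = {}
    for span in spans:
        by_baseline.setdefault(span[0].y, []).append(span)
    lines = []
    for y in sorted(by_baseline):
        ordered = sorted(by_baseline[y], key=lambda s: s[0].x)
        text = " ".join("".join(g.char for g in s) for s in ordered)
        lines.append((y, text))
    return lines


def join_into_paragraphs(lines, max_gap=14.0):
    """Merge consecutive lines whose baselines are at most max_gap apart."""
    paragraphs, prev_y = [], None
    for y, text in lines:
        if prev_y is not None and y - prev_y <= max_gap:
            paragraphs[-1] += " " + text
        else:
            paragraphs.append(text)
        prev_y = y
    return paragraphs
```

For instance, a single span holding "Hello" and "world" with a horizontal gap between them is split in two, rejoined as one line, and merged with a closely following line into one paragraph.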

This text extraction approach is similar to ghostpdl:devices/vector/gdevtxtw.c and mupdf's stext.

We only join spans, lines and paragraphs that have the same current transformation matrix (CTM).
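
To make the matrix comparison concrete, the sketch below groups spans by matrix and derives a rotation angle from it. It assumes a matrix is a 6-tuple (a, b, c, d, e, f) and that only the four scale/rotation components are compared; both are assumptions for illustration, as are the function names.

```python
import math
from collections import defaultdict


def rotation_degrees(ctm):
    """Rotation implied by a matrix (a, b, c, d, e, f), in degrees."""
    a, b = ctm[0], ctm[1]
    return math.degrees(math.atan2(b, a))


def group_by_ctm(spans, ndigits=3):
    """Group (ctm, text) pairs whose a, b, c, d components match after rounding."""
    groups = defaultdict(list)
    for ctm, text in spans:
        key = tuple(round(v, ndigits) for v in ctm[:4])
        groups[key].append(text)
    return dict(groups)
```

Spans in different groups would never be joined, so text rotated by 90 degrees stays separate from upright text even when the two are adjacent on the page.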

We also do some extra processing, such as removing spurious spaces that are overlapped by the next glyph and removing '-' at the end of lines.
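
The two clean-ups mentioned above might look roughly like this. It is a sketch only: glyphs are modelled as (char, x, width) tuples, and the rule that a space is spurious when the next glyph starts before the space's midpoint is a made-up threshold, not the one extract uses.

```python
def drop_overlapped_spaces(glyphs, overlap=0.5):
    """Remove spaces that the following glyph starts on top of."""
    kept = []
    for i, (char, x, width) in enumerate(glyphs):
        if char == " " and i + 1 < len(glyphs):
            next_x = glyphs[i + 1][1]
            if next_x < x + width * overlap:
                continue  # next glyph covers most of this space
        kept.append((char, x, width))
    return kept


def join_lines(lines):
    """Join lines into one string, dropping a '-' that ends a line."""
    out = ""
    for line in lines:
        if out.endswith("-"):
            out = out[:-1] + line  # rejoin the two halves of the word
        elif out:
            out += " " + line
        else:
            out = line
    return out
```

A rule this simple also merges words that are genuinely hyphenated at a line break, such as "well-" followed by "known".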

Creating a .docx file involves:

  • Using an internal template .docx file to create an in-memory file tree.
  • Writing paragraphs as XML into the tree's word/document.xml file.
  • Copying image files into the tree.
  • Creating internal references to these image files.
  • Zipping up the tree to form the generated .docx file.

We currently use no compression when zipping. The resulting .docx files appear to load fine in Word and LibreOffice.
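
The same idea can be sketched with Python's zipfile module: start from a small in-memory template, add word/document.xml, and zip the tree without compression. The template here is a hand-written minimal package, not the one embedded in extract, and images and their references are left out.

```python
import io
import zipfile
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Minimal stand-in for the internal template tree (an assumption for this sketch).
TEMPLATE = {
    "[Content_Types].xml": (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
        '<Default Extension="rels" ContentType='
        '"application/vnd.openxmlformats-package.relationships+xml"/>'
        '<Default Extension="xml" ContentType="application/xml"/>'
        '<Override PartName="/word/document.xml" ContentType='
        '"application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
        "</Types>"
    ),
    "_rels/.rels": (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
        '<Relationship Id="rId1" Type='
        '"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"'
        ' Target="word/document.xml"/>'
        "</Relationships>"
    ),
}


def build_docx(paragraphs):
    """Return the bytes of a .docx holding one plain paragraph per string."""
    body = "".join(
        f'<w:p><w:r><w:t xml:space="preserve">{escape(p)}</w:t></w:r></w:p>'
        for p in paragraphs
    )
    tree = dict(TEMPLATE)  # in-memory file tree: member name -> content
    tree["word/document.xml"] = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buf = io.BytesIO()
    # ZIP_STORED writes every member uncompressed.
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as z:
        for name, content in tree.items():
            z.writestr(name, content)
    return buf.getvalue()
```

Writing the result of build_docx(["First paragraph.", "Second paragraph."]) to a file with a .docx extension gives a package whose members are all stored uncompressed.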

Repository with source for extract-exe, test files etc.

API for extract.


-- Julian Smith - 2020-06-29

Topic revision: r19 - 2021-02-16 - JulianSmith
Copyright 2014 Artifex Software Inc