Converting pdf to docx.

Changelog

  • 2020-10-23: The extract library is now a mupdf git submodule.
  • 2020-10-08: Can now copy images into .docx files. We don't currently try to position them similarly to the PDF.
  • 2020-10-05: Improved rotated text.
  • 2020-10-02: We can now put rotated text into a rotated box.

Example input/output files:

Usage

  • Use either gs or mutool to output XML containing low-level spans (sequences of glyphs in a single font) and raw image data.
    • gs -sDEVICE=txtwrite -dTextFormat=4 -o foo.xml foo.pdf
    • mutool draw -F xmltext -o foo.xml foo.pdf
  • Use extract-exe to read this XML, apply various heuristics to extract paragraphs of text and images, and create an output .docx file.
    • extract-exe -i foo.xml -o foo.docx

Alternatively, if mutool was built with the thirdparty/extract git submodule, it can generate .docx files directly (this uses the same underlying code as extract-exe):

  • mutool convert -o foo.docx foo.pdf
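
The two-step pipeline above can be scripted. The following is a minimal sketch, assuming gs or mutool and extract-exe are on the PATH; the function names are hypothetical, and only the command-line flags shown above are used.

```python
import subprocess


def xml_command(pdf_path, xml_path, tool="mutool"):
    """Return the argv for step 1 (PDF -> XML spans), using only the flags listed above."""
    if tool == "gs":
        return ["gs", "-sDEVICE=txtwrite", "-dTextFormat=4", "-o", xml_path, pdf_path]
    if tool == "mutool":
        return ["mutool", "draw", "-F", "xmltext", "-o", xml_path, pdf_path]
    raise ValueError(f"unknown tool: {tool!r}")


def docx_command(xml_path, docx_path):
    """Return the argv for step 2 (XML spans -> .docx)."""
    return ["extract-exe", "-i", xml_path, "-o", docx_path]


def pdf_to_docx(pdf_path, xml_path, docx_path, tool="mutool", run=subprocess.run):
    """Run both steps in order; check=True raises if either command fails."""
    for argv in (xml_command(pdf_path, xml_path, tool), docx_command(xml_path, docx_path)):
        run(argv, check=True)
```

For example, pdf_to_docx("foo.pdf", "foo.xml", "foo.docx", tool="gs") would run the gs command followed by extract-exe. The run parameter exists only so the commands can be inspected without executing them.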

How it works

Extracting text works like this:

  • Split the low-level spans where glyphs are not adjacent.
  • Join spans into lines.
  • Join lines into paragraphs.
  • Create a .docx file containing these paragraphs. Rotated text is placed into rotated text boxes. Images are placed only approximately, near paragraphs that were on the same page in the original PDF.
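
The first three steps above can be illustrated with a much-simplified sketch. It is not the extract library's code: it assumes unrotated horizontal text, y increasing down the page, and hypothetical names (Glyph, split_span, join_lines, join_paragraphs) and tolerances chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge
    y: float      # baseline
    width: float  # advance width


def split_span(glyphs, tol=0.5):
    """Split a span wherever a glyph does not start where the previous one ended."""
    pieces, current = [], []
    for g in glyphs:
        if current:
            prev = current[-1]
            adjacent = abs(g.y - prev.y) <= tol and abs(g.x - (prev.x + prev.width)) <= tol
            if not adjacent:
                pieces.append(current)
                current = []
        current.append(g)
    if current:
        pieces.append(current)
    return pieces


def join_lines(spans, y_tol=0.5):
    """Group spans that share a baseline into lines, each ordered left to right."""
    lines = []
    for span in sorted(spans, key=lambda s: s[0].y):
        if lines and abs(lines[-1][0][0].y - span[0].y) <= y_tol:
            lines[-1].append(span)
        else:
            lines.append([span])
    for line in lines:
        line.sort(key=lambda s: s[0].x)
    return lines


def join_paragraphs(lines, max_gap=15.0):
    """Merge consecutive lines whose baselines are at most max_gap apart."""
    paragraphs, prev_y = [], None
    for line in lines:
        y = line[0][0].y
        text = " ".join("".join(g.char for g in span) for span in line)
        if prev_y is not None and y - prev_y <= max_gap:
            paragraphs[-1] += " " + text
        else:
            paragraphs.append(text)
        prev_y = y
    return paragraphs
```

A real implementation has to work in the coordinate system of each span's ctm and choose gaps relative to font size; the fixed tolerances here are placeholders.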

This text extraction approach is similar to ghostpdl:devices/vector/gdevtxtw.c and mupdf's stext.

We only join spans, lines and paragraphs that have the same current transformation matrix (ctm).
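
As an illustration of matching on the ctm, the sketch below groups spans by matrix and derives a rotation angle from it. It assumes the matrix is given as the usual six values (a, b, c, d, e, f) and that only the four linear components a, b, c, d are compared, ignoring the translation; that choice, the rounding, and the function names are assumptions made for this example, not a description of the library.

```python
import math
from collections import defaultdict


def ctm_key(ctm, ndigits=4):
    """Key built from the linear part (a, b, c, d) of the matrix, rounded to absorb float noise."""
    return tuple(round(v, ndigits) for v in ctm[:4])


def group_by_ctm(spans):
    """spans is a list of (text, ctm) pairs; only spans with equal keys may be joined."""
    groups = defaultdict(list)
    for text, ctm in spans:
        groups[ctm_key(ctm)].append(text)
    return dict(groups)


def rotation_degrees(ctm):
    """Angle of the transformed x axis, in degrees, in the matrix's own coordinate system."""
    a, b = ctm[0], ctm[1]
    return math.degrees(math.atan2(b, a))
```

An angle like this is the kind of value needed to rotate a text box so that it matches rotated text.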

We also do some extra processing, such as removing spurious spaces that are overlapped by the next glyph and removing '-' at the end of lines.
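
Both clean-up steps can be sketched in a few lines. This is a simplified illustration with hypothetical function names, glyphs reduced to (char, x, width) tuples, and an assumed overlap tolerance; the library's actual rules may differ.

```python
def drop_overlapped_spaces(glyphs, tol=0.1):
    """Drop a space glyph when the next glyph starts before the space's right edge."""
    kept = []
    for i, (char, x, width) in enumerate(glyphs):
        if char == " " and i + 1 < len(glyphs):
            next_x = glyphs[i + 1][1]
            if next_x < x + width - tol:
                continue  # the next glyph overlaps this space, so treat it as spurious
        kept.append((char, x, width))
    return kept


def join_line_texts(lines):
    """Join line strings into one paragraph, removing a '-' that ends a line."""
    out = ""
    for line in lines:
        if out.endswith("-"):
            out = out[:-1] + line  # re-join the word split across the line break
        elif out:
            out += " " + line
        else:
            out = line
    return out
```

Note that always dropping a trailing '-' also merges genuinely hyphenated compounds that happen to break at a line end; this sketch makes no attempt to tell the two cases apart.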

Creating a .docx file involves:

  • Using an internal template .docx file to create an in-memory file tree.
  • Writing paragraphs as XML into the tree's word/document.xml file.
  • Copying image files into the tree.
  • Creating internal references to these image files.
  • Zipping up the tree to form the generated .docx file.

We currently use no compression when zipping. The resulting .docx files appear to load fine in Word and LibreOffice.
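
The tree-then-zip approach can be sketched with Python's standard zipfile module. This is not the extract code: instead of an internal template it builds the smallest file tree believed to be sufficient for a text-only .docx, it omits images (which would also need media files and relationship entries), and the function name build_docx is hypothetical. Entries are stored without compression, as described above.

```python
import io
import zipfile
from xml.sax.saxutils import escape

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-'
    'officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)
RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/'
    'relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_docx(paragraphs):
    """Return the bytes of a minimal .docx containing one <w:p> per paragraph string."""
    body = "".join(
        f'<w:p><w:r><w:t xml:space="preserve">{escape(p)}</w:t></w:r></w:p>'
        for p in paragraphs
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    # In-memory file tree: archive path -> content.
    tree = {
        "[Content_Types].xml": CONTENT_TYPES,
        "_rels/.rels": RELS,
        "word/document.xml": document,
    }
    buffer = io.BytesIO()
    # ZIP_STORED writes every entry uncompressed.
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, content in tree.items():
            zf.writestr(name, content)
    return buffer.getvalue()
```

Writing the result of build_docx(["First paragraph", "Second paragraph"]) to a file with a .docx extension gives a document that can be checked in a word processor.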

Repository with source for extract-exe, test files etc.

API for extract.


-- Julian Smith - 2020-06-29

Topic revision: r18 - 2020-10-26 - JulianSmith
 