Converting PDF to docx.

2020-10-08:

We can now copy images into .docx files. We do not yet try to position them as they appear in the PDF.

2020-10-05:

Improved rotated text. E.g. see:

Compare with Acrobat output for same file:

2020-10-02:

We can now put rotated text into a rotated box.

Directory containing the .pdf test files and the resulting .docx files. Intermediate information is generated with gs and mutool; extract.exe then pieces together the text spans and writes them into the .docx files:

Repository with source for extract.exe, test files etc:

API for extract is:


The general approach here is:

  • Use the new mupdf xmltext device to output a PDF file's native low-level spans (sequences of glyphs in a single font) as XML; in future it could also output images.
  • Run a separate external programme that reads this XML and applies various heuristics to extract the text as paragraphs, which are written into an output .docx file.

The separate external programme works like this:

  • Read the XML created by the raw device.
  • Split the low-level spans where glyphs are not adjacent.
  • Join spans into lines.
  • Join lines into paragraphs.
  • Create a .docx file containing these paragraphs.
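
The splitting and joining steps above can be illustrated with a minimal Python sketch. This is not the code in extract.exe: the Glyph record, the tolerance values, the function names and the assumption that y increases down the page are all hypothetical, chosen only to show the shape of the heuristics.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    # Hypothetical glyph record; the real XML carries more information.
    char: str
    x: float        # left edge
    y: float        # baseline (assumed to increase down the page)
    advance: float  # horizontal advance width

def split_spans(glyphs, tolerance=0.5):
    """Split a glyph sequence wherever a glyph does not start where the previous one ended."""
    spans, current = [], []
    for g in glyphs:
        if current:
            prev = current[-1]
            adjacent = (abs(g.y - prev.y) <= tolerance
                        and abs(g.x - (prev.x + prev.advance)) <= tolerance)
            if not adjacent:
                spans.append(current)
                current = []
        current.append(g)
    if current:
        spans.append(current)
    return spans

def join_lines(spans, tolerance=0.5):
    """Group spans that share a baseline into lines, each ordered left to right."""
    lines = []
    for span in sorted(spans, key=lambda s: s[0].y):
        if lines and abs(lines[-1][0][0].y - span[0].y) <= tolerance:
            lines[-1].append(span)
        else:
            lines.append([span])
    for line in lines:
        line.sort(key=lambda s: s[0].x)
    return lines

def line_text(line):
    """Text of one line, with a single space between its spans."""
    return " ".join("".join(g.char for g in span) for span in line)

def join_paragraphs(lines, max_gap=14.0):
    """Join consecutive lines whose baselines are at most max_gap apart."""
    paragraphs, prev_y = [], None
    for line in lines:
        y = line[0][0].y
        if prev_y is not None and y - prev_y <= max_gap:
            paragraphs[-1].append(line)
        else:
            paragraphs.append([line])
        prev_y = y
    return [" ".join(line_text(l) for l in p) for p in paragraphs]
```

A real implementation would also need to consider font size, indentation and rotation when deciding where lines and paragraphs end; the fixed thresholds here are placeholders.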

Text extraction is similar to what ghostpdl:devices/vector/gdevtxtw.c and mupdf's stext device do.

We only join spans, lines and paragraphs that have the same CTM (current transformation matrix).
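
As a rough illustration of this rule, the sketch below groups consecutive spans into runs that share a matrix, so that later joining only ever happens within one run. The dict layout with a "ctm" tuple is an assumption made for the example, not the data structure used by extract.exe, and the comparison is exact, where real code might want a tolerance.

```python
from itertools import groupby

def group_by_ctm(spans):
    """Return runs of consecutive spans whose 'ctm' tuples are equal.

    Each span is a hypothetical dict such as
    {"text": "abc", "ctm": (1.0, 0.0, 0.0, 1.0)}.
    """
    return [list(run) for _, run in groupby(spans, key=lambda s: s["ctm"])]

spans = [
    {"text": "upright", "ctm": (1.0, 0.0, 0.0, 1.0)},
    {"text": "text", "ctm": (1.0, 0.0, 0.0, 1.0)},
    {"text": "rotated", "ctm": (0.0, 1.0, -1.0, 0.0)},
]
runs = group_by_ctm(spans)
```

Here the two upright spans land in one run and the rotated span in another, so rotated text is never merged into an unrotated line.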

We also do some extra processing, such as removing spurious spaces that are overlapped by the next glyph, removing '-' at the end of lines, etc.
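
The two clean-up steps mentioned here might look roughly like the following sketch. The glyph tuples, the tolerance and the exact definition of "overlapped" (the next glyph starts before the space's right edge) are assumptions for illustration; the real heuristics may differ.

```python
def drop_overlapped_spaces(glyphs, tolerance=0.1):
    """Drop a space glyph when the next glyph starts inside the space's box.

    glyphs is a list of hypothetical (char, x, advance) tuples.
    """
    kept = []
    for i, (char, x, advance) in enumerate(glyphs):
        if char == " " and i + 1 < len(glyphs):
            next_x = glyphs[i + 1][1]
            if next_x < x + advance - tolerance:
                continue  # the space is covered by the next glyph
        kept.append((char, x, advance))
    return kept

def join_line_texts(lines):
    """Join line strings into one string, removing '-' at the end of a line."""
    out = ""
    for text in lines:
        if out.endswith("-"):
            out = out[:-1] + text  # treat as a word split across lines
        elif out:
            out += " " + text
        else:
            out = text
    return out
```

Note that this simple de-hyphenation also merges genuinely hyphenated words that happen to break at the hyphen, so "low-" followed by "level" becomes "lowlevel".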

Creating a .docx file involves:

  • Using an internal template .docx file.
  • Writing paragraphs as XML into the tree's word/document.xml file.
  • Zipping up the tree into the new .docx file.
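
The steps above can be sketched with Python's standard zipfile module. This is only an illustration under stated assumptions: the function names are hypothetical, the template is taken to be an existing .docx archive, and the whole of word/document.xml is replaced with a bare list of paragraphs, whereas a real template's document.xml would normally contain further settings that need to be preserved.

```python
import zipfile
from xml.sax.saxutils import escape

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def document_xml(paragraphs):
    """Build a minimal word/document.xml with one w:p element per paragraph."""
    body = "".join(
        '<w:p><w:r><w:t xml:space="preserve">%s</w:t></w:r></w:p>' % escape(p)
        for p in paragraphs
    )
    return (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (W_NS, body)
    )

def write_docx(template, output, paragraphs):
    """Copy every entry of the template .docx except word/document.xml,
    then add a freshly generated word/document.xml.

    template and output may be paths or binary file-like objects.
    """
    with zipfile.ZipFile(template) as src, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename == "word/document.xml":
                continue  # replaced below
            dst.writestr(item, src.read(item.filename))
        dst.writestr("word/document.xml", document_xml(paragraphs))
```

Paragraph text is passed through escape() so that characters such as '&' and '<' in the extracted text do not produce invalid XML.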


-- Julian Smith - 2020-06-29

Topic revision: r16 - 2020-10-08 - JulianSmith
 