Difference: PdfToDocx (16 vs. 17)

Revision 17 2020-10-20 - JulianSmith

Line: 1 to 1
 
META TOPICPARENT name="JulianSmith"

Converting pdf to docx.

Changed:
<
<

2020-10-8:

>
>

Changelog

 
Changed:
<
<
Can now copy images into .docx files. We don't currently try to position them similarly to the PDF.
>
>
  • 2020-10-8: Can now copy images into .docx files. We don't currently try to match their positions in the PDF.
  • 2020-10-5: Improved rotated text.
  • 2020-10-2: We can now put rotated text into a rotated box.
 
Changed:
<
<

2020-10-5:

Improved rotated text. E.g. see:

Compare with Acrobat output for same file:

2020-10-2:

We can now put rotated text into a rotated box.

Directory containing .pdf test files and .docx files with text extracted using gs and mutool to generate intermediate information, then using extract.exe to piece together text spans and write into .docx files:

>
>

Example input/output files:

 
Changed:
<
<
Repository with source for extract.exe, test files etc:
>
>

Usage

 
Changed:
<
<
>
>
  • Use either gs or mutool to output XML containing low-level spans (sequences of glyphs in a single font) and raw image data.
    • gs -sDEVICE=txtwrite -dTextFormat=4 -o foo.xml foo.pdf
    • mutool draw -F xmltext -o foo.xml foo.pdf
  • Use extract-exe to read this XML, apply various heuristics to extract text as paragraphs along with any images, and create an output .docx file.
    • extract-exe -i foo.xml -o foo.docx
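The two steps above can be chained from a script. The following is a minimal sketch, assuming gs, mutool and extract-exe are on PATH; the helper names (xml_command, docx_command, pdf_to_docx) are hypothetical and only wrap the command lines listed above.

```python
import subprocess
from pathlib import Path
from typing import List


def xml_command(pdf: str, xml: str, tool: str = "mutool") -> List[str]:
    """Build the command that writes low-level spans and image data as XML."""
    if tool == "gs":
        return ["gs", "-sDEVICE=txtwrite", "-dTextFormat=4", "-o", xml, pdf]
    if tool == "mutool":
        return ["mutool", "draw", "-F", "xmltext", "-o", xml, pdf]
    raise ValueError(f"unknown tool: {tool!r} (expected 'gs' or 'mutool')")


def docx_command(xml: str, docx: str) -> List[str]:
    """Build the extract-exe command that turns the XML into a .docx file."""
    return ["extract-exe", "-i", xml, "-o", docx]


def pdf_to_docx(pdf: str, docx: str, tool: str = "mutool") -> None:
    """Run both steps; the intermediate XML is written next to the output."""
    xml = str(Path(docx).with_suffix(".xml"))
    # check=True raises CalledProcessError if either tool fails.
    subprocess.run(xml_command(pdf, xml, tool), check=True)
    subprocess.run(docx_command(xml, docx), check=True)
```

For example, pdf_to_docx("foo.pdf", "foo.docx", tool="gs") would run the gs command followed by extract-exe, leaving foo.xml alongside foo.docx.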
 
Changed:
<
<
API for extract is:
>
>
Alternatively, if mutool was built with thirdparty/extract/ present, it can generate .docx files directly (this uses the same underlying code as extract-exe):
 
Changed:
<
<
>
>
  • mutool convert -o foo.docx foo.pdf
 
Changed:
<
<

>
>

How it works

 
Added:
>
>
Extracting text works like this:
 
Deleted:
<
<
The general approach here is:

  • Use new xmltext mupdf device to output a PDF file's native low-level spans (sequences of glyphs in a single font) as XML; in future it could also output images.
  • Run a separate external programme that reads this XML and applies various heuristics to extract text as paragraphs which are written into an output .docx file.

The separate external programme works like this:

  • Reads the XML created by the raw device.
 
  • Split the low-level spans where glyphs are not adjacent.
  • Join spans into lines.
  • Join lines into paragraphs.
Changed:
<
<
  • Create a .docx file containing these paragraphs.
>
>
  • Create a .docx file containing these paragraphs. Rotated text is placed into rotated text boxes. Images are placed on approximately the right page, but we don't currently attempt to position them more accurately than that.
 
Changed:
<
<
Text extraction is similar to what ghostpdl:devices/vector/gdevtxtw.c and mupdf's stext device does.
>
>
This text extraction is similar to ghostpdl:devices/vector/gdevtxtw.c and mupdf's stext.
  We only join spans, lines and paragraphs that have the same ctm (current transformation matrix).
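To illustrate the "join spans into lines" step and the same-ctm rule, here is a minimal sketch. It is not the code used by extract: the Span fields, the tolerances y_tol and gap_tol, and the function name are assumptions made for illustration, and the real heuristics are likely more involved.

```python
from dataclasses import dataclass
from typing import List, Tuple

# First four entries of the transformation matrix (a, b, c, d); hypothetical layout.
Ctm = Tuple[float, float, float, float]


@dataclass
class Span:
    text: str
    x0: float  # start of the span along the baseline
    x1: float  # end of the span along the baseline
    y: float   # baseline position
    ctm: Ctm


def join_spans_into_lines(spans: List[Span], y_tol: float = 1.0,
                          gap_tol: float = 2.0) -> List[Span]:
    """Merge spans that share a ctm and baseline into one span per line.

    Sorting on the exact y value is a simplification; spans whose
    baselines differ slightly may not end up in left-to-right order.
    """
    lines: List[Span] = []
    for span in sorted(spans, key=lambda s: (s.ctm, s.y, s.x0)):
        prev = lines[-1] if lines else None
        same_line = (
            prev is not None
            and prev.ctm == span.ctm             # never join across different ctms
            and abs(prev.y - span.y) <= y_tol    # roughly the same baseline
        )
        if same_line:
            # A visible horizontal gap is treated as a word break.
            sep = " " if span.x0 - prev.x1 > gap_tol else ""
            lines[-1] = Span(prev.text + sep + span.text,
                             prev.x0, span.x1, prev.y, prev.ctm)
        else:
            lines.append(span)
    return lines
```

Joining lines into paragraphs could follow the same pattern, comparing vertical gaps between consecutive lines that share a ctm instead of horizontal gaps between spans.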
Line: 59 to 41
  Creating a .docx file involves:
Changed:
<
<
  • Using an internal template .docx file.
  • writing paragraphs as XML into the tree's word/document.xml file.
  • zipping up the tree into the new .docx file.
>
>
  • Using an internal template .docx file to create an in-memory file tree.
  • Writing paragraphs as XML into the tree's word/document.xml file.
  • Copying image files into the tree.
  • Creating internal references to these image files.
  • Zipping up the tree to form the generated .docx file.

We currently use no compression when zipping. The resulting .docx files appear to load fine in Word and LibreOffice.
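The tree-then-zip approach can be sketched with Python's standard zipfile module, using ZIP_STORED so that no compression is applied. This is only an illustration: the three-file tree below is a minimal stand-in for the internal template, the function names are hypothetical, and the image-copying and image-reference steps are omitted.

```python
import io
import zipfile
from typing import Dict, List
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" ContentType='
    '"application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns='
    '"http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    '</Relationships>'
)


def document_xml(paragraphs: List[str]) -> str:
    """Write each paragraph as a <w:p> element holding a single text run."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(p)}</w:t></w:r></w:p>" for p in paragraphs
    )
    return (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )


def build_tree(paragraphs: List[str]) -> Dict[str, bytes]:
    """Create the in-memory file tree: archive path -> file contents."""
    return {
        "[Content_Types].xml": CONTENT_TYPES.encode("utf-8"),
        "_rels/.rels": RELS.encode("utf-8"),
        "word/document.xml": document_xml(paragraphs).encode("utf-8"),
    }


def zip_tree(tree: Dict[str, bytes]) -> bytes:
    """Zip the tree without compression and return the .docx bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as z:
        for name, data in tree.items():
            z.writestr(name, data)
    return buf.getvalue()
```

Writing the result of zip_tree(build_tree(["First paragraph.", "Second paragraph."])) to a file in binary mode gives a stored (uncompressed) archive with the paragraphs in word/document.xml.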

Repository with source for extract.exe, test files etc.

API for extract.

 
 