Difference: PdfToDocx (17 vs. 18)

Revision 18 2020-10-26 - JulianSmith

Line: 1 to 1
 
META TOPICPARENT name="JulianSmith"

Converting pdf to docx.

Changelog

Added:
>
>
  • 2020-10-23: The extract library is now a mupdf git submodule.
 
  • 2020-10-8: Can now copy images into .docx files. We don't currently try to position them similarly to the PDF.
  • 2020-10-5: Improved rotated text.
  • 2020-10-2: We can now put rotated text into a rotated box.
Line: 20 to 21
 
  • Use extract-exe to read this XML, apply various heuristics to extract text (as paragraphs) and images, and create an output .docx file.
    • extract-exe -i foo.xml -o foo.docx
Changed:
<
<
Alternatively if mutool was built with thirdparty/extract/ present, it can generate .docx files directly (this uses the same underlying code as extract-exe):
>
>
Alternatively if mutool was built with the thirdparty/extract git submodule, it can generate .docx files directly (this uses the same underlying code as extract-exe):
 
  • mutool convert -o foo.docx foo.pdf
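
The two command lines shown here can also be driven from a script. The following is a minimal sketch in Python, assuming extract-exe and mutool are on PATH. The helper names (build_extract_command, build_mutool_command, run) are hypothetical, and only the flags that appear on this page are used.

```python
import subprocess


def build_extract_command(xml_path, docx_path):
    # Mirrors the command on this page: extract-exe -i foo.xml -o foo.docx
    return ["extract-exe", "-i", xml_path, "-o", docx_path]


def build_mutool_command(pdf_path, docx_path):
    # Mirrors the command on this page: mutool convert -o foo.docx foo.pdf
    # Only works if mutool was built with the extract submodule.
    return ["mutool", "convert", "-o", docx_path, pdf_path]


def run(command):
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(command, check=True)
```

For example, run(build_mutool_command("foo.pdf", "foo.docx")) would perform the direct conversion. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.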
Line: 31 to 32
 
  • Split the low-level spans where glyphs are not adjacent.
  • Join spans into lines.
  • Join lines into paragraphs.
Changed:
<
<
  • Create a .docx file containing these paragraphs. Rotated text is placed into rotated text boxes. Images are placed approximately in the right page, but currently we don't attempt to place them more accurately than that.
>
>
  • Create a .docx file containing these paragraphs. Rotated text is placed into rotated text boxes. Images are placed approximately in the right place, near paragraphs that were on the same page in the original PDF.
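
The first two steps of the list above (splitting spans at non-adjacent glyphs, then joining spans into lines) might look roughly like the sketch below. This is an illustration only, not the extract library's code: the Glyph type, the tolerance values and the function names are all assumptions, and it handles only unrotated, left-to-right text.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float        # left edge on the baseline (hypothetical units)
    y: float        # baseline position
    advance: float  # horizontal advance to the next glyph


def split_span(glyphs, gap_tolerance=0.5):
    """Split a span wherever a glyph does not start where the previous one ended."""
    spans, current = [], []
    for g in glyphs:
        if current:
            prev = current[-1]
            expected_x = prev.x + prev.advance
            not_adjacent = (abs(g.x - expected_x) > gap_tolerance
                            or abs(g.y - prev.y) > gap_tolerance)
            if not_adjacent:
                spans.append(current)
                current = []
        current.append(g)
    if current:
        spans.append(current)
    return spans


def span_text(span):
    return "".join(g.char for g in span)


def join_spans_into_lines(spans, baseline_tolerance=0.5):
    """Group spans whose baselines match, then order each line left to right."""
    lines = []
    for span in sorted(spans, key=lambda s: s[0].y):
        if lines and abs(lines[-1][-1][0].y - span[0].y) <= baseline_tolerance:
            lines[-1].append(span)
        else:
            lines.append([span])
    for line in lines:
        line.sort(key=lambda s: s[0].x)
    return lines
```

Joining lines into paragraphs would need further heuristics (for example line spacing and indentation), which this sketch does not attempt.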
 
Changed:
<
<
This text extraction is similar to ghostpdl:devices/vector/gdevtxtw.c and mupdf's stext.
>
>
This text extraction approach is similar to ghostpdl:devices/vector/gdevtxtw.c and mupdf's stext.
  We only join spans, lines and paragraphs that have the same CTM (current transformation matrix).
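
One way to picture the same-matrix rule is to bucket spans by their matrix before any joining takes place, so that rotated and unrotated text never end up in the same line or paragraph. The sketch below assumes each span is a dict with a "ctm" entry holding the four linear components (a, b, c, d). That representation, and the use of exact equality, are assumptions made for illustration; a real implementation may compare matrices differently.

```python
from collections import defaultdict


def group_spans_by_ctm(spans):
    """Return a dict mapping each distinct CTM to the spans that use it."""
    groups = defaultdict(list)
    for span in spans:
        # Tuples are hashable, so the matrix itself can serve as the key.
        groups[tuple(span["ctm"])].append(span)
    return dict(groups)


IDENTITY = (1.0, 0.0, 0.0, 1.0)
ROTATED_90 = (0.0, 1.0, -1.0, 0.0)

spans = [
    {"text": "Hello", "ctm": IDENTITY},
    {"text": "sideways", "ctm": ROTATED_90},
    {"text": "world", "ctm": IDENTITY},
]
groups = group_spans_by_ctm(spans)
```

Joining would then run separately within each group, which fits with rotated text being placed in its own rotated text box.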
 
Copyright 2014 Artifex Software Inc