Difference: PdfToDocx (1 vs. 18)

Revision 18 2020-10-26 - JulianSmith

Line: 1 to 1
 
META TOPICPARENT name="JulianSmith"

Converting pdf to docx.

Changelog

Added:
>
>
  • 2020-10-23: The extract library is now a mupdf git submodule.
 
  • 2020-10-8: Can now copy images into .docx files. We don't currently try to position them similarly to the PDF.
  • 2020-10-5: Improved rotated text.
  • 2020-10-2: We can now put rotated text into a rotated box.
Line: 20 to 21
 
  • Use extract-exe to read this XML and apply various heuristics to extract text as paragraphs, and images, and create an output .docx file.
    • extract-exe -i foo.xml -o foo.docx
Changed:
<
<
Alternatively, if mutool was built with thirdparty/extract/ present, it can generate .docx files directly (this uses the same underlying code as extract-exe):
>
>
Alternatively, if mutool was built with the thirdparty/extract git submodule, it can generate .docx files directly (this uses the same underlying code as extract-exe):
 
  • mutool convert -o foo.docx foo.pdf
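The two routes described here (intermediate XML plus extract-exe, or mutool convert on its own) can be sketched as a small Python wrapper. This is only an illustration: the command lines are copied from this page, but the function names and the intermediate file name are made up for the example, and it assumes gs, mutool and extract-exe are on the PATH.

```python
import subprocess


def conversion_commands(pdf, docx, tool=None):
    """Return the list of command lines needed to convert pdf to docx.

    tool=None   : mutool writes the .docx directly.
    tool="gs"   : gs writes intermediate XML, then extract-exe is run.
    tool="mutool": mutool writes intermediate XML, then extract-exe is run.
    """
    if tool is None:
        return [["mutool", "convert", "-o", docx, pdf]]
    xml = pdf + ".intermediate.xml"  # hypothetical name for the XML file
    if tool == "gs":
        first = ["gs", "-sDEVICE=txtwrite", "-dTextFormat=4", "-o", xml, pdf]
    elif tool == "mutool":
        first = ["mutool", "draw", "-F", "xmltext", "-o", xml, pdf]
    else:
        raise ValueError(f"unknown tool: {tool!r}")
    return [first, ["extract-exe", "-i", xml, "-o", docx]]


def convert(pdf, docx, tool=None):
    """Run each command in turn, stopping at the first failure."""
    for argv in conversion_commands(pdf, docx, tool):
        subprocess.run(argv, check=True)
```

For example, convert("foo.pdf", "foo.docx", tool="gs") would run the gs command followed by extract-exe.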
Line: 31 to 32
 
  • Split the low-level spans where glyphs are not adjacent.
  • Join spans into lines.
  • Join lines into paragraphs.
Changed:
<
<
  • Create a .docx file containing these paragraphs. Rotated text is placed into rotated text boxes. Images are placed approximately in the right page, but currently we don't attempt to place them more accurately than that.
>
>
  • Create a .docx file containing these paragraphs. Rotated text is placed into rotated text boxes. Images are placed approximately in the right place, near paragraphs that were on the same page in the original PDF.
 
Changed:
<
<
This text extraction is similar to ghostpdl:devices/vector/gdevtxtw.c and mupdf's stext.
>
>
This text extraction approach is similar to ghostpdl:devices/vector/gdevtxtw.c and mupdf's stext.
  We only join spans, lines and paragraphs which have the same ctm matrix.
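As a rough illustration of the joining steps, here is a minimal Python sketch. It is not the code used by extract: the Span fields, the thresholds, and the assumption that spans arrive in reading order are all invented for the example, and the initial step of splitting spans at non-adjacent glyphs is left out.

```python
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    x: float      # start of the span along the baseline
    y: float      # baseline position, increasing down the page
    width: float  # distance covered along the baseline
    size: float   # font size
    ctm: tuple    # transformation matrix of the span


def join_spans_into_lines(spans, max_gap=0.5, baseline_tol=0.1):
    """Join consecutive spans that share a ctm and baseline and nearly touch."""
    lines = []
    for span in spans:
        prev = lines[-1][-1] if lines else None
        if (prev is not None
                and prev.ctm == span.ctm
                and abs(prev.y - span.y) <= baseline_tol * span.size
                and span.x - (prev.x + prev.width) <= max_gap * span.size):
            lines[-1].append(span)
        else:
            lines.append([span])
    return lines


def join_lines_into_paragraphs(lines, max_leading=1.5):
    """Join consecutive lines that share a ctm and are closely spaced."""
    paragraphs = []
    for line in lines:
        prev = paragraphs[-1][-1] if paragraphs else None
        if (prev is not None
                and prev[0].ctm == line[0].ctm
                and 0 < line[0].y - prev[0].y <= max_leading * line[0].size):
            paragraphs[-1].append(line)
        else:
            paragraphs.append([line])
    return paragraphs


def line_text(line):
    """Concatenate span text, adding a space where there is a visible gap."""
    parts = [line[0].text]
    for prev, cur in zip(line, line[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > 0.2 * cur.size and not parts[-1].endswith(" "):
            parts.append(" ")
        parts.append(cur.text)
    return "".join(parts)


def paragraph_text(paragraph):
    return " ".join(line_text(line) for line in paragraph)
```

Because the ctm is compared at both levels, rotated text never merges with upright text, even when the two are physically close on the page.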

Revision 17 2020-10-20 - JulianSmith

Line: 1 to 1
 
META TOPICPARENT name="JulianSmith"

Converting pdf to docx.

Changed:
<
<

2020-10-8:

>
>

Changelog

 
Changed:
<
<
Can now copy images into .docx files. We don't currently try to position them similarly to the PDF.
>
>
  • 2020-10-8: Can now copy images into .docx files. We don't currently try to position them similarly to the PDF.
  • 2020-10-5: Improved rotated text.
  • 2020-10-2: We can now put rotated text into a rotated box.
 
Changed:
<
<

2020-10-5:

Improved rotated text. E.g. see:

Compare with Acrobat output for same file:

2020-10-2:

We can now put rotated text into a rotated box.

Directory containing .pdf test files and .docx files with text extracted using gs and mutool to generate intermediate information, then using extract.exe to piece together text spans and write into .docx files:

>
>

Example input/output files:

 
Changed:
<
<
Repository with source for extract.exe, test files etc:
>
>

Usage

 
Changed:
<
<
>
>
  • Use either gs or mutool to output XML containing low-level spans (sequences of glyphs in a single font) and raw image data.
    • gs -sDEVICE=txtwrite -dTextFormat=4 -o foo.xml foo.pdf
    • mutool draw -F xmltext -o foo.xml foo.pdf
  • Use extract-exe to read this XML and apply various heuristics to extract text as paragraphs, and images, and create an output .docx file.
    • extract-exe -i foo.xml -o foo.docx
 
Changed:
<
<
API for extract is:
>
>
Alternatively, if mutool was built with thirdparty/extract/ present, it can generate .docx files directly (this uses the same underlying code as extract-exe):
 
Changed:
<
<
>
>
  • mutool convert -o foo.docx foo.pdf
 
Changed:
<
<

>
>

How it works

 
Added:
>
>
Extracting text works like this:
 
Deleted:
<
<
The general approach here is:

  • Use new xmltext mupdf device to output a PDF file's native low-level spans (sequences of glyphs in a single font) as XML; in future it could also output images.
  • Run a separate external programme that reads this XML and applies various heuristics to extract text as paragraphs which are written into an output .docx file.

The separate external programme works like this:

  • Reads the XML created by the raw device.
 
  • Split the low-level spans where glyphs are not adjacent.
  • Join spans into lines.
  • Join lines into paragraphs.
Changed:
<
<
  • Create a .docx file containing these paragraphs.
>
>
  • Create a .docx file containing these paragraphs. Rotated text is placed into rotated text boxes. Images are placed approximately in the right page, but currently we don't attempt to place them more accurately than that.
 
Changed:
<
<
Text extraction is similar to what ghostpdl:devices/vector/gdevtxtw.c and mupdf's stext device do.
>
>
This text extraction is similar to ghostpdl:devices/vector/gdevtxtw.c and mupdf's stext.
  We only join spans, lines and paragraphs which have the same ctm matrix.
Line: 59 to 41
  Creating a .docx file involves:
Changed:
<
<
  • Using an internal template .docx file.
  • writing paragraphs as XML into the tree's word/document.xml file.
  • zipping up the tree into the new .docx file.
>
>
  • Using an internal template .docx file to create an in-memory file tree.
  • Writing paragraphs as XML into the tree's word/document.xml file.
  • Copying image files into the tree.
  • Creating internal references to these image files.
  • Zipping up the tree to form the generated .docx file.

We currently use no compression when zipping. The resulting .docx files appear to load fine in Word and LibreOffice.
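The .docx assembly steps can be illustrated with Python's standard zipfile module. This is a sketch under stated assumptions, not extract's implementation: the three-part template below is a minimal stand-in for the internal template, the function names are invented, and images and their references are omitted.

```python
import io
import zipfile
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Minimal stand-in for the template tree (an assumption for this sketch).
TEMPLATE = {
    "[Content_Types].xml": (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
        '<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
        '<Default Extension="xml" ContentType="application/xml"/>'
        '<Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
        '</Types>'),
    "_rels/.rels": (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
        '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
        '</Relationships>'),
}


def document_xml(paragraphs):
    """Write each paragraph as a single run inside a w:p element."""
    body = "".join(
        f'<w:p><w:r><w:t xml:space="preserve">{escape(p)}</w:t></w:r></w:p>'
        for p in paragraphs)
    return ('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
            f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>')


def build_docx(paragraphs):
    """Build the in-memory file tree and zip it without compression."""
    tree = dict(TEMPLATE)
    tree["word/document.xml"] = document_xml(paragraphs)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as z:
        for name, content in tree.items():
            z.writestr(name, content)
    return buffer.getvalue()
```

Writing the result of build_docx(["First paragraph.", "Second paragraph."]) to a file with a .docx extension gives a document to try in Word or LibreOffice. ZIP_STORED corresponds to the "no compression" choice mentioned above.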

Repository with source for extract.exe, test files etc.

API for extract.

 

Revision 16 2020-10-08 - JulianSmith

Line: 1 to 1
 
META TOPICPARENT name="JulianSmith"

Converting pdf to docx.

Added:
>
>

2020-10-8:

Can now copy images into .docx files. We don't currently try to position them similarly to the PDF.

 

2020-10-5:

Improved rotated text. E.g. see:

Revision 15 2020-10-05 - JulianSmith

Line: 1 to 1
 
META TOPICPARENT name="JulianSmith"

Converting pdf to docx.

Added:
>
>

2020-10-5:

Improved rotated text. E.g. see:

Compare with Acrobat output for same file:

 

2020-10-2:

We can now put rotated text into a rotated box.

Revision 14 2020-10-02 - JulianSmith

Line: 1 to 1
 
META TOPICPARENT name="JulianSmith"

Converting pdf to docx.

Changed:
<
<

2020-10-1:

>
>

2020-10-2:

 
Changed:
<
<
In-progress results with rotated text in .docx file:
>
>
We can now put rotated text into a rotated box.
 
Changed:
<
<

2020-9-24:

Results using Adobe Acrobat DC:

2020-9-9:

Text extraction works with Python2.pdf and ghostscript. There are still some minor differences in the gs/mupdf output.

Directory containing .docx files with text extracted using gs and mutool to generate intermediate information, then using extract.exe to piece together text spans and write into .docx files:

>
>
Directory containing .pdf test files and .docx files with text extracted using gs and mutool to generate intermediate information, then using extract.exe to piece together text spans and write into .docx files:
 
Deleted:
<
<
Test files:

Results using intermediate information from mutool draw -F xmltext:

Results using intermediate information from gs -sDEVICE=txtwrite -dTextFormat=4:

  Repository with source for extract.exe, test files etc:

Revision 13 2020-09-30 - JulianSmith

Line: 1 to 1
 
META TOPICPARENT name="JulianSmith"

Converting pdf to docx.

Added:
>
>

2020-10-1:

In-progress results with rotated text in .docx file:

 

2020-9-24:

Results using Adobe Acrobat DC:

Revision 12 2020-09-24 - JulianSmith

Line: 1 to 1
 
META TOPICPARENT name="JulianSmith"

Converting pdf to docx.

Added:
>
>

2020-9-24:

Results using Adobe Acrobat DC:

 

2020-9-9:

Text extraction works with Python2.pdf and ghostscript. There are still some minor differences in the gs/mupdf output.

Revision 11 2020-09-09 - JulianSmith

Line: 1 to 1
 
META TOPICPARENT name="JulianSmith"

Converting pdf to docx.

Changed:
<
<

2020-8-25:

>
>

2020-9-9:

 
Changed:
<
<
Text extraction now works with Python2.pdf and ghostscript, using ken's fix to txtwrite device font handling and some extra information gathering. There are still some minor differences in the gs/mupdf output.
>
>
Text extraction works with Python2.pdf and ghostscript. There are still some minor differences in the gs/mupdf output.
 
Changed:
<
<
Directory containing .docx files with text extracted using gs and mutool to generate intermediate information then using extract.exe to piece together text spans and write into a .docx file:
>
>
Directory containing .docx files with text extracted using gs and mutool to generate intermediate information, then using extract.exe to piece together text spans and write into .docx files:
 
Changed:
<
<
Results using intermediate information from mutool draw -F raw:
>
>
Test files:
 
Changed:
<
<
>
>
 
Changed:
<
<
Results using intermediate information from gs -sDEVICE=txtwrite -dTextFormat=4 (using recently-modified txtwrite code):
>
>
Results using intermediate information from mutool draw -F xmltext:
 
Changed:
<
<
>
>
 
Changed:
<
<
Repository with source for extract.exe, test files etc:
>
>
Results using intermediate information from gs -sDEVICE=txtwrite -dTextFormat=4:
 
Changed:
<
<
>
>
 
Changed:
<
<

>
>
Repository with source for extract.exe, test files etc:
 
Changed:
<
<

2020-7-28:

>
>
 
Changed:
<
<
Separate repository with extraction code, test files, reference output, and beginnings of reader for output of gs -sDEVICE=txtwrite:
>
>
API for extract is:
 
Changed:
<
<
>
>
 
Deleted:
<
<

2020-7-14: example conversion:

See:

  The general approach here is:
Changed:
<
<
  • Use a new raw mupdf device to output a PDF file's native low-level spans (sequences of glyphs in a single font) as XML; in future it could also output images.
>
>
  • Use new xmltext mupdf device to output a PDF file's native low-level spans (sequences of glyphs in a single font) as XML; in future it could also output images.
 
  • Run a separate external programme that reads this XML and applies various heuristics to extract text as paragraphs which are written into an output .docx file.
Changed:
<
<
The raw device is implemented in: mupdf:source/fitz/raw-device.c

The separate external programme is implemented in mupdf:source/tools/extract_text.c, and works like this:

>
>
The separate external programme works like this:
 
  • Reads the XML created by the raw device.
  • Split the low-level spans where glyphs are not adjacent.
Line: 69 to 57
  Creating a .docx file involves:
Changed:
<
<
  • unzipping a template .docx file into a directory tree.
>
>
  • Using an internal template .docx file.
 
  • writing paragraphs as XML into the tree's word/document.xml file.
  • zipping up the tree into the new .docx file.
Deleted:
<
<
mupdf:source/tools/stext.c does all of this. It also can use a local copy of mupdf's stext.c to extract text, for comparison.

mupdf:scripts/ptodoc.py builds and runs things.

 

-- Julian Smith - 2020-06-29

Revision 10 2020-08-25 - JulianSmith

Line: 1 to 1
 
META TOPICPARENT name="JulianSmith"

Converting pdf to docx.

Changed:
<
<
--++ 2020-8-25:
>
>

2020-8-25:

  Text extraction now works with Python2.pdf and ghostscript, using ken's fix to txtwrite device font handling and some extra information gathering. There are still some minor differences in the gs/mupdf output.

Revision 9 2020-08-25 - JulianSmith

Line: 1 to 1
 
META TOPICPARENT name="JulianSmith"

Converting pdf to docx.

Changed:
<
<

2020-7-30:

>
>
--++ 2020-8-25:

Text extraction now works with Python2.pdf and ghostscript, using ken's fix to txtwrite device font handling and some extra information gathering. There are still some minor differences in the gs/mupdf output.

  Directory containing .docx files with text extracted using gs and mutool to generate intermediate information then using extract.exe to piece together text spans and write into a .docx file:

Revision 8 2020-07-30 - JulianSmith

Line: 1 to 1
 
META TOPICPARENT name="JulianSmith"

Converting pdf to docx.

Changed:
<
<
2020-7-29:
>
>

2020-7-30:

 
Changed:
<
<
Directory containing .docx files with text extracted using gs and mutool to generate intermediate information:
>
>
Directory containing .docx files with text extracted using gs and mutool to generate intermediate information then using extract.exe to piece together text spans and write into a .docx file:
 
Changed:
<
<
Running .pdf through mutool draw -F raw to extract text spans, then run through extract.exe to process into paragraphs and generate a .docx file:
>
>
Results using intermediate information from mutool draw -F raw:
 
Changed:
<
<
Running .pdf through gs -sDEVICE=txtwrite -dTextFormat=0 (using slightly modified txtwrite code) to extract text spans, then run through extract.exe to process into paragraphs and generate a .docx file:
>
>
Results using intermediate information from gs -sDEVICE=txtwrite -dTextFormat=4 (using recently-modified txtwrite code):
 
Added:
>
>
Repository with source for extract.exe, test files etc:

 
Changed:
<
<
2020-7-28: separate repository with extraction code, test files, reference output, and beginnings of reader for output of gs -sDEVICE=txtwrite:
>
>

2020-7-28:

Separate repository with extraction code, test files, reference output, and beginnings of reader for output of gs -sDEVICE=txtwrite:

 


Changed:
<
<
2020-7-14: example conversion:
>
>

2020-7-14: example conversion:

 

Revision 7 2020-07-29 - JulianSmith

Line: 1 to 1
 
META TOPICPARENT name="JulianSmith"

Converting pdf to docx.

Added:
>
>
2020-7-29:

Directory containing .docx files with text extracted using gs and mutool to generate intermediate information:

Running .pdf through mutool draw -F raw to extract text spans, then run through extract.exe to process into paragraphs and generate a .docx file:

Running .pdf through gs -sDEVICE=txtwrite -dTextFormat=0 (using slightly modified txtwrite code) to extract text spans, then run through extract.exe to process into paragraphs and generate a .docx file:


 2020-7-28: separate repository with extraction code, test files, reference output, and beginnings of reader for output of gs -sDEVICE=txtwrite:

Added:
>
>

 2020-7-14: example conversion:

Revision 6 2020-07-28 - JulianSmith

Line: 1 to 1
 
META TOPICPARENT name="JulianSmith"

Converting pdf to docx.

Added:
>
>
2020-7-28: separate repository with extraction code, test files, reference output, and beginnings of reader for output of gs -sDEVICE=txtwrite:

 2020-7-14: example conversion:

Revision 4 2020-07-20 - JulianSmith

Line: 1 to 1
 
META TOPICPARENT name="JulianSmith"

Converting pdf to docx.

Line: 13 to 13
 
Changed:
<
<
The general approach here is to use a modified mutool stext device to output a PDF file's native low-level sequences of glyphs (in a single font), then apply various heuristics to:
>
>
The general approach here is:
 
Changed:
<
<
  • split glyph sequences into spans where the glyphs are all adjacent.
  • join spans into lines.
  • join lines into paragraphs.
>
>
  • Use a new raw mupdf device to output a PDF file's native low-level spans (sequences of glyphs in a single font) as XML; in future it could also output images.
  • Run a separate external programme that reads this XML and applies various heuristics to extract text as paragraphs which are written into an output .docx file.

The raw device is implemented in: mupdf:source/fitz/raw-device.c

The separate external programme is implemented in mupdf:source/tools/extract_text.c, and works like this:

  • Reads the XML created by the raw device.
  • Split the low-level spans where glyphs are not adjacent.
  • Join spans into lines.
  • Join lines into paragraphs.
  • Create a .docx file containing these paragraphs.

Text extraction is similar to what ghostpdl:devices/vector/gdevtxtw.c and mupdf's stext device do.

  We only join spans, lines and paragraphs which have the same ctm matrix.

We also do some extra processing such as removing spurious spaces which are overlapped by the next glyph, removing '-' at the end of lines etc.
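The two clean-up steps mentioned here might look something like the following Python sketch. The glyph representation (character, x position, advance width), the half-advance overlap threshold, and the function names are assumptions made for the example. The hyphen rule is deliberately naive, so it would also remove a genuine hyphen that happens to end a line.

```python
def drop_overlapped_spaces(glyphs):
    """Remove spaces that the following glyph overlaps.

    glyphs is a list of (char, x, advance) tuples in reading order. A space
    is treated as spurious when the next glyph starts before the midpoint
    of the space.
    """
    kept = []
    for i, (ch, x, advance) in enumerate(glyphs):
        nxt = glyphs[i + 1] if i + 1 < len(glyphs) else None
        if ch == " " and nxt is not None and nxt[1] < x + advance / 2:
            continue
        kept.append((ch, x, advance))
    return kept


def join_line_texts(lines):
    """Join the text of consecutive lines, removing '-' at the end of a line."""
    out = ""
    for text in lines:
        text = text.strip()
        if out.endswith("-"):
            out = out[:-1] + text  # rejoin the two halves of the word
        elif out:
            out += " " + text
        else:
            out = text
    return out
```

For instance, join_line_texts(["The conver-", "sion works"]) gives "The conversion works".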

Changed:
<
<
The text that forms our paragraphs has no layout information and so is suitable for embedding (as XML) directly inside a .docx file (as word/document.xml prior to zipping into the .docx file).
>
>
Creating a .docx file involves:

  • unzipping a template .docx file into a directory tree.
  • writing paragraphs as XML into the tree's word/document.xml file.
  • zipping up the tree into the new .docx file.
  mupdf:source/tools/stext.c does all of this. It also can use a local copy of mupdf's stext.c to extract text, for comparison.

Revision 3 2020-07-14 - JulianSmith

Line: 1 to 1
 
META TOPICPARENT name="JulianSmith"

Converting pdf to docx.

Added:
>
>
2020-7-14: example conversion:

 See:

Changed:
<
<
>
>

The general approach here is to use a modified mutool stext device to output a PDF file's native low-level sequences of glyphs (in a single font), then apply various heuristics to:

  • split glyph sequences into spans where the glyphs are all adjacent.
  • join spans into lines.
  • join lines into paragraphs.

We only join spans, lines and paragraphs which have the same ctm matrix.

 
Changed:
<
<
The general approach here is to run mutool.py draw -F stext to get an XML representation of the lines and blocks of text in a .pdf document, then apply simple heuristics to turn this into 'runs' of text in the same font, which can then be readily turned into something that can be embedded inside a .docx file.
>
>
We also do some extra processing such as removing spurious spaces which are overlapped by the next glyph, removing '-' at the end of lines etc.
 
Changed:
<
<
mupdf:source/tools/ptodoc.c does it in C.
>
>
The text that forms our paragraphs has no layout information and so is suitable for embedding (as XML) directly inside a .docx file (as word/document.xml prior to zipping into the .docx file).
 
Changed:
<
<
mupdf:scripts/ptodoc.py does this in Python, and also runs some tests, including building and running source/tools/ptodoc.c.
>
>
mupdf:source/tools/stext.c does all of this. It also can use a local copy of mupdf's stext.c to extract text, for comparison.
 
Changed:
<
<
Example conversion from ghostpdl:zlib/zlib.3.pdf: https://ghostscript.com/~julian/ptodoc-c.docx
>
>
mupdf:scripts/ptodoc.py builds and runs things.
 

Revision 2 2020-07-06 - JulianSmith

Line: 1 to 1
 
META TOPICPARENT name="JulianSmith"

Converting pdf to docx.

Line: 13 to 13
  mupdf:scripts/ptodoc.py does this in Python, and also runs some tests, including building and running source/tools/ptodoc.c.
Added:
>
>
Example conversion from ghostpdl:zlib/zlib.3.pdf: https://ghostscript.com/~julian/ptodoc-c.docx
 

-- Julian Smith - 2020-06-29

Revision 1 2020-06-29 - JulianSmith

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="JulianSmith"

Converting pdf to docx.

See:

The general approach here is to run mutool.py draw -F stext to get an XML representation of the lines and blocks of text in a .pdf document, then apply simple heuristics to turn this into 'runs' of text in the same font, which can then be readily turned into something that can be embedded inside a .docx file.

mupdf:source/tools/ptodoc.c does it in C.

mupdf:scripts/ptodoc.py does this in Python, and also runs some tests, including building and running source/tools/ptodoc.c.
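The idea of turning characters into 'runs' of text in the same font can be shown with a few lines of Python. This is not taken from ptodoc.py: the input format, a list of (character, font name) pairs in reading order, is a simplification invented for the example rather than the actual stext XML.

```python
from itertools import groupby


def runs_by_font(chars):
    """Group consecutive characters that share a font into runs.

    chars is an iterable of (character, font_name) pairs in reading order.
    Returns a list of (font_name, text) pairs.
    """
    return [
        (font, "".join(ch for ch, _ in group))
        for font, group in groupby(chars, key=lambda item: item[1])
    ]
```

Each run then maps naturally onto one formatted run of text in the output document.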


-- Julian Smith - 2020-06-29

 