Procedure for pasting word-processor content into Page object

Chi · February 10, 2017, 3:36am

I have a set of word-processing documents (in Word) that I need to paste into a RichText field of a Page object. The problem, of course, is preserving the formatting. On top of that, I need to preserve the hyperlinks between endnotes and their respective call-outs in the document proper.

We've had limited success in simply copying/pasting the content from Word to Plone. Note, Plone does a pretty good job when the Word document is a few pages or less. Unfortunately, the documents that I'm looking at are 40 to 400+ pages long, sometimes with many dozens of endnotes. Consequently, initial accuracy is paramount.

I'm just looking for a reliable method for transferring the content with minimal need for correction. Any ideas?

zopyx · February 10, 2017, 3:59am

Plone (or, to be precise, HTML) has no notion of footnotes, call-outs, etc., and copy/paste is the simplest and lamest approach to achieving this goal. Given the complexity and crudeness of the DOCX format, the answer is: there is no simple solution and there is no free solution.

Read what we did in our "Onkopedia" project: https://www.xml-director.info/files/Onkopedia-EN.pdf

There is no free beer in the publishing world. If you have professional requirements then you need a professional approach and a professional tool chain. In our case we partnered with www.c-rex.net for dealing with all the DOCX conversion stuff. And: there is no generic solution that fits all purposes out of the box. What we learned from this project is that you need project-specific DOCX parsing and processing, based on an agreed template and functionality, to get the most out of a DOCX document. In our particular project we are in the nice position of being able to go from DOCX to XML and all related output formats such as EPUB, PDF and HTML, and losslessly back to DOCX.
But as said: no free beer.

Andreas

dieter · February 10, 2017, 10:28am

"Word" might be able to export documents as HTML (as does "libreOffice"). Hopefully, it thereby does a decent job (such that you do not lose much information and do not get errors). After such an export, you could import the HTML documents - maybe via a script or in a view.

Likely, you will need to tweak the HTML filtering of your site so that important parts of the HTML generated for your Word documents are not filtered out. You might also need to remove the "head" element from the generated HTML and put relevant parts into the CSS of your site.
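As a rough illustration of the "remove the head, keep the CSS" step, here is a minimal sketch using only Python's standard library. The names `BodyExtractor` and `split_exported_html` are made up for this example, and it assumes the exported file is reasonably well-formed HTML; real Word output will probably need extra cleanup, and the actual import into Plone is not shown.

```python
from html.parser import HTMLParser


class BodyExtractor(HTMLParser):
    """Collect the markup inside <body> and the CSS inside <style> separately."""

    def __init__(self):
        # Keep entities such as &nbsp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.in_body = False
        self.in_style = False
        self.body_parts = []
        self.css_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag == "style":
            self.in_style = True
        elif self.in_body:
            # Re-emit the start tag exactly as it appeared in the source.
            self.body_parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through once.
        if self.in_body:
            self.body_parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag == "style":
            self.in_style = False
        elif self.in_body:
            self.body_parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.in_style:
            self.css_parts.append(data)
        elif self.in_body:
            self.body_parts.append(data)

    def handle_entityref(self, name):
        if self.in_body:
            self.body_parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.in_body:
            self.body_parts.append(f"&#{name};")


def split_exported_html(html_text):
    """Return (body_markup, css_text) for an exported HTML document."""
    parser = BodyExtractor()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.body_parts), "".join(parser.css_parts).strip()
```

The body markup would be what goes into the RichText field, while the CSS text is what you would review by hand and merge into the site's stylesheet. Comments in the source (Word emits many) are dropped by this sketch.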

espenmn · February 10, 2017, 2:10pm

There are things like pandoc, which might be able to help (maybe some manual work is needed).
http://pandoc.org

I have used it to convert docx to markdown, in which case the endnotes worked
(tables are a problem, though)
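For a batch of documents, the conversion could be scripted along these lines. This is only a sketch: it assumes the `pandoc` executable is installed and on the PATH, and the helper names `build_pandoc_command` and `convert_docx` are invented for the example. The options used (`--from`, `--to`, `--output`) are standard pandoc options.

```python
import shutil
import subprocess


def build_pandoc_command(src, dest, to_format="markdown"):
    """Build the pandoc argument list for converting a DOCX file."""
    return [
        "pandoc",
        "--from=docx",
        f"--to={to_format}",
        f"--output={dest}",
        str(src),
    ]


def convert_docx(src, dest, to_format="markdown"):
    """Run pandoc on one file; raises if pandoc is missing or fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    subprocess.run(build_pandoc_command(src, dest, to_format), check=True)
```

Passing `to_format="html"` instead would produce HTML that could be pasted or imported into the RichText field. How well the endnote links survive would still need to be checked on a real document.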

zopyx · February 10, 2017, 2:27pm

Pandoc is a toy for tinkerers.