Document type w/ MS Office docs' cover page & metadata

petri · May 13, 2016, 6:14am

When uploading MS Office docs, I would like Plone to automatically set the embedded cover page image as the document lead image and copy over Dublin Core document metadata.

I am hoping to accomplish the extraction by first extending openxmllib and perhaps Products.OpenXml.
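As a rough illustration of what that extraction involves: an Office Open XML file is a zip package that normally stores its Dublin Core properties in `docProps/core.xml` and, if the author saved one, a first-page thumbnail referenced from `_rels/.rels`. The sketch below uses only the Python standard library rather than openxmllib. The function names `extract_core_metadata` and `extract_thumbnail` are hypothetical, and whether the saved thumbnail is good enough to serve as a cover image is an assumption that needs checking against real documents.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by docProps/core.xml in Office Open XML packages.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Result key -> element path in core.xml (a subset chosen for illustration).
FIELDS = {
    "title": "dc:title",
    "creator": "dc:creator",
    "subject": "dc:subject",
    "description": "dc:description",
    "keywords": "cp:keywords",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}

RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
THUMBNAIL_REL = (
    "http://schemas.openxmlformats.org/package/2006/"
    "relationships/metadata/thumbnail"
)


def extract_core_metadata(data):
    """Return the core properties found in the package as a plain dict."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        if "docProps/core.xml" not in package.namelist():
            return {}
        root = ET.fromstring(package.read("docProps/core.xml"))
    result = {}
    for name, path in FIELDS.items():
        node = root.find(path, NS)
        if node is not None and node.text:
            result[name] = node.text.strip()
    return result


def extract_thumbnail(data):
    """Return (part name, bytes) of the package thumbnail, or None."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        names = package.namelist()
        if "_rels/.rels" not in names:
            return None
        rels = ET.fromstring(package.read("_rels/.rels"))
        for rel in rels.iter(RELS_NS + "Relationship"):
            if rel.get("Type") == THUMBNAIL_REL:
                target = rel.get("Target", "").lstrip("/")
                if target in names:
                    return target, package.read(target)
    return None
```

Both functions take the raw file bytes, so they could be called with the data of an uploaded file. Documents saved without a thumbnail simply yield `None`.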

Then, to store the document, I am thinking a Dexterity content type (say, "Document") based on plone.app.contenttypes File (with IDublinCore & ILeadImage behaviors enabled) would be sufficient. From there, I assume that I could subscribe to IObjectCreatedEvent to populate the lead (cover) image & metadata fields. At least initially, all the fields with copied-over data would be read-only.

Are there some better ways or pieces I am missing?

It would be nice if one could upload multiple documents at once. I seem to remember that Plone 5 has this feature built in? Even better, I would like this to support bulk file management via WebDAV. However, I found that when uploading files to Plone (4.3) via WebDAV, file names with non-ASCII characters always seem to get rejected. Is that a known problem, or should I blame my WebDAV client (ExpanDrive on OS X)? Are there any existing solutions?

Later, it would of course (?) make sense for a Document type such as described here to have pluggable support for any document type (PDF to start with). If someone with more experience would suggest a design for that, I'll be glad to use it. I can put in an interface & overridable adapter for cover & DC metadata extraction but I am guessing that is not enough. Or perhaps it is, if indexing & mime type support etc. would be provided by other third-party packages, similar to how Products.OpenXml provides those for MS Office XML docs?
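To make the "interface & overridable adapter" idea concrete, here is a minimal sketch in plain Python: a registry of extractor callables keyed by MIME type, where a later registration overrides an earlier one. Everything here is hypothetical (`Extracted`, `register_extractor`, `extract`, and the placeholder PDF extractor). In Plone, the same shape would presumably be expressed with the Zope Component Architecture, for example as named utilities or adapters, instead of a module-level dict.

```python
from typing import Callable, Dict, NamedTuple, Optional


class Extracted(NamedTuple):
    """What an extractor returns: DC-style metadata and optional cover bytes."""
    metadata: Dict[str, str]
    cover: Optional[bytes]


Extractor = Callable[[bytes], Extracted]

# MIME type -> extractor. Registering the same type again overrides it.
_EXTRACTORS: Dict[str, Extractor] = {}


def register_extractor(mime_type: str) -> Callable[[Extractor], Extractor]:
    """Decorator that registers an extractor for one MIME type."""
    def decorator(func: Extractor) -> Extractor:
        _EXTRACTORS[mime_type] = func
        return func
    return decorator


def extract(mime_type: str, data: bytes) -> Extracted:
    """Dispatch to the registered extractor, or return an empty result."""
    extractor = _EXTRACTORS.get(mime_type)
    if extractor is None:
        return Extracted({}, None)
    return extractor(data)


@register_extractor("application/pdf")
def extract_pdf(data: bytes) -> Extracted:
    # Placeholder only: a real extractor would parse the PDF here.
    return Extracted({"format": "application/pdf"}, None)
```

An event subscriber would then only need to call `extract()` with the uploaded file's MIME type and data. Indexing and MIME type registration stay outside this sketch, as in the question above.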

Any comments & suggestions appreciated.

gyst · May 13, 2016, 6:43am

Take a look at collective.documentviewer for the image extraction. We use that extensively in ploneintranet.

espenmn · May 13, 2016, 8:32am

For the upload, you should take a look at:
wildcard.foldercontents: https://pypi.python.org/pypi/wildcard.foldercontents/2.0b4

petri · May 20, 2016, 10:02am

Thanks for your suggestions. I ended up using PyPDF2 to split out just the cover (first) page, and Wand (an ImageMagick wrapper) to convert it to PNG. This works quite nicely and is easy to install, since both are available as Debian packages that pull in all their dependencies.

I hope to be able to release the first version of the package (collective.filemeta) soon.

P.S. Could someone point out which package(s) in Plone produce the nicely formatted (and sometimes even i18n'd) strings for file sizes, i.e. "File size 12.4 MB" or similar?