Mass Adding Files to Plone 5.1.6: Database and Metadata Questions

Hi, everyone!

We're looking at bulk adding a large number (some thousands) of files to our Plone site. They will be primarily PDF files. One concern we have is about database bloat, and another concern is how to manage assigning useful metadata to all of those files (which at present are stored on a simple filesystem).

To attack the first question, I verified that our instances (dev and prod) of Plone have blob storage configured, which they do. I then uploaded 256 PDFs totalling approximately 135 MB through the built-in "Upload" functionality into a folder.

The "var/blobstorage" folder increased from 15 MB to 151 MB, as expected -- pretty much exactly the upload size. So far so good.

However, the var/filestorage folder also increased from 83 MB to 170 MB. That's not the full size of the files (the increase is only 87 MB), but it's a non-trivial amount. Given the PDFs are full-text searchable, is that likely to be just the full-text-search data, or is there something else going on? (If it's the full-text data, that's not so bad -- we really want that functionality, plus many of our documents will be CAD drawings, so they won't be as text-heavy as these were.)

I just want to make sure I'm not missing something about this process.

Also, should I even be worried about the database across this sort of operation, or does Plone just take this all in stride? I don't know how big a deal dropping a few thousand PDFs on a Plone site really is.

Okay, for the metadata part, our ideal scenario would be that we can provide a CSV template that the managers of the departments in question can open in Excel, populate with the relevant data, and hand back to us so we can map it onto the uploaded files. I see that something like that was actually being worked on through GSoC extensions to collective.importexport, but I don't see anything about it being complete. Is there some reasonable way of doing this? It's not something we're going to do every day, so the solution doesn't have to be hyper-elegant, and it can be something IT needs to do (so long as the people providing the data can use Excel or something similarly familiar for that step). But I'm not really at "develop my own Plone add-on" skill level at this time either.

One last database question:

This scenario is an extension of the original concept for the site, and we didn't allocate enough disk for quite this much file storage at the time. If I shut down the site, add a new virtual disk, move the var folder over there, symlink it back to its original location, and start up the site again, should it be more or less okay? Or is it better to modify the buildout.cfg to point at the new paths, re-run the buildout, and then move the folders over? Or something else?

Many thanks for any help you can provide!


Do you need your PDF to be indexable and searchable or would it be sufficient to throw your PDF into a local filesystem and mount it into Plone in order to browse and download the mounted hierarchy?

I believe that having them be indexable and searchable is a significant plus, but not 100% necessary.

Look at

if mounting a filesystem is good enough for you.

Thanks! That's an interesting option. I think it might not be viable for this particular project, since version control is also a requirement for the files, which I assume this connector doesn't provide? In any case, I'll definitely take a look and see how it works for us.

There is no version support and no tight integration with Plone, by design.

Yeah, unfortunately that's likely too big a trade-off for us. But it's a cool add-on nonetheless.

Ideally you run Plone in a client-server setup (ZEO client and server). Move your var folder to a disk where you have enough space, or where you can easily increase the disk resources (if running on a VM). Moving the var folder to a different place should not be an issue. As you recognized, you may see some data bloat, probably from indexing etc.; I would not worry much about it. Packing the ZODB from time to time is the way to go.

Regarding your metadata/CSV question: that is likely something that would have to be coded. If in doubt, get in touch with an experienced Plone developer or integrator.
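Packing itself is scriptable, for what it's worth. Here is a minimal sketch (the script name and the 7-day retention are just placeholders); on a ZEO deployment, bin/zeopack or the "Pack" button in the ZMI control panel are the more usual routes:

```python
"""pack_db.py -- hypothetical sketch, run with: bin/instance run pack_db.py

Packs the ZODB, keeping 7 days of transaction history. Only meant to
illustrate that packing is a routine, scriptable operation.
"""
# 'app' (the Zope application root) is injected by bin/instance run
db = app._p_jar.db()
db.pack(days=7)  # discard object revisions older than 7 days
```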

I wrote collective.importexport. It's designed to work for your use case, but its method of import probably doesn't work in modern Plone versions. There is also plone.importexport, which was written by a GSoC student who is looking for work and would love to finish it (hint hint). We'd love to see it finished but don't have the budget for it right now. It does work for some cases, and it might work for what you are trying to do -- I'm not sure (@Shriyanshagro?). There are also a few other add-ons that allow importing from Excel files etc. YMMV.
Another option is transmogrifier, which is well tested but requires some configuration and learning. There are CSV plugins for it.
Or just write a script, as @zopyx suggests. The Plone API is not that hard -- the coding required might be less than you think. Unfortunately, the authors of both Dexterity and plone.api managed to make it impossible to set properties of a Dexterity object in a TTW Python script, so you'd have to write a script that runs via bin/instance run instead (and handle transactions yourself).
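To give you an idea of what such a script could look like, here's a rough sketch. The site id, folder id, paths and CSV column names are all placeholders you'd adjust -- it's not a drop-in solution:

```python
# -*- coding: utf-8 -*-
"""import_pdfs.py -- hypothetical sketch, run with:

    bin/instance run import_pdfs.py

Reads a CSV (assumed columns: filename,title,description), creates a
File object per row in an existing folder, and attaches the matching
PDF from a directory on disk.
"""
import csv
import os
import transaction

from AccessControl.SecurityManagement import newSecurityManager
from Testing.makerequest import makerequest
from plone import api
from plone.namedfile.file import NamedBlobFile
from zope.component.hooks import setSite

SITE_ID = 'Plone'                   # assumption: your Plone site id
FOLDER_ID = 'documents'             # assumption: existing target folder
SOURCE_DIR = '/path/to/pdfs'        # assumption
CSV_PATH = '/path/to/metadata.csv'  # assumption

app = makerequest(app)  # 'app' is injected by bin/instance run
site = app[SITE_ID]
setSite(site)

# Assumption: an 'admin' user exists in the Zope root acl_users
admin = app.acl_users.getUser('admin')
newSecurityManager(None, admin.__of__(app.acl_users))

folder = site[FOLDER_ID]

with open(CSV_PATH, 'rb') as csvfile:
    for count, row in enumerate(csv.DictReader(csvfile), start=1):
        pdf_path = os.path.join(SOURCE_DIR, row['filename'])
        with open(pdf_path, 'rb') as pdf:
            data = pdf.read()
        obj = api.content.create(
            container=folder,
            type='File',
            title=row['title'].decode('utf-8'),
            description=row.get('description', '').decode('utf-8'),
        )
        obj.file = NamedBlobFile(
            data=data,
            contentType='application/pdf',
            filename=row['filename'].decode('utf-8'),
        )
        obj.reindexObject()
        # Commit in batches so a failure partway doesn't lose everything
        if count % 50 == 0:
            transaction.commit()

transaction.commit()
```

Paired with a CSV template that the department managers fill in, something along those lines is probably the least-effort route if the existing add-ons don't pan out.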


Thank you for the response. :slight_smile: That's exactly what we were thinking: the server is a VM, so we plan to move the var/filestorage and var/blobstorage folders to separate virtual disks that we can expand or shrink as needed. (We could just expand the disk they're on, but this gives us a bit more fine-grained control, I suppose. The infrastructure guy seems happier with that arrangement, at least.)

I'll have to poke at the API a bit more about the rest, but thanks for your help!

I wish I could offer to help with that aspect! All of my suggestions to hire developers on this project so far have been rebuffed, so I've been muddling through it on my own.

Okay, I had come across that one, so I'll investigate it further.

Already I have no idea what running a script via bin/instance means. :slight_smile: But thank you for the information! It gives me something to go on at least.