We're in the process of using Plone as a document management system and are currently migrating 160 GB of files into Plone through the JSON API. The structure is quite simple:
/docs/customers//.
/docs/complaints/.
I've built a Python script to do the upload, but I started noticing that the web service slows down dramatically after a certain number of gigabytes. It's currently running on RHEL 7 with 62 GB of RAM and an Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz, a virtual server in a VMware stack.
I'm a newbie with ZODB but I suspect this is the place I should start looking.
Any advice is appreciated. I have read the Performance documentation, but I don't think it's leading me to a solution.
That's unexpected. Does it slow down permanently (i.e. a restart has no effect)? Do you use the default types (with blob file fields)? What kind of indexing do you do for the files?
Also, how many documents does that make for a single folder? Folders themselves are BTree-based and should scale, but there may be something else (possibly even in the REST API) that slows down with many objects in a single folder.
We've tried the good ol' restart of the service and unfortunately had no luck. We've also tried packing the database, with the same result. I don't think we've got any indexing on this one, since the 'Architect' simply did a vanilla install.
Would indexing be a good starting point?
As for the folders.
The root /docs and /complaints folders have roughly 86,605 subfolders each. Each subfolder has one or two child folders, and each child folder has a file attached.
This is a sample Python POST statement we use to upload to Plone through the JSON API. I've filled in the variables already to show the outcome. We're uploading these massive files locally on the server.
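The original code sample is not shown here, but an upload along these lines can be sketched as follows. The endpoint path, credentials, field names, and payload shape below are assumptions based on common plone.jsonapi.routes conventions, not the poster's actual script:

```python
import base64
import json
from urllib import request

# Hypothetical endpoint -- adjust to your instance and API routing.
API_URL = "http://localhost:8080/Plone/@@API/plone/api/1.0/create"


def build_payload(parent_path, title, filename, file_bytes):
    """Build a JSON payload for a jsonapi-style create call.

    File content travels as base64 inside the JSON body, which is one
    reason large uploads are memory- and CPU-heavy on both ends.
    """
    return {
        "portal_type": "File",
        "parent_path": parent_path,
        "title": title,
        "file": {
            "data": base64.b64encode(file_bytes).decode("ascii"),
            "filename": filename,
        },
    }


def upload(payload, user="admin", password="secret"):
    """POST the payload with basic auth; returns the HTTP response."""
    req = request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    token = base64.b64encode(f"{user}:{password}".encode()).decode("ascii")
    req.add_header("Authorization", f"Basic {token}")
    return request.urlopen(req)


if __name__ == "__main__":
    payload = build_payload("/docs/customers/0001", "Contract",
                            "contract.pdf", b"%PDF-1.4 dummy content")
    print(json.dumps(payload)[:60])
```

The actual route, auth scheme, and field structure depend on the installed API add-on, so treat this purely as an illustration of the upload pattern.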
Could you describe in more detail how the service "slows down"? How much? Does just the scripted upload get slower at some point, or also using Plone through the browser? If only the upload, is it still slow if you change the upload folder?
And if it's the upload, would you know which JSON API implementation you are using: "plone.restapi", "plone.jsonapi", or something else? And which Plone version are you using? (Plone does not ship with a JSON API yet, so it has to be an add-on.)
How big is your ZODB Data.fs? (If everything is as it should be, Data.fs should be relatively small, because all files are stored as blobs in "blobstorage" next to "filestorage", where Data.fs is.)
I'm still trying to understand where the bottleneck really is: ZODB, configuration, the JSON API, or the Plone folder implementation.
One generic issue with that amount of content in Plone is too small a ZODB connection "cache-size". If you can find the "zope.conf" for your Plone, you should be able to find the setting; set it to some very big number and restart the service to see if it has any effect. (You may be able to find connection cache usage stats in the browser, if you find the Zope "Control Panel", either at /manage or /aq_parent/manage depending on your configuration.)
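For reference, the cache-size option lives inside the zodb_db section of zope.conf; the value below is only an illustrative guess, not a recommendation for this workload:

```
<zodb_db main>
    # number of objects each ZODB connection keeps in its in-memory cache;
    # raise it well above the working set (example value only)
    cache-size 1000000
    mount-point /
</zodb_db>
```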
Could you describe in more detail how the service "slows down"? How much? Does just the scripted upload get slower at some point, or also using Plone through the browser? If only the upload, is it still slow if you change the upload folder?
_Yes, you're right. The scripted upload gets slow, but I hadn't thought of trying the web browser when the problem occurred. I've also tried changing the target folder, and the HTTP 200 still takes roughly 1-3 seconds based on the delta time in my records._
And if it's the upload, would you know which JSON API implementation you are using: "plone.restapi", "plone.jsonapi", or something else? And which Plone version are you using? (Plone does not ship with a JSON API yet, so it has to be an add-on.)
I've used the API available in plone.jsonapi.routes.
How big is your ZODB Data.fs? (If everything is as it should be, Data.fs should be relatively small, because all files are stored as blobs in "blobstorage" next to "filestorage", where Data.fs is.)
Data.fs was at 42 GB; the blobstorage folder was at 35 GB.
I'm still trying to understand where the bottleneck really is: ZODB, configuration, the JSON API, or the Plone folder implementation.
One generic issue with that amount of content in Plone is too small a ZODB connection "cache-size". If you can find the "zope.conf" for your Plone, you should be able to find the setting; set it to some very big number and restart the service to see if it has any effect. (You may be able to find connection cache usage stats in the browser, if you find the Zope "Control Panel", either at /manage or /aq_parent/manage depending on your configuration.)
Here's what we're using in the zope.conf for each client, along with the current cache statistics:
Total number of objects in the database: 2981339
Total number of objects in memory from all caches: 4101734
Target number of objects in memory per cache: 3000000
Target memory size per cache in bytes: 0

Total number of objects in each cache:

    Cache Name                     Active objects    Total (active + non-active)
    <Connection at 045dbf50>       1101734           1696866
    <Connection at 7f2cf77eded0>   3000000           3526483
    Total                          4101734
What is the real purpose and usage of Plone here? Do you really need Plone for content management or web content management, or are you misusing Plone as a data sink that is unrelated to CMS functionality?
We intend to use it as a document management system. We've integrated it with another system via the JSON API; the upstream system uses the post, upload, and get APIs.
I think you have a valid use case and this should work as you expect. Using the JSON API for import is also ok.
The second connection exceeds the target number of objects in its cache. This is a hint that some requests wake up more objects than are probably needed.
In order to get better information about what's going on, I'd fill the DB up to the point where the upload slows down, then use Products.ZopeProfiler to collect data on where the time goes, or which calls are repeated many times, to fetch some piece of information needed in the upload process.
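Alongside server-side profiling, it can help to confirm the trend from the client side by recording per-request latency as the database grows. A minimal sketch (the stand-in function below replaces the real HTTP POST, which is an assumption about the script's structure):

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.monotonic()
    result = fn(*args, **kwargs)
    return result, time.monotonic() - start


# Hypothetical usage inside the upload loop: collect the per-upload
# latency so a trend (steady growth vs. one-off spikes) is visible.
durations = []
for _ in range(3):
    # Stand-in workload; in the real script this would be the POST call.
    _, dt = timed(sum, range(10000))
    durations.append(dt)

print(len(durations))  # one latency sample per upload
```

A steadily growing series points at the server slowing down with database size rather than a transient network problem.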
The buildout shows that you have custom content types and a custom JSON API (neither released as open source), so we can only guess at the real issue here.
For me, it is surprising that your Data.fs (which you mentioned is already packed) is larger than your blobstorage, even though your data is mainly blobs (document files). Together with the large number of active objects, I'd guess that you have a lot of indexing for the uploaded documents, and it's possible that the built-in Plone catalog (portal_catalog / ZCatalog) becomes the bottleneck. It should be possible to confirm that with profiling.
Of course, at first you could simply try setting a higher value for cache-size in your zope.conf. (Or, if you can run buildout, that's probably zodb-cache-size in the [client_base] section.)
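In a buildout-managed instance that option would look roughly like this; the section name comes from the suggestion above, and the value is an illustrative example only:

```
[client_base]
# example value only -- size it to the working set and available RAM
zodb-cache-size = 1000000
```

After changing it, re-run buildout and restart the clients so the new zope.conf is generated and picked up.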
If the Plone catalog is the bottleneck, the most complete "drop-in alternative" is Apache Solr with the community-provided collective.solr integration package for Plone.
There are better tools than Plone for document management. Look into Alfresco unless you have serious reasons for using Plone. You can install Alfresco easily and pump tons of data into it without problems over WebDAV, FTP, or its own API, which is much easier than (mis)using Plone as a data sink.
Since your Data.fs is bigger than your blob storage: do your custom content types have 'file' or 'image' field(s) for which you are not using blob storage?
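One way to sanity-check that theory: file fields that are not blob-aware keep their data inside Data.fs, which would explain its size; on top of that, base64 transport over a JSON API inflates every upload request by about a third. A small stdlib illustration of that transport overhead (the 3 MB figure is illustrative, not from the poster's data):

```python
import base64

# base64 maps every 3 input bytes to 4 output characters, so base64-encoded
# file data is about a third larger than the raw file.
raw = b"\x00" * 3_000_000           # a 3 MB stand-in for a document
encoded = base64.b64encode(raw)

print(len(raw), len(encoded))       # 3000000 4000000
print(round(len(encoded) / len(raw), 2))  # 1.33
```

So every upload request carries roughly 4/3 of the file's size through JSON parsing, which adds CPU and memory pressure independent of where the bytes finally land.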