Degrading upload performance using JSON API

Hi,

We're in the process of adopting Plone as a document management system, and we're currently migrating 160GB of files to Plone through the JSON API. The structure is quite simple:
/docs/customers//.
/docs/complaints/.

I've built a Python script to do the upload, but I noticed that the web service starts to slow down dramatically after a certain number of gigabytes. The server is currently allocated the following:

RHEL7 with 62GB of RAM and an Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz, running as a VS instance in a VMware stack.

I'm a newbie with ZODB but I suspect this is the place I should start looking.

Any advice is appreciated. I have read the Performance documentation, but I don't think it's leading me to a solution.

Thank you.

That's unexpected. Does it slow down permanently (restart has no effect)? Do you use the default types (with blob file fields)? What kind of indexing do you do for the files?

Also, how many documents does that make for a single folder? Folders themselves are BTree-based and should scale, but there may be something else (possibly even in the REST API) that slows down with many objects in a single folder.

Hi Asko,

We've tried the good ol restart of the service and unfortunately no luck. We've also tried packing the database and it's the same result. I don't think we've got any indexing on this one since the 'Architect' just simply did a vanilla install.
Would indexing be a good starting point?

As for the folders:
the roots /docs and /complaints have roughly 86605 subfolders. Each subfolder has one or two child folders, and each child folder has a file attached.

e.g.

    /docs/customer/123456
        |----- file.pdf
        |----- file.pdf
    /complaints/SR-123456/
        |--- some-hashed-folder-name-from-MSCRM/hashed-file-name-1.filetype/hashed-file-name-1.filetype
        |--- some-hashed-folder-name-from-MSCRM/hashed-file-name-2.filetype/hashed-file-name-2.filetype
    /complaints/SR-555555/
        |--- some-hashed-folder-name-from-MSCRM/hashed-file-name-1.filetype/hashed-file-name-1.filetype
        |--- some-hashed-folder-name-from-MSCRM/hashed-file-name-2.filetype/hashed-file-name-2.filetype

Thanks for the reply!

This is a sample Python POST payload we use to upload to Plone using the JSON API. I've filled in the variables already to show the outcome. We're uploading these massive files locally on the server.

            post_template = {
                'parent_uid': '',
                'title': '',  # value omitted
                'file': '6868561412-490120-195151-259125125.pdf',
                'parent_path': '/complaints/SR-1234567/6868561412-490120-195151-259125125.pdf',
                'bill_number': '',
                'account_id': '',
                'filename': '<some base64 stuff here>',
                'description': '6868561412-490120-195151-259125125.pdf',
            }
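To put numbers on the slowdown, each upload could be timed and the point where latency degrades could be located afterwards. This is only a minimal sketch written for Python 3: `timed_upload`, `first_slow_index`, the `window` and `factor` defaults, and the `post(url, payload)` callable are all hypothetical names for illustration, not part of plone.jsonapi or of the actual upload script.

```python
import time


def timed_upload(post, url, payload):
    """Call post(url, payload) and return (status, elapsed_seconds).

    `post` is any callable that performs the HTTP request and returns a
    status code (an assumption made so the sketch stays library-agnostic).
    """
    start = time.perf_counter()
    status = post(url, payload)
    return status, time.perf_counter() - start


def first_slow_index(timings, window=50, factor=3.0):
    """Return the index of the first upload at which the mean of the last
    `window` timings exceeds `factor` times the mean of the first `window`
    timings, or None if that never happens."""
    if len(timings) < 2 * window:
        return None
    baseline = sum(timings[:window]) / window
    for end in range(2 * window, len(timings) + 1):
        recent = sum(timings[end - window:end]) / window
        if recent > factor * baseline:
            return end - 1
    return None
```

Logging the running total of uploaded bytes next to each timing would show whether the slowdown tracks the number of objects or the volume of data.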

Thank you

Could you describe in more detail how the service "slows down"? By how much? Does just the scripted upload get slower at some point, or also the use of Plone with a browser? If only the upload, is it still slow if you change the upload folder?

And if it's the upload, would you know, which JSON api implementation you are using? "plone.restapi", "plone.jsonapi", or something else? And which Plone version are you using? (Plone does not ship with JSON api yet, so it has to be an add-on.)

How big is your ZODB Data.fs? (If everything is as it should be, Data.fs should be relatively small, because all files are stored as such in "blobstorage" next to "filestorage", where Data.fs is.)

I'm still trying to understand where the bottleneck really is. Is it ZODB, configuration, the JSON API or the Plone folder implementation?

One generic issue with that many pages in Plone is a ZODB connection "cache-size" that is too small. If you can find "zope.conf" for your Plone, you should be able to find the setting there; set it to some very big number and restart the service to see if it has any effect. (You may be able to find connection cache usage stats with a browser, if you find the Zope "Control Panel", either at /manage or /aq_parent/manage depending on your configuration.)

Could you describe in more detail how the service "slows down"? By how much? Does just the scripted upload get slower at some point, or also the use of Plone with a browser? If only the upload, is it still slow if you change the upload folder?

_---- Yes, you're right. The scripted upload gets slow, but I hadn't thought of trying the web browser when the problem occurred. I've also tried changing the target folder, and the HTTP 200 takes roughly 1-3 seconds based on the delta time in my records._

And if it's the upload, would you know, which JSON api implementation you are using? "plone.restapi", "plone.jsonapi", or something else? And which Plone version are you using? (Plone does not ship with JSON api yet, so it has to be an add-on.)

----- Here are the versions we're running:

Plone 4.3.8 (4312)
CMF 2.2.9
Zope 2.13.23
Python 2.7.5 (default, Feb 11 2014, 07:46:25) [GCC 4.8.2 20140120 (Red Hat 4.8.2-13)]
PIL 3.0.0 (Pillow)

I've used the API available on plone.jsonapi.routes

How big is your ZODB Data.fs? (If everything is as it should be, Data.fs should be relatively small, because all files are stored as such in "blobstorage" next to "filestorage", where Data.fs is.)

Data.fs was at 42GB
BlobStorage folder was at 35GB

I'm still trying to understand where the bottleneck really is. Is it ZODB, configuration, the JSON API or the Plone folder implementation?

One generic issue with that many pages in Plone is a ZODB connection "cache-size" that is too small. If you can find "zope.conf" for your Plone, you should be able to find the setting there; set it to some very big number and restart the service to see if it has any effect. (You may be able to find connection cache usage stats with a browser, if you find the Zope "Control Panel", either at /manage or /aq_parent/manage depending on your configuration.)

Here's what we're using in zope.conf. Each client is using:

    <zodb_db main> cache-size 3000000
    <zeoclient>    cache-size 4096MB

Finally found the password for the ZMI...

Here's what I can see from the Control Panel:

Total number of objects in the database
2981339
Total number of objects in memory from all caches
4101734
Target number of objects in memory per cache
3000000
Target memory size per cache in bytes
0
Total number of objects in each cache:


    Cache Name                      Number of active objects    Total active and non-active objects
    <Connection at 045dbf50>        1101734                     1696866
    <Connection at 7f2cf77eded0>    3000000                     3526483
    Total                           4101734
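For what it's worth, the Control Panel numbers can be checked with a little arithmetic. The sketch below only restates the figures above; `cache_report` and its field names are made up for illustration. A connection sitting exactly at the target may indicate that its cache is full and objects are being evicted, but that is an interpretation, not something the numbers prove.

```python
def cache_report(target, connections):
    """Summarise ZODB connection cache usage.

    `connections` maps a connection name to (active, total) object counts,
    as listed in the Control Panel; `target` is the per-cache object target.
    """
    report = {}
    for name, (active, total) in connections.items():
        report[name] = {
            # Share of the target taken up by active objects.
            "fill_ratio": active / float(target),
            # True when the active count has reached the target.
            "at_target": active >= target,
            # Objects held beyond the target (active plus non-active).
            "over_target_total": max(0, total - target),
        }
    return report


connections = {
    "045dbf50": (1101734, 1696866),
    "7f2cf77eded0": (3000000, 3526483),
}
report = cache_report(3000000, connections)
for name in sorted(report):
    print(name, report[name])
```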

I'm hoping that the buildout.cfg will help you understand how it was put together:

[buildout]
extends =
    base.cfg
    versions.cfg
install-from-cache = true

find-links +=
    http://dist.plone.org/release/4.3.8

effective-user = plone_daemon
buildout-user = plone_buildout
need-sudo = yes

eggs =
    Plone
    Pillow
    dms.contenttypes
    dms.jsonapi

zcml =

develop =

var-dir=${buildout:directory}/var

backups-dir=${buildout:var-dir}

user=admin:blehbleh

deprecation-warnings = off
verbose-security = off

parts =
    zeoserver
    client1
    client2
    client3
    client4
    client5
    backup
    zopepy
    unifiedinstaller
    precompiler
    setpermissions

[zeoserver]
<= zeoserver_base
recipe = plone.recipe.zeoserver
zeo-address = 127.0.0.1:8100

[client1]
<= client_base
recipe = plone.recipe.zope2instance
zeo-address = ${zeoserver:zeo-address}
http-address = 8090

[client2]
<= client_base
recipe = plone.recipe.zope2instance
zeo-address = ${zeoserver:zeo-address}
http-address = 8081

[client3]
<= client_base
recipe = plone.recipe.zope2instance
zeo-address = ${zeoserver:zeo-address}
http-address = 8082

[client4]
<= client_base
recipe = plone.recipe.zope2instance
zeo-address = ${zeoserver:zeo-address}
http-address = 8083

[client5]
<= client_base
recipe = plone.recipe.zope2instance
zeo-address = ${zeoserver:zeo-address}
http-address = 8084

[versions]
zc.buildout = 2.5.0
setuptools = 20.1.1
Pillow = 3.0.0
MarkupSafe = 0.23
Products.DocFinderTab = 1.0.5
bobtemplates.plone = 1.0.1
buildout.sanitycheck = 1.0.2
collective.checkdocs = 0.2
collective.recipe.backup = 3.0.0
mr.bob = 0.1.2
pkginfo = 1.2.1
plone.recipe.unifiedinstaller = 4.3.2
requests = 2.9.1
requests-toolbelt = 0.6.0
twine = 1.6.5
zest.pocompile = 1.4
colorama = 0.3.7

What is the real purpose and usage of Plone here? Do you really need Plone in your context for content management or web content management, or are you misusing Plone as a data sink that is unrelated to CMS functionality?

-aj

We intend to use it for document management. We've integrated it with another system via the JSON API. The upstream system uses the post, upload and get APIs.

I think you have a valid use case and this should work as you expect. Using the JSON API for import is also ok.

The second connection holds more objects than the cache target. This is a hint that some requests wake up more objects than they probably need.

To get better information about what's going on, I'd fill the DB up to the point where the upload slows down and then use Products.ZopeProfiler to see where the upload process spends a long time or many cycles fetching the information it needs.
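To illustrate the kind of output to look for, here is a minimal sketch that uses Python's standard-library `cProfile` instead of Products.ZopeProfiler (a swapped-in technique, not the add-on's API). It is written for Python 3, so on the Python 2.7 that Plone 4.3 runs on, the `io.StringIO` stream would need replacing with `StringIO.StringIO`. `profile_call` is a hypothetical helper name.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report_text).

    The report lists the 15 entries with the highest cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(15)
    return result, stream.getvalue()


def work():
    # Stand-in for the code path under investigation.
    return sum(range(1000))


result, report_text = profile_call(work)
print(report_text)
```

In a profile of a slow upload, entries from the catalog or from indexing code near the top of the cumulative column would support the indexing theory.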

The buildout shows that you have your custom content-types and custom JSON API (neither released as open source), so we can only guess the real issue here.

To me, it is surprising that your Data.fs (which you mentioned has been packed already) is larger than your BlobStorage, even though your data is mainly blobs (document files). Together with the large number of active objects, I'd guess that you have a lot of indexing for the uploaded documents, and it's possible that the built-in Plone catalog (portal_catalog/ZCatalog) becomes the bottleneck. It should be possible to confirm that with profiling.
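The size comparison is easy to repeat as the import progresses. A minimal sketch, assuming Data.fs and the blob directory are reachable as ordinary paths (the `var/filestorage/Data.fs` and `var/blobstorage` locations in the comment are assumptions based on a typical buildout layout); `tree_size` and `storage_ratio` are hypothetical helper names.

```python
import os


def tree_size(path):
    """Return the size in bytes of a file, or of all regular files under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def storage_ratio(data_fs, blob_dir):
    """Return size(Data.fs) / size(blobstorage); infinity if there are no blobs."""
    blob = tree_size(blob_dir)
    return float(tree_size(data_fs)) / blob if blob else float("inf")


# Example call (paths are assumptions, adjust to the real var-dir):
# print(storage_ratio("var/filestorage/Data.fs", "var/blobstorage"))
```

With the sizes reported earlier in the thread (42GB and 35GB), the ratio is 1.2; for a site that stores its files as blobs one would expect it to be well below 1.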

Of course, at first, you could try simply setting higher value for cache-size in your zope.conf. (Or if you can run buildout, that's probably in [client_base] section as zodb-cache-size.)

If the Plone catalog is the bottleneck, the most complete "drop-in alternative" is Apache Solr with the community-provided collective.solr integration package for Plone.

Given that all those PDFs are indexed, the TextIndex is a candidate. It's worth digging deeper here and, if that turns out to be the cause, giving Solr a try.

I would:

  • stop plone
  • make a backup of Data.fs and blobstorage
  • start plone
  • remove textindex from catalog
  • try again and see if it slows down
  • stop plone
  • put the backup back
  • start plone

Or just work in a test instance with a copy of Data.fs/blobstorage. But maybe it is better to use Products.ZopeProfiler and see what happens.
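The "remove textindex from catalog" step above could be scripted roughly like this. It is a sketch under assumptions: it relies only on the ZCatalog methods `indexes()` and `delIndex()`, it assumes the full-text index is the stock `SearchableText` index, and `drop_index` is a hypothetical helper. Only try it on a test copy, since removing the index disables full-text search.

```python
def drop_index(catalog, name="SearchableText"):
    """Remove index `name` from a ZCatalog-like object if it exists.

    Returns True if the index was removed, False if it was not present.
    """
    if name not in catalog.indexes():
        return False
    catalog.delIndex(name)
    return True


# Possible use from a debug prompt such as `bin/client1 debug`, where `app`
# is the Zope root. The site id "Plone" is an assumption:
#
#     import transaction
#     site = app.Plone
#     if drop_index(site.portal_catalog):
#         transaction.commit()
```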

There are better tools than Plone for doing document management. Look into Alfresco unless you have serious reasons for using Plone. You can install Alfresco easily and pump tons of data into Alfresco without problems over WebDAV, FTP or their own API. ...much easier than (mis)using Plone as a data sink.

-aj

Thank you guys. I will try all your suggestions and come back with an outcome.

Slightly unrelated, and just a wild guess.....

Since your Data.fs is bigger than your blob storage:
Do your custom content types have 'file' or 'image' fields for which you are not using blob storage?

I'm not sure, I'll have to get back to you. I have a confirmation from the guy who built it and it's definitely a custom api as I was told.

@mariel can you please report on which change solved your problems and what issue you found was responsible?
