Permanently increasing memory consumption by ZEO Clients

Since we installed ZEO a couple of weeks ago and moved the site with our highest request rate there, we have good performance regarding response times, but we have a big problem with memory consumption by the ZEO clients.

In less than 24 hours they grow up to the maximum of the available memory plus swap space and crash, so that our nightly automatic restart is not sufficient.

We started with 3 clients and 10 GB of memory. In the meantime we have 12 GB of memory, but still the same problem.
Even after I changed the Apache load balancer so that only 2 of the clients get user requests, those two clients' memory consumption still grows permanently.

We have reduced the ZODB cache to 25000 objects per thread, we have switched off the Plone caching, and we have played with different values for cache-size in the zope.conf of the clients. Still the same problem.

We run Plone 5.0.10 (5020)

zope.conf for each client:

<zodb_db main>
# Main database
cache-size 25000

# Blob-enabled ZEOStorage database

<zeoclient>
  read-only false
  read-only-fallback false
  blob-dir /home/users/mlpd/Plone_rfnews_zeo/zeocluster/var/blobstorage
  shared-blob-dir on
  server 127.0.0.1:8080
  storage 1
  name zeostorage
  var /home/users/mlpd/Plone_rfnews_zeo/zeocluster/parts/client3/var
  cache-size 200MB
</zeoclient>
mount-point /

</zodb_db>

Whatever I found in the docs or issue lists did not help me.
It is not the ZEO server that has problems; it is the clients that have this memory problem.

Is there a memory leak in the ZEO Clients code? I could not yet find anything about that.

What I do not really understand from the docs:
https://zope.readthedocs.io/en/latest/zopebook/ZEO.html says:
"ZEO servers do not have any in-memory cache for frequently or recently accessed items."
But it also says about the cache for ZEO Clients: "This cache keeps frequently accessed objects in memory on the ZEO client."
Does this cache influence memory consumption, or, as I understood until now, only disk space usage?

Anyway, different values for its size did not change our problem.

Any help would be much appreciated.

The zeoserver is only a small application and doesn't need a lot of memory. Its only purpose is persistence of all objects and synchronising object state between the zeoclients. The magic/work happens in the zeoclients: the more objects they can keep in memory, the faster they are.

The most important parameter you already found: the maximum number of objects in the cache. This parameter is independent of the 'size' of each object. What happens over time is that different combinations of objects are loaded into the cache, slowly 'ballooning' the cache to its maximum size with objects of different sizes. The memory reserved for the cache will normally only grow, not shrink. Another factor that contributes to growing memory usage is memory fragmentation in the long-running Python process itself.
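The point that a count-based limit ignores object size can be illustrated with a small, self-contained Python sketch (not Plone code; the counts and payload sizes are made up for illustration). Both lists below hold the same number of objects, which is all a setting like cache-size sees, yet their byte footprints differ by orders of magnitude:

```python
import sys

# Two hypothetical "caches", each holding the same NUMBER of objects --
# which is all a count-based limit like cache-size looks at.
small_objects = ["x" for _ in range(1000)]           # tiny payloads
big_objects = ["y" * 10_000 for _ in range(1000)]    # ~10 KB payloads each

small_bytes = sum(sys.getsizeof(o) for o in small_objects)
big_bytes = sum(sys.getsizeof(o) for o in big_objects)

print(f"1000 small objects: ~{small_bytes // 1024} KB")
print(f"1000 big objects:   ~{big_bytes // 1024} KB")
```

Depending on which mix of objects a client happens to serve, a "full" cache of 25000 objects can therefore occupy very different amounts of RAM.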

The second parameter: how many threads are you running each zeoclient with? (the zserver-threads parameter in the recipe). The zeoclient memory cache is multiplied by the number of threads. The default used to be 4 zserver-threads, but this is considered inefficient because of the global interpreter lock in the Python process. You are better off running 4 zeoclients with 2 zserver-threads each than 2 zeoclients with 4.
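As a hedged sketch, in a buildout part using plone.recipe.zope2instance this could look roughly like the following (the part name, addresses, and values are made-up examples; note that newer releases of the recipe call the option threads rather than zserver-threads, so check the recipe documentation for your version):

```ini
[client1]
recipe = plone.recipe.zope2instance
zeo-client = on
zeo-address = 127.0.0.1:8100
http-address = 8081
# older recipe versions: zserver-threads; newer ones: threads
zserver-threads = 2
zodb-cache-size = 25000
```

With 2 threads and a 25000-object cache, each client can hold up to roughly 2 × 25000 objects in memory.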

Some setups even use 1 thread per zeoclient, but then you have to be sure that you don't initiate too many subrequests inside a request; this could cause deadlocks where one thread is waiting for another. If you use 3 or 4 threads, first try going back to 2 threads per Zope process.

The Zope process growing is a normal situation, but it depends on custom code and activity as well. We have a site where we import a lot of content every night (and prune the old content). This site sees a lot more memory growth than other sites, probably because of memory fragmentation.

What can also contribute to faster-than-normal memory consumption increase are programming/custom-code issues, for example accidentally waking up too many objects instead of relying on the catalog, loading blob data into memory, or accidentally loading the output of a whole view into a template variable.
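The "waking up too many objects" point can be made concrete with a stand-in sketch (this is NOT the real Plone API; Brain, getObject, and the loader are simplified stand-ins). A catalog result is a list of lightweight metadata records; each call to getObject() loads the full persistent object into memory:

```python
# Stand-in illustration: a "catalog" stores lightweight metadata records
# ("brains"); getObject() wakes the full object, pulling it into the
# object cache and therefore into process memory.
class Brain:
    def __init__(self, title, loader):
        self.Title = title        # metadata already stored in the catalog
        self._loader = loader

    def getObject(self):          # waking the object costs memory
        return self._loader()

woken = []                        # track how many full objects were loaded

def make_loader(i):
    def load():
        obj = {"title": f"Doc {i}", "body": "x" * 100_000}  # heavy object
        woken.append(obj)
        return obj
    return load

catalog = [Brain(f"Doc {i}", make_loader(i)) for i in range(1000)]

# Cheap: render a listing from catalog metadata only -- nothing is woken.
titles = [brain.Title for brain in catalog]
cheap_wakeups = len(woken)

# Expensive: touching every object loads 1000 heavy objects into memory.
bodies = [brain.getObject()["body"] for brain in catalog]
expensive_wakeups = len(woken)
```

In a view that only needs titles, dates, or URLs, sticking to the brain metadata keeps all those heavy objects out of the cache entirely.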

The zeo-client-cache-size is not the 'memory' cache the zeoclient uses to operate on objects, but an extra intermediate (optionally persistent) disk cache between the zeoserver and each zeoclient, configured on the zeoclient side. I always thought this cache was a bit redundant if your zeoserver is on the same machine or very close to your zeoclients, especially if your zeoclients run on virtual machines and storing something in the disk cache causes another round trip on the SAN. But I found out that this cache, if made persistent, makes starting up a zeoclient much quicker, because the initial state for a lot of objects can still be in this cache and doesn't have to be loaded again from the zeoserver.


If you want detailed knowledge of RAM usage in your custom Python methods, you can do something like this:

import resource

start_ram = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# ... the code you want to measure ...

end_ram = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
delta_ram = end_ram - start_ram

logger.info(delta_ram)
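One caveat with ru_maxrss: it is a high-water mark (peak resident size), so the delta only ever captures growth of the peak and never decreases. To see which lines are doing the allocating, the standard-library tracemalloc module can complement this; a generic sketch, not Plone-specific, where the list comprehension stands in for your real code:

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# ... stand-in for the code you want to measure ...
data = ["x" * 1000 for _ in range(10_000)]

after = tracemalloc.take_snapshot()
top_stats = after.compare_to(before, "lineno")

# The biggest allocation sites, with file/line and byte deltas.
for stat in top_stats[:5]:
    print(stat)

tracemalloc.stop()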

We learnt a lot by doing this, including adding lots of deletes to avoid RAM usage (but also other things than deletes):

del obj
del dicts
del lists
del portal_catalog_results
del mysql_connections

Fred, thanks a lot for these comprehensive explanations.

The number of threads we are running each zeoclient with is 2, if I understand everything correctly.
I can't find a number in the configuration, and I understand it defaults to 2 today (based on https://pypi.org/project/plone.recipe.zope2instance/#advanced-zeo-options; it also seems the parameter is now called threads rather than zserver-threads).
I understand that this number is equivalent to the number of caches in the ZMI Cache control panel, which looks like this in our case:


Is that correct?

Based on your explanation, what I will look into more deeply is the custom code that we have added.

Niels, thanks also for your hints.
I will try to check our custom code.

Hi Peter, just glad we have similar problems and can exchange experience.

You can also inspect the number of cached objects in your custom code using this code snippet (giving you some numbers directly in your custom code, similar to what you see in the control panel):

APP_ = self.context.getPhysicalRoot()
MAIN_DB = APP_.Control_Panel.Database["main"]._getDB()
cachesizebefore = MAIN_DB.cacheSize()

# ... the code you want to inspect ...

cachesizeafter = MAIN_DB.cacheSize()

This will give you details on how your cache of objects grows for the individual zeo client in use.

We have in our code repeated lookups of the same object by the user.
So our target is a zero difference for the repeated lookup.

After we reached the zero difference but still saw growing RAM usage, we understood we also had to look into other reasons for RAM-usage growth.

Interesting, so where do you primarily place these deletes of objects/variables? In my experience most of the work and instantiation of variables happens in the (Browser)Views. These get called, then the templating engine generates the template, dynamically calling methods on the view to provide the data.

You might use intermediate variables/lists/objects to process and generate the data structures consumed in the template; these you 'del' in the Python code in the view methods. But the final data for the template needs to be returned.

Good question.
Our main use case for this is long-running import or export scripts (during migration to Plone 5.2) using External Methods.
We learnt our lessons when iterating over 2.5 million objects, and could still do it without any impact on RAM usage.
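The core of that pattern can be sketched in plain Python (hypothetical helper; in a real Plone script the release step would typically also include a transaction savepoint, but the essential idea is simply not accumulating references across the whole loop):

```python
import gc

def process_in_batches(items, batch_size=1000):
    """Process a large iterable without keeping everything referenced."""
    batch = []
    processed = 0
    for item in items:
        batch.append(item.upper())   # stand-in for the real work per object
        if len(batch) >= batch_size:
            processed += len(batch)
            del batch[:]             # drop references, like the del's above
            gc.collect()             # let Python reclaim the batch now
    processed += len(batch)          # remainder after the last full batch
    return processed
```

Called as process_in_batches((str(i) for i in range(2500))), this touches all 2500 items while never holding more than one batch of references at a time.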

We don't systematically delete in other places, since there we don't see similar memory leaks. (I can see I was not so precise about this in my first message.)

Still, I think there could be situations where it's needed.
The best way to find out is putting RAM and ZODB cache logging into the custom code you want to inspect.

To me it's more likely you have a memory leak in something not related to the ZEO cache.
But to eliminate that possibility, you can use the zodb-cache-size-bytes setting instead of setting the cache size to 25000 objects, in case your objects have very different sizes. You can also flush the cache and see how much memory that frees, to check whether it's cache related.
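As a sketch, in a plone.recipe.zope2instance buildout part that could look like the following (the part name and value are arbitrary examples; the option name is taken from the recipe's advanced ZEO options page linked earlier in this thread, so verify it against your recipe version):

```ini
[client1]
recipe = plone.recipe.zope2instance
zeo-client = on
# Limit the ZODB object cache by bytes instead of (or in addition to)
# the object count, useful when object sizes vary a lot.
zodb-cache-size-bytes = 512MB
```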

You can also get a lot of info about what memory is being used, without custom code, by using the Zope debug view, which shows refcounts of all the objects in RAM for a given thread: https://www.parisozial-mg.de/Control_Panel/Products/OFSP/Help/Debug-Information_Debug.stx. It doesn't tell you which objects are large, though, just which have lots of refs.
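A rough standard-library equivalent of that refcount listing, if you want it in a log instead of the ZMI (generic Python, not Zope-specific: it counts live objects per type in the current process, which, like the debug view, shows what is numerous rather than what is large):

```python
import collections
import gc

def object_count_summary(limit=10):
    """Count live objects per type name, most common first."""
    counts = collections.Counter(
        type(obj).__name__ for obj in gc.get_objects()
    )
    return counts.most_common(limit)

# Print the top object types currently alive in this process.
for name, count in object_count_summary():
    print(f"{name:20s} {count}")
```

Logging this before and after a suspect request handler shows which object types are accumulating.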

We used the zodb-cache-size-bytes option for a while, but noticed that Plone sites using it would show random slowness in responding to requests once they started hitting the memory limit. So we removed it again, set up monitoring of Zope process memory usage instead, and rotate-restart the Zope processes every few days or weekly, depending on the memory usage trend.

Also see this thread on zodb-dev from 10 years ago. https://mail.zope.org/pipermail/zodb-dev/2010-March/013200.html

This is partly low-level object database technology, stable for 10-15 years. Not many devs/admins (still) know exactly what happens in the full stack. And there is a lot of YMMV depending on what you build on top and which objects are in your database. As Hanno wrote 10 years ago, it might be a case of black art.

Maybe zodb-cache-size-bytes has been fixed, improved or optimised; we're on ZODB 5 and Python 3 now. But unless somebody with detailed knowledge who worked on these layers during the last 5-6 years can give some more info, Hanno's post has for me been a red flag to avoid zodb-cache-size-bytes.

I can't find the reference, but I remember seeing something saying that the problem with zodb-cache-size-bytes had been fixed. Some googling hasn't turned up anything. We don't use it in production at the moment, but we were considering it as a way to keep our memory usage more controlled.

Best I could find is

Did you raise a bug about it? It's possible the code that works out what to evict is/was slow when the cache is large.

What we have checked in meantime is

  • our own Python code. We have an External Method that is called every time the homepage is requested. We disabled the functionality that used it, so it is no longer called, but there was no difference in memory usage.
  • we have played with different cache sizes and with flushing the cache manually. But that also had no influence on the increasing memory usage.
    All other customisations we have are theming, some view customisations and page templates.

But all of this was the same before we moved to ZEO. Before ZEO we had a performance problem due to increasing numbers of visitors to the site, but no memory problem.
Now we have good performance, but this memory problem.

So to me it also looks very much like a memory leak in the ZEO client code.

The memory usage curve increases in parallel with the curve of the number of visitors. It seems that with every page hit of a visitor some more memory is allocated but not released later on.
Only restarts reduce memory usage for a while.

The only thing that helps us avoid crashes due to hitting the memory limit is rotating restarts of our 3 ZEO clients every six hours, in addition to the nightly restart of the whole instance.

Can any developer of the ZEO client code have a look into this, or is there any advice on what more to check?

Hi Peter.
I checked in more detail what we changed for our Python 3 version (where we can run four ZEO clients without any memory leaks):

  1. We changed portlets where we had memoize and always did lookups to the site root to find out if a portlet should be shown (now using simpler logic based on the URL of the object). [We think our Plone 4.3 version had very slow restarts because of these same portlets.]
  2. In our code in general we removed nearly all use of aq_parent and nearly all indirect use of aq_parent (for example, methods available from a higher level of the object hierarchy).
  3. Removed nearly all use of portal_catalog.

We have some parallel implementations of the same codebase on our Plone 4.3 system (not migrated to the Python 3 version yet).
Some with a mono-instance and some with ZEO. I think the pattern you describe also happens to us, but with less impact: we can restart the ZEO clients on our 64 GB server every 3-4 days.
