Is auto stop/restart of a site still needed to prevent memory build-up?

For the Plone 3/4 sites I used to have, I had a crontab job that ran periodically to stop/restart the site, because Plone, especially 3, was a memory hog.
Is this practice still considered good practice for Plone 5?

thanks

Do it with any service when you have the need to do so. Any badly written application can cause such behavior; it is not directly tied to the Plone core.

-aj

[C]Python's memory management does not use compaction. Any long-running application with heavy, irregular heap usage (such as Plone) and no memory compaction tends to suffer from memory fragmentation and correspondingly increased memory usage. Things might have improved, but you will likely still not be able to run your Plone processes for years.

2 Likes

I use Supervisor and memmon to restart an instance when it breaks a memory limit, and it notifies me about the restart. I have only one Plone 4 site that restarts every 2 to 5 weeks (must be some add-on). All other sites are memory stable, between 250 and 900 MB per instance depending on configuration. So in practice, restarts with cron are not needed in my opinion.
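For reference, a minimal sketch of that kind of setup in supervisord.conf, assuming superlance's memmon plugin is installed; the program name, path, limit, and e-mail address are just examples:

[program:instance1]
command = /opt/plone/zeocluster/bin/instance1 console

[eventlistener:memmon]
; restart any listed program whose RSS exceeds its limit and send a notification
command = memmon -p instance1=900MB -m ops@example.com
events = TICK_60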

3 Likes

The last time I saw a clear need to restart Plone on a schedule was with the 3.1 series. Since then I have seen very little process memory growth. However, it is probably still good practice to restart (say) weekly, or at least to monitor process size and resource usage.

I'm not an expert on the topic but I agree with @dieter (that was the same explanation @leorochael gave to me some time ago).

We use memmon as well and, in our experience, instances on small sites almost never restart, while instances on large sites can be restarted many times a day.

The problem with big sites is that we need many instances running (because we use only one thread per instance) and we have a limited amount of memory available; in some cases we use 2 GB per instance and 4 instances per machine (which also has Varnish and nginx installed) on a 16 GB VM.

Lately I've been wondering whether it's better to reduce the number of objects in the cache, or even to reduce the number of instances and increase the memory available per instance, to avoid restarts altogether.

1 Like

That's kind of a new topic now?

This really depends. In this scenario the improved caching of RelStorage/memcached and the better speed of RelStorage with PostgreSQL may help, and can also reduce the memory footprint a lot. Then it might be a good idea not to place all services on one machine, but to segment them: nginx/Varnish/HAProxy on one machine, Plone and memcached on another, and PostgreSQL plus NFS on a third. We typically share the "front" machine and the "db" machine among several sites, while each "rendering" machine gets its own.
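To illustrate, a minimal zope.conf sketch of such a RelStorage setup, assuming RelStorage 1.x/2.x (the memcached-related option shown was removed in RelStorage 3); host names, credentials, and sizes are placeholders:

%import relstorage

<zodb_db main>
    mount-point /
    cache-size 30000
    <relstorage>
        # shared external cache for all instances (RelStorage 1.x/2.x only)
        cache-servers cache.internal:11211
        <postgresql>
            dsn dbname='plone' host='db.internal' user='plone' password='secret'
        </postgresql>
    </relstorage>
</zodb_db>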

1 Like

We replaced that approach because it led to a lot of administrative overhead; we're now using only 2 machines and almost everything is redundant: we have one VM with nginx, Varnish, Plone and Zope running in ZRS master configuration, and another VM with the same stack and Zope running in ZRS slave configuration.

We don't use memcached because I don't know it, but I have configured Varnish with the hash director to achieve more or less the same result: every instance handles a different set of objects and the memory is better used.
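For illustration only, a minimal VCL sketch of that kind of URL hashing, assuming Varnish 4.1's vmod_directors; the backend names and ports are examples:

vcl 4.0;
import directors;

backend instance1 { .host = "127.0.0.1"; .port = "8081"; }
backend instance2 { .host = "127.0.0.1"; .port = "8082"; }

sub vcl_init {
    # hash director: the same URL always goes to the same backend,
    # so each instance's ZODB cache holds a mostly disjoint set of objects
    new zope = directors.hash();
    zope.add_backend(instance1, 1.0);
    zope.add_backend(instance2, 1.0);
}

sub vcl_recv {
    set req.backend_hint = zope.backend(req.url);
}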

We don't have a load balancer in front of this; instead we use CloudFlare DNS as a balancer, so we currently have only a single point of failure there and we only need to manage 2 machines instead of 6.

We were inspired by @gyst's talk on High-performance high-availability Plone.

2 Likes

Using instances with a single thread is a good approach if you run on a multi-processor platform, your application is not IO-dominated (but computation-dominated), and you want to use the various processors for your request processing. This works around the Python limitation that within one process only one thread can execute Python code at a time. It is not the optimal approach if memory is a limiting factor. A client of mine uses a setup with 6 worker threads per instance and is quite satisfied (his application interacts heavily with Postgres, i.e. it is likely relatively IO-bound).
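For comparison, a sketch of how both styles look in a buildout with plone.recipe.zope2instance; the part names, ports, and cache sizes are only examples:

[instance1]
recipe = plone.recipe.zope2instance
http-address = 8081
# single-threaded worker: add more parts like this one to use more processors
zserver-threads = 1
zodb-cache-size = 30000

[instance2]
recipe = plone.recipe.zope2instance
http-address = 8082
# multi-threaded worker: fewer processes, so less total ZODB cache memory
zserver-threads = 6
zodb-cache-size = 30000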

2 Likes

The (ZMI) "Control_Panel" contains a view which gives hints about ZODB cache usage. If you see very few ZODB reads there, you may consider reducing the cache size.

1 Like

I would suggest packing the Zope Data.fs and flushing the cache on a schedule, rather than restarting the application with a cron job as you said. That way the memory size will be reduced and all the historical data deleted (Instance/bin/zopepack).

Exactly; all of our machines have more than one processor, and my main concern here is having a lot of instances ready to respond to multiple requests concurrently.

Our biggest sites are mainly media and government sites, and most of the traffic that hits the backend is generated by robots re-indexing old content; I have to look later into why they are not obeying the metadata in the sitemaps, but that's a different story.

On smaller sites we use only one instance running with 2 threads; they run at around 384 MB of memory with almost no restarts, using Plone 4.3.

Yes, that's exactly my approach, but unfortunately I have no easy way to monitor that; we implemented monitoring with New Relic at different levels, and we are pretty satisfied, but I don't know whether it is possible to add the Zope counters to it easily.

1 Like

You don't want to pack the DB too frequently, or you lose the ability to undo things; we pack once a week, usually early on Saturday.
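As an example, a weekly pack could be a single crontab entry like the one below; it assumes ZEO's zeopack utility (a buildout-generated bin/zeopack may already have these values baked in), and the path, host, port, and retention are placeholders:

# pack the ZODB every Saturday at 05:00, keeping 7 days of undo history
0 5 * * 6 /opt/plone/zeocluster/bin/zeopack -h 127.0.0.1 -p 8100 -d 7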

Your approach of flushing the cache could be interesting if it were handled in a way similar to what memmon does: do it when a memory limit is reached, not on a time basis. That would save some time/resources by not having to restart the instance, and it could be a good idea for a Superlance plugin.
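As a rough sketch of what such a plugin could call, Zope 2 exposes the ZODB cache flush in the ZMI; the credentials, port, and database name below are assumptions:

# flush (minimize) the ZODB connection caches of one instance via the ZMI
$ curl -u admin:secret "http://127.0.0.1:8081/Control_Panel/Database/main/manage_minimize"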

1 Like

Let me just add to this thread that restarts can be an incredibly expensive operation for sites that have complex pages. Reloading enough objects to render the first page requested is probably the most time-consuming request any site handles. So, I generally prefer a strategy where restarts are rare.

When I have to restart a client (on a site with complex pages), I take it out of the load balancer's back end, restart it, request one or more key pages, and then return it to the load balancer. You can see a sample script in Plone/ansible-playbook: https://github.com/plone/ansible-playbook/blob/master/roles/restart_script/templates/restart_clients.sh.j2

4 Likes

Zope already has a memory limit option which garbage collects the cache when it hits a limit.

What would be nice is to have garbage collection happen even before you reach the limit, so that low-traffic sites can drop less frequently used objects from their cache and save memory, especially if it were more aggressive towards larger items.
Also keep in mind that the ZODB cache is not the only thing in memory: RAM caching, memoize, and even global variables and module imports all take up RAM and are unaffected by ZODB cache flushing.
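For reference, a sketch of the two cache limits as they appear in the zodb_db section of zope.conf (or via the corresponding zodb-cache-size options of plone.recipe.zope2instance); the storage definition is omitted and the numbers are examples:

<zodb_db main>
    # classic limit: target number of objects per connection cache
    cache-size 30000
    # byte-based limit (ZODB >= 3.9), estimated from pickle sizes
    cache-size-bytes 512MB
    mount-point /
</zodb_db>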

That script is pretty interesting; I spent some time trying to figure out how to achieve the same with my current configuration. It seems you can control the status of a backend in Varnish with something like this:

$ sudo varnishadm backend.list
Backend name                   Refs   Admin      Probe
instance1(127.0.0.1,,8081)     1      probe      Healthy 8/8
instance2(127.0.0.1,,8082)     1      probe      Healthy 8/8
instance3(127.0.0.1,,8083)     1      probe      Healthy 8/8
$ sudo varnishadm backend.set_health instance1 sick

$ sudo varnishadm backend.list
Backend name                   Refs   Admin      Probe
instance1(127.0.0.1,,8081)     1      sick       Healthy 8/8
instance2(127.0.0.1,,8082)     1      probe      Healthy 8/8
instance3(127.0.0.1,,8083)     1      probe      Healthy 8/8
$ sudo varnishadm backend.set_health instance1 auto

$ sudo varnishadm backend.list
Backend name                   Refs   Admin      Probe
instance1(127.0.0.1,,8081)     1      probe      Healthy 8/8
instance2(127.0.0.1,,8082)     1      probe      Healthy 8/8
instance3(127.0.0.1,,8083)     1      probe      Healthy 8/8

Now I need to figure out how to implement this in a script, as Varnish and Plone run under different users; but it seems it's only a matter of changing the access permissions on the secret file.
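A rough sketch of what such a script could look like, assuming Supervisor manages the instances and the varnishadm secret is readable by the calling user; the names, paths, and warm-up URL are placeholders:

#!/bin/sh
# drain one backend from Varnish, restart it, warm it up, put it back
NAME=instance1
URL=http://127.0.0.1:8081/Plone/

sudo varnishadm backend.set_health "$NAME" sick
sleep 30                                   # let in-flight requests finish
bin/supervisorctl restart "$NAME"
until curl -sf -o /dev/null "$URL"; do     # first request reloads the objects
    sleep 5
done
sudo varnishadm backend.set_health "$NAME" auto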

1 Like

Are you talking about zodb-cache-size or something else that I don't know about? zodb-cache-size doesn't work in the long run because of what @dieter explained earlier in the thread: Python's memory management and lack of compaction.

Memory consumption is asymptotic: depending on the number of objects, their size, and the total memory available, sooner or later you could end up with a memory problem.

The only way to control that is to reduce the number of objects in the ZODB cache.

Yes, good reminder, I had forgotten that; I found this old article on Prevent and Debug Memory Leaks and suddenly realized why I sometimes see huge lists of object IDs on the request.

1 Like

In fact, controlling the number of objects in the ZODB cache gives less control than "zodb-cache-size", as the size of individual objects varies hugely. Should you have a badly designed component, you might get huge persistent objects alongside small ones.

Thus, controlling the number of objects affects neither fragmentation nor memory usage in a controlled way; it gives less control than using "zodb-cache-size".

"zodb-cache-size", too, has a problem. It uses the object's pickle size as an approximation for the object's memory size. However, the memory size can be much larger than the pickle size. Thus, control via "zodb-cache-size" is also not precise (but better than via the number of objects).

1 Like

One more caveat of note: load balancers will generally not re-dispatch an existing, in-progress request (because a repeated request is not guaranteed to be idempotent).

HAProxy, for example, will serve a 503 error rather than re-dispatch an existing request; only new requests are re-dispatched (for users otherwise on sticky sessions). Take a form post, for example: submit your form at just the moment memmon is restarting the instance your session is stuck to, and the average user loses their data/time/work. I had to work around this in my own forms application by implementing a "form save manager" in browser JS with AJAX posts and a retry-until-success UI.

Sean

Just an example illustrating what just happened on one of our sites: as you may already know, we're in the middle of a huge political crisis here in Brazil; yesterday there was an event that caused a 10x spike in the number of visitors and, at one point in the night, we had 10,688 concurrent visitors.

The servers, two DigitalOcean Ubuntu-based VMs with 4 processors and 8 GB of RAM, running 4 Plone instances behind Varnish and nginx, and behind CloudFlare, were resisting stoically against all odds… and then, it happened:

I held my breath… and the site remained up.

4 Likes