Slow restart of a big site with ZEO and four instances. Repeated restarts are fast. A restart after 24+ hours is slow. Reason?

We have a 50 GB Plone 4.3 site, ZEO with 4 clients.
The time from restart until the site is available varies from 10 to 40 minutes.
A repeated restart is fast ("repeated" meaning a restart within roughly 12-24 hours of the previous one).

A restart after more than 24 hours is slow.
The pattern is not fully clear to us.

We can see each client building a RAM cache of 5-10 GB before the site becomes available.
Building up that RAM is very fast when we restart frequently.

We are used to this pattern, so it is not a big issue.
But it would be nice to understand what determines a fast or slow restart.

What do you mean by slow? The startup phase until "Ready to serve requests", or the time to serve the first request from a cold ZEO client? :thinking:

A larger university site that I am currently working on needs about 10-15 seconds - cached or uncached.
As part of a migration we moved parts of the site to NFS (due to IT constraints) and the startup time went up to 2-3 minutes, but I have never seen anything as slow as your observation of 10-40 minutes.
That seems weird.

It is minutes until a request shows a result in the browser. (We don't restart the ZEO server.)
We see this pattern on more than one site (faster or slower restart depending on the time since the previous restart).
If we go to manage_main (Zope without Plone, from the browser) it responds almost instantly.
Our guess is that it could be related to caching of portlets, but we have no clear idea.
And we are especially unsure why the time between restarts would have an influence.

We have a virtual setup and a snapshot of our production system, so we can experiment with settings.
Let us know if you have any idea what to look into.
For example, we could change buildout settings if you think they could have an impact.

extends = production.cfg

parts +=
    zeo
    client1
    client2
    client3
    client4

[zeo]
recipe = plone.recipe.zeoserver
zeo-address = 8100
effective-user = zitelab
blob-storage = ${buildout:directory}/var/blobstorage

[client1]
<= instance

zodb-cache-size = 500000
zodb-cache-size-bytes = 3000MB
zeo-client-cache-size = 0MB
zserver-threads = 2
zeo-client = true
zeo-address = ${zeo:zeo-address}
http-address = 9673
blob-storage = ${zeo:blob-storage}
shared-blob = on

# Put the log, pid and lock files in var/client1

event-log = ${buildout:directory}/var/client1/event.log
z2-log = ${buildout:directory}/var/client1/Z2.log
pid-file = ${buildout:directory}/var/client1/client1.pid
lock-file = ${buildout:directory}/var/client1/client1.lock

One possibility is not having a queryplan set, but that doesn't explain why restarts are quick some of the time. We've seen very large RAM use and long waits on the first complex query when the queryplan is not optimised and preloaded. The process of putting a queryplan in place for when Plone starts up is a bit painful. I've created collective.autoqueryplan to help with this, but I'm still in the process of testing it.
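
Roughly, the manual route that collective.autoqueryplan tries to automate looks like this (from memory, so verify the method name and the environment variable against your Products.ZCatalog version; the site id and file names are just examples):

# dump_queryplan.py -- run on a client that has served traffic for a while,
# e.g. with: bin/client1 run dump_queryplan.py ("app" is injected by zopectl run).
site = app.Plone                                 # assumption: the site id is "Plone"
plan = site.portal_catalog.getCatalogPlan()      # textual dump of the learned query plan
with open("queryplan.py", "w") as f:
    f.write(plan)

Ship the dumped plan inside a package and point the clients at it, for example via the instance recipe's environment-vars option with ZCATALOGQUERYPLAN set to the dotted name of that module, so the plan is loaded at startup instead of being re-learned on the first big query.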

@nielssteenkrogh Are you aware of the ZEO client cache and how it syncs up with the ZEO server when it is stale, by checking the last transaction ids? If the two are too far out of sync (too many changed objects), the ZEO client cache is not synced but dropped. There are some settings for this, but I don't remember them exactly.

What you describe - a quick startup if the previous shutdown was recent (within a day) - could be a symptom of this.
A first startup after a longer gap can then take a long time, because all objects have to come from the ZEO server again.

No ... we were not aware of this. It could be the reason. What we observe is the same level of RAM as the starting point after a restart. So, as you describe, maybe the objects have to come in a different way when the last restart was a long time ago. If you have any pointers to which buildout settings affect this, please let me know.

When I used persistent ZEO client caches, I always used the zeo-client-drop-cache-rather-verify option, as my ZEO server was either on the same machine or one physically above/below it - i.e. a fast enough network.
Later I switched to non-persistent ZEO client caches, so I never ran into the verification process again.
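
In buildout terms the relevant knobs in plone.recipe.zope2instance look roughly like this (from memory, so double-check the option names against the recipe version you use):

[client1]
<= instance
# Size of the on-disk ZEO client cache:
zeo-client-cache-size = 1000MB
# Giving the cache a name makes it persistent across restarts:
zeo-client-client = client1
# Drop a stale persistent cache instead of verifying it object by object:
zeo-client-drop-cache-rather-verify = on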

:+1:

Check if the index is OK. It also happens here in one instance, and I suspect the problem is in the ZEO index. Are you using extensions related to filesystem access?

This answer, and the other answers, are very much appreciated. If you can also add a few words on how we can check your idea (to verify whether your suggested cause is correct), it is much easier for us to report back.
We are of course looking into all answers internally, and we will have an internal sprint in 10 days where we try the inspections and changes you suggest - so we can try things out and post a structured document here with our investigations and results.
Of course all answers are welcome, also without all the details from you.
Thanks a lot!

I don't know the solution, but long startup times happen when you have a big Data.fs file and you stop the ZEO server, delete Data.fs.index and start ZEO again. ZEO will recreate the Data.fs.index, and for large databases that takes a long time.

http://plone.293351.n2.nabble.com/Data-fs-index-doesn-t-get-updated-on-shutdown-td7560333.html

Mmm, without knowing much, I would put some logger.warn calls into

ZODB.FileStorage.FileStorage.__init__
ZODB.FileStorage.FileStorage.close
ZODB.FileStorage.FileStorage._save_index

Or use a debug tool to see what ZEO is doing at startup.
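
An untested sketch of the monkey-patch idea, for one of the three methods (it has to be imported in the ZEO server process somehow, e.g. from a small wrapper module, and the attribute names should be checked against your ZODB version):

import logging
import time
from ZODB.FileStorage import FileStorage

logger = logging.getLogger("fsindex-timing")

_orig_save_index = FileStorage._save_index

def _timed_save_index(self):
    # Log how long writing Data.fs.index takes.
    start = time.time()
    result = _orig_save_index(self)
    logger.warning("_save_index for %s took %.1fs",
                   self._file_name, time.time() - start)
    return result

FileStorage._save_index = _timed_save_index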

Thanks for the details.
We don't restart ZEO and we are not deleting the index.
But maybe some "similar" process is happening as if we did - we will investigate using your logger.warn idea when we "sprint".

@nielssteenkrogh Another cause of ZEO clients taking a long time to start up is a very large/fragmented portal_catalog. It is stored in BTrees, and these can become unbalanced over time as new content is added.

Hanno Schlichting & Helge Tesdal wrote a script which rebalances the BTrees in all the portal_catalog indexes. There is a bit of heuristics in there and an empirical balancing parameter. Use at your own risk, but we have tested it on some sites and afterwards also in production, where the startup time of the ZEO clients (up to delivering their first response) went down considerably (>50%) after running it.

https://raw.githubusercontent.com/hannosch/scripts/master/catalogoptimize.py
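
If I remember correctly, the script is meant to be run through the zopectl "run" command of one of the clients, something like this (the client name is just an example):

bin/client1 run catalogoptimize.py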

The script seems safe to run on a live ZODB; it uses subtransactions to commit the changes. I did see some ConflictErrors when running it on a live site; I don't know whether it then skips the whole index or retries later.

What I don't understand is what this rebalancing script brings compared to a full "Clear and rebuild" of the whole portal_catalog. It seems to look at the current state of an index before it rebalances it, while a "Clear and rebuild" drops the indexes and fills them up again by walking over all content in the site - which might end up just as unbalanced as the indexes that grew organically.

If anybody more knowledgeable about the portal_catalog can shed some light on this... :slight_smile:

Although it speeds up startup time, it doesn't really match your symptom of only having slow ZEO client startups after being offline for more than 24 hours; that points more towards the ZEO client cache. Then again, a huge portal_catalog might also get synced into the ZEO client cache, so shrinking this data structure could be beneficial there as well.

Hi, we are now in Python 3 land with a Plone 5.2 Dexterity site migrated from the old Plone 4.3 Archetypes one.
The slow restart is GONE! (and it is a 20 GB Data.fs!)
We are running WSGI (no ZEO) and changed the portlets to use different logic, so no memoize is needed.
Restart time now does not depend on the size of the ZODB.
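
For reference, the kind of pattern we mean by memoized portlets is roughly this (a generic plone.memoize example, not our actual code):

from time import time

from plone.app.portlets.portlets import base
from plone.memoize import ram
from Products.Five.browser.pagetemplatefile import ViewPageTemplateFile


def _ten_minute_key(method, self):
    # Cache key: one entry per renderer class, refreshed every 10 minutes.
    return (self.__class__.__name__, time() // 600)


class Renderer(base.Renderer):
    _template = ViewPageTemplateFile("portlet.pt")

    @ram.cache(_ten_minute_key)
    def render(self):
        return self._template()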

We have eight WSGI forks; in zope.conf we use these parameters:
# Main database
cache-size 5000000
cache-size-bytes 3000MB

The high values are needed there since the WSGI forks each take their part (this is different from ZEO clients, where each client gets the full cache defined by those parameters).
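
With eight forks that means a worst case of roughly 8 x 3000 MB = 24 GB of ZODB cache. For context, these values sit in the main database section of the generated zope.conf, something like this (paths are placeholders):

<zodb_db main>
    # Main database
    cache-size 5000000
    cache-size-bytes 3000MB
    <blobstorage>
      blob-dir /path/to/var/blobstorage
      <filestorage>
        path /path/to/var/filestorage/Data.fs
      </filestorage>
    </blobstorage>
    mount-point /
</zodb_db>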

So right now we have none of the earlier problems with slow restarts.
Fantastic!

For our original problem we never found a solution in Plone 4.3, so we kept things running there without changes. During the migration we had a strong focus on the things we guessed could have an impact.
The migration itself was done with a MySQL export from the ZODB, then external methods creating the content in the new Data.fs while loading the data from MySQL.
The users we migrated with JSON as the export format (so users could log in without problems on the new site after the migration).
All exports and imports were done with custom external methods.
We migrated about 1.5 million objects originating from Archetypes.
We chose this approach so we would get a fully clean start, after running the same Data.fs since Plone 2 with many migrations along the way.
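
A rough sketch of the spirit of such an import external method (not our actual code; the pymysql driver, table and column names, and the target folder are just for illustration):

# Extensions/import_documents.py -- hooked up as an External Method in the ZMI.
import pymysql
import transaction
from plone import api
from plone.app.textfield.value import RichTextValue


def import_documents(self):
    # "self" is the Plone site (or folder) the External Method is acquired from.
    conn = pymysql.connect(host="localhost", user="plone",
                           password="secret", database="export")
    folder = self.documents                      # assumed target folder id
    with conn.cursor() as cursor:
        cursor.execute("SELECT id, title, body FROM documents")
        for count, (doc_id, title, body) in enumerate(cursor, start=1):
            obj = api.content.create(container=folder, type="Document",
                                     id=doc_id, title=title)
            obj.text = RichTextValue(body, "text/html", "text/x-html-safe")
            if count % 1000 == 0:
                # Commit in batches so the ZODB cache and memory stay bounded.
                transaction.commit()
    transaction.commit()
    conn.close()
    return "done"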

Our learning continued. We ended up adding ZEO again, so we are now running a combination of WSGI and ZEO (four ZEO clients).
So we also changed our main database cache settings:
# Main database
cache-size 500000
cache-size-bytes 300MB

The restarts are still just as fast, even though we have about 1.5 million Dexterity objects.
So we start the new year without any of our old problems with slow restarts.
