Alternatives to health-checking Zope instances via ICP in the load balancer

@frisi the supervisor eventlistener reacts only to status changes of the supervisor programs. It does not really wait for startsecs, but just notifies HAProxy when the supervisor program status changes from STARTING to RUNNING.
So I think it could be combined with another supervisor eventlistener which knows how to verify that an instance is responsive and changes the supervisor status accordingly.
But maybe this is just too much indirection.
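To make the idea concrete, here is a minimal sketch of such an eventlistener, assuming only the standard supervisor eventlistener protocol (READY on stdout, header + payload on stdin, RESULT back). The responsiveness probe shown (a bare TCP connect to an assumed instance port) and all names here are illustrative, not part of any existing package:

```python
import socket
import sys


def parse_headers(line):
    """Parse a supervisor event header line such as
    'ver:3.0 eventname:PROCESS_STATE_RUNNING len:54' into a dict."""
    return dict(item.split(':', 1) for item in line.split())


def instance_responds(host='127.0.0.1', port=8080, timeout=2.0):
    """Hypothetical responsiveness probe: can we open a TCP connection
    to the instance's HTTP port?  A real probe might send a request."""
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False


def main(stdin=sys.stdin, stdout=sys.stdout):
    while True:
        # Tell supervisor we are ready for the next event.
        stdout.write('READY\n')
        stdout.flush()
        headers = parse_headers(stdin.readline().strip())
        stdin.read(int(headers['len']))  # consume the event payload
        if headers.get('eventname') == 'PROCESS_STATE_RUNNING':
            # Here one would verify the instance really answers before
            # notifying HAProxy (or changing state) -- the missing piece
            # discussed above.
            instance_responds()
        # Acknowledge the event ("OK" is 2 bytes, hence RESULT 2).
        stdout.write('RESULT 2\nOK')
        stdout.flush()


if __name__ == '__main__':
    main()
```

Such a listener would be wired up in supervisord.conf with an `[eventlistener:x]` section subscribing to `PROCESS_STATE` events.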

Here is the second part, including my haproxy.cfg for free:

I was not aware of that until you mentioned it; that could be indeed a problem but I can't test it right now.

Anyway, the ok probe could easily be modified to support that as well.

Thanks for the second post @hvelarde! I will try this out and post feedback if I have suggestions to improve it further.

@djay I created two probes for health checks in collective.monitor: https://github.com/collective/collective.monitor/pull/3
One simply returns OK; the other can check whether a database (default=main) is connected.
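For readers who haven't seen such probes: they are small functions writing a reply to the monitor connection. The sketch below is illustrative only (the real probes live in the collective.monitor PR above); it assumes databases are registered as `ZODB.interfaces.IDatabase` utilities, the convention zc.z3monitor-style probes rely on, and that a ZEO ClientStorage exposes `is_connected()`:

```python
def storage_connected(storage):
    """True if the storage reports a live connection.  ZEO's
    ClientStorage has is_connected(); a local FileStorage does not,
    so a storage without the method is treated as connected."""
    check = getattr(storage, 'is_connected', None)
    return True if check is None else bool(check())


def health_ok(connection):
    """Trivial liveness probe: if this runs at all, the monitor
    thread is alive."""
    connection.write('OK\n')


def health_db_connected(connection, database='main'):
    """Answer OK only if the named database is actually connected.
    Imports are local so the module stays importable outside Zope."""
    import ZODB.interfaces
    import zope.component
    db = zope.component.getUtility(ZODB.interfaces.IDatabase, database)
    if storage_connected(db.storage):
        connection.write('OK\n')
    else:
        connection.write('ERROR: database %r not connected\n' % database)
```

The probes would then be registered as named utilities so five.z2monitor can dispatch them by name on its monitor port.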

Cool. We might extend this to only return OK if all DBs are connected, because we use one DB per site.

If all your databases are served by the same ZEO server I don't see a good reason for one DB being connected and another one not. If it is a configuration problem your instance won't boot; if it's a network or system issue it affects all databases.
Adding a probe is fine, but I'd prefer a third probe (health_all_dbs_connected or so) rather than looping over all databases by default (the health check should be as short as possible).

FYI for the adventurous: I have a PR in on the sd_notify() stuff now: https://github.com/zopefoundation/ZServer/pull/8

A late addition to this thread: the Plone Ansible Playbook now uses the method Hector describes (five.z2monitor plus HAProxy's tcp-check) for load-balancer health checks. Thanks to Hector for this great solution!

We haven't done it yet, but in the future we are going to switch to using different ZEO servers for each site, or at least group them. The reason is that you start to get IO/GIL problems when using ZRS with a single ZEO server and 20+ databases. Not sure if future async changes will make that better.
@frisi The other thing I've witnessed is that with ZRS, during a failover, sometimes not all the connections reconnect properly, leaving some DBs giving errors. And that's on an instance-by-instance basis, so it might be good to avoid those instances until they are fixed.

@smcmahon Looking at your PR I couldn't see where you added the monitor into the Zope buildout. Hector's original suggestion used custom code because five.z2monitor is not enough by itself. Are you using collective.monitor?

BTW, something to add to this thread: I believe HAProxy's default TCP health check is a connection check. See https://www.haproxy.com/documentation/aloha/7-0/traffic-management/lb-layer7/health-checks/ "Checking a TCP Port" -> "The check is valid when the server answers with a SYN/ACK packet."

I might be wrong but the conclusion from this is:

  1. Hector's haproxy config doesn't make sense, because there is no "expect" when using a plain TCP port check. All it's checking is that the monitor is up, not whether it's returning "ok". We will test this and report back.

  2. You might not need five.z2monitor at all. Connecting directly to Zope's HTTP port with a tcp-check might tell you enough. It will still succeed when the ZServer threads are all busy, which is a plus. Not sure what it will do if the DB is disconnected, though.

OK, further on the docs say "option tcp-check: Enables and allows tcp-check comment, connect/ send/ send-binary / expect sequences", so "expect" should work for tcp-check.
However, point 2 still stands: you can get decent LB behaviour just using the TCP port check on the normal Zope server. The mistake to avoid is using an HTTP check (option httpchk) on the Zope HTTP port, as that will make the service go down when it's busy, which isn't what you want.
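For concreteness, a tcp-check sequence with an "expect" against the monitor port could look roughly like the sketch below. This is not Hector's actual config; the ports, probe name, and expected reply are assumptions, chosen to match a probe that answers "OK":

```
backend plone
    option tcp-check
    tcp-check connect
    # send a probe name to the monitor port and require its reply
    tcp-check send ok\n
    tcp-check expect string OK
    # traffic goes to 8080; the health check talks to the monitor on 8888
    server instance1 127.0.0.1:8080 check port 8888 inter 2s rise 2 fall 3
```

Without the `tcp-check expect` line, only the connect (and send) would be validated, which is the concern raised in point 1 above.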

That's an interesting approach, but it could be harmful; remember this?

BTW, you're checking the wrong documentation; you should use the HAProxy open source version's docs, not Aloha's:

http://cbonte.github.io/haproxy-dconv/1.8/configuration.html#4-option%20tcp-check

If you are having network connection problems, either they affect both the monitor and the HTTP port, in which case either health check is as good, or they affect the HTTP port but not the monitor port... in which case the monitor health check is going to say things are fine when they're not, instead of rerouting traffic to a healthy server. That second scenario is the one that worries me the most: cases of unhealthiness that the monitor health check doesn't cover.

@smcmahon Looking at your PR I couldn't see where you added the monitor into the Zope buildout. Hector's original suggestion used custom code because five.z2monitor is not enough by itself. Are you using collective.monitor?

Check the haproxy role changes. I discovered that sending "quit" was adequate to find out whether the monitor was up. No custom Zope code needed.
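That variant could look roughly like this (ports are assumptions, not the playbook's actual values); with no expect line, a successful connect plus the "quit" send is the whole check, so no custom probe code is required:

```
backend plone
    option tcp-check
    tcp-check connect
    # "quit" cleanly closes the monitor session; the connect succeeding
    # is what proves the monitor thread is alive
    tcp-check send quit\n
    server instance1 127.0.0.1:8080 check port 8888
```

The trade-off discussed above still applies: this proves the monitor port answers, not that the instance or its databases are healthy.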