Alternatives to health-checking Zope instances via ICP in the load balancer

Thanks for your thoughts and input, @hvelarde.

You are right: ICP listens on UDP only (verified with `netcat -u localhost <port>`), so out of the box it won't help much with HAProxy health checks.
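To illustrate why a plain TCP health check can't reach an ICP port, here is a small sketch: a UDP socket on an ephemeral port stands in for Zope's ICP listener (the payload is just a placeholder, not a real ICP packet), and a TCP connect to the same port, which is what a default HAProxy `check` amounts to, is refused:

```python
import socket
import threading

def udp_echo_server(sock):
    """Answer one datagram, then exit (stands in for an ICP responder)."""
    data, addr = sock.recvfrom(1024)
    sock.sendto(data, addr)

# A UDP socket bound to an ephemeral port stands in for Zope's ICP listener.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
port = server.getsockname()[1]
threading.Thread(target=udp_echo_server, args=(server,), daemon=True).start()

# UDP reaches it ...
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.settimeout(2)
client.sendto(b"placeholder-query", ("127.0.0.1", port))
reply, _ = client.recvfrom(1024)
print("udp reply:", reply)

# ... but a TCP connect to the same port is refused,
# because nothing listens on TCP there.
try:
    socket.create_connection(("127.0.0.1", port), timeout=2).close()
    tcp_ok = True
except OSError:
    tcp_ok = False
print("tcp connect succeeded:", tcp_ok)
```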

Luckily, in your blog post you came up with a smart alternative that avoids tying up a ZServer thread for the health checks: five.z2monitor.

I had already considered using it to port the Products.ZNagios and munin.zope based monitoring checks, but overlooked that it can also be used for health checks.
I'm eagerly waiting for your second blog post, Hector, so we can discuss and compare HAProxy setups (please link it in this topic, too).

I saw that you are also trying to mark instances as down and up when restarting them with memmon (https://github.com/Supervisor/superlance/issues/102).
I have been using memmon less and less, because restarts tend to happen exactly when the instances are needed most (i.e. under heavy load).
As its longest check interval is hourly (still very often), I started wrapping memmon in a script that is run by a nightly cronjob instead.

Now I'm considering replacing memmon completely with https://github.com/plone/ansible-playbook/blob/master/roles/restart_script/templates/restart_if_hot.py.j2, since I need to define a cronjob anyway and this script can also handle HAProxy up/down and warm up the Zope instance.

Background: on bigger sites it can take minutes for a Zope instance to be ready to serve content. HAProxy will see it as up (Zope responds to localhost:8080/ in the health check, and your five.z2monitor probe will report "OK"), but the first visitor's request to localhost:8080/Plone/ may take seconds or even minutes to be served.

Currently I need to do the following to restart a project's Zope instances after an update/hotfix without downtime:

  • bin/supervisorctl restart instance1
  • set up an SSH tunnel to instance1 and request /Plone there to see if it's ready
  • repeat for instance2, 3, 4, ...

I liked the idea behind @smcmahon's scripts, which also warm up the instances after a restart by visiting a configurable set of URLs
(see https://github.com/plone/ansible-playbook/blob/master/roles/restart_script/templates/restart_clients.sh.j2).
I planned to port this to Python using the requests API and to install and configure it in my buildouts via zc.recipe.egg.

If I understand correctly, this is where @Rotonen is trying to improve things with sdnotify too, right?

Of course, it would be great if all this were handled out of the box by supervisor/memmon (and maybe systemd; I'm just not into that topic yet).

Maybe we can agree on some best practices and join/coordinate efforts here, instead of everyone doing their own thing:

  • Use five.z2monitor for health checks; add the "OK" probe to this package or create a new one (collective.haproxycheck?).
    Does the probe need to be smarter (e.g. take care of warming up the instance)?

  • Use memmon (and add before/after-restart scripts to supervisor) or restart_if_hot
    (currently I'm in favor of restart_if_hot; see above).

  • Start up instances (all at once) with supervisor and have a graceful_restart_instance(s) script
    (still needed to restart multiple instances without downtime automatically, even if warming up is done in the health checks).
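For the first point, a zc.monitor-style probe (which is what five.z2monitor builds on) is, as far as I can tell, just a callable that writes to the connection it is given. A minimal "OK" probe could look like this; the registration details are sketched from memory, so double-check against five.z2monitor's own probes:

```python
def ok(connection):
    """Health-check probe: reply "OK" so the load balancer knows the
    instance is alive, without using up a ZServer thread.

    zc.monitor passes in a connection object; whatever is written to it
    is sent back over the monitor port.
    """
    connection.write("OK\n")

# Registration is done in ZCML as a named utility providing zc.monitor's
# plugin interface -- check five.z2monitor's configure.zcml for the exact
# incantation, roughly:
#
#   <utility
#       component=".probes.ok"
#       provides="zc.monitor.interfaces.IMonitorPlugin"
#       name="ok"
#       />
```

HAProxy could then query the monitor port directly instead of an HTTP URL (the exact `tcp-check` configuration is something to work out in this thread).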

@smcmahon, as you have obviously dealt with similar problems and came up with your Ansible scripts, you might have some useful tips or comments to share here.

@jensens, as you were using Squid years ago (snowsprints :wink: ) and also do high-performance setups: maybe you can share your knowledge on load balancing (ICP server?) and managing multiple instances in this thread, too.