Upgrades in ZMI with Varnish load balancer don't work. Sticky issue?

Hi,

Running an upgrade (ZMI > portal_setup > Upgrades) behind a Varnish load balancer that is configured with more than one client fails silently.

Plone 5.2.1
ZEO installation
4 Clients
Varnish vcl 4.0
Varnish load balancer in mode 'round robin'

Varnish VCL snippet:

```
vcl 4.0;
...
sub vcl_init {
    new balancer_0 = directors.round_robin();
    balancer_0.add_backend(backend_000);
    balancer_0.add_backend(backend_001);
    balancer_0.add_backend(backend_002);
    balancer_0.add_backend(backend_003);
}
...
```

Upgrades work when only a single client is in the load balancer.

Is there a known 'stickiness' issue of some kind?

Thank you!

If you look at the zeoclient.log files from all backends, is none of the backends starting/logging the upgrade step output?

If you configure your backends like this in Varnish with a round robin director, you don't have any stickiness: the director won't keep sending your requests to the same backend.
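If you did want stickiness, Varnish can provide it with a hash director instead of a round robin one. A minimal sketch, assuming the same backend names as the snippet above (`client.identity` defaults to the client IP, so each client is consistently mapped to one backend):

```
vcl 4.0;
import directors;

sub vcl_init {
    # Hash director: the same key always maps to the same backend.
    new balancer_0 = directors.hash();
    balancer_0.add_backend(backend_000, 1.0);
    balancer_0.add_backend(backend_001, 1.0);
    balancer_0.add_backend(backend_002, 1.0);
    balancer_0.add_backend(backend_003, 1.0);
}

sub vcl_recv {
    # Pick the backend by hashing the client identity (client IP by default).
    set req.backend_hint = balancer_0.backend(client.identity);
}
```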

But OTOH this shouldn't be an issue: you send the command to start the upgrade for a profile in the ZMI to one of the backends; it performs the required actions and returns a response.

Only: if the upgrade step takes longer than any of the timeout values configured in your Varnish or reverse-proxy webserver, the connection is closed before your browser receives the result HTML. But the upgrade should still have been executed. If not, there is another issue; you can find it in zeoclient.log and try to solve it from there.

For this reason I mostly use a separate ssh tunnel to connect to the ZMI of one zeoclient in such larger environments, to avoid any timeouts.
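Such a tunnel could look like this (a hypothetical sketch: the hostname, username, and the assumption that the zeoclient listens on 127.0.0.1:8081 are all placeholders for your own setup):

```shell
# Forward local port 8081 straight to one zeoclient on the server,
# bypassing Varnish and its timeouts entirely.
ssh -L 8081:127.0.0.1:8081 user@your-plone-server

# Then open http://localhost:8081/manage in your browser and run the
# upgrade there; only the ssh connection's own settings apply.
```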

If you happen to use plone.recipe.varnish, its configuration parameters for the timeouts are first-byte-timeout and between-bytes-timeout,

resp. `.first_byte_timeout` and `.between_bytes_timeout` in your varnish.vcl settings:

```
backend backend_000 {
    .host = "127.0.0.1";
    .port = "1234";
    .connect_timeout = 0.4s;
    .first_byte_timeout = 300s;
    .between_bytes_timeout = 60s;
    .probe = backend_probe;
}
```

Ok, I just had this effect on a Plone 6 live site with Traefik as the load balancer in front. After I reduced the number of replicas to one (at night, and only to work around the problem for the upgrade), the upgrade steps ran successfully.

This is really something to investigate, and at first look I have no idea what's going on there. No errors were logged in my Sentry.
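As a possible alternative to reducing replicas, Traefik v2 supports sticky sessions via a cookie on the service's load balancer. A hedged sketch using Docker labels (the service name `plone` and the cookie name are assumptions for illustration, not from the setup described above):

```yaml
# Hypothetical docker-compose labels: Traefik sets a cookie so the same
# browser session keeps hitting the same Plone replica.
labels:
  - "traefik.http.services.plone.loadbalancer.sticky.cookie=true"
  - "traefik.http.services.plone.loadbalancer.sticky.cookie.name=plone_sticky"
```

That would keep the ZMI upgrade request and any follow-up requests on one replica without taking the others out of rotation.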
