Upgrades in ZMI with Varnish load balancer don't work. Sticky issue?

maha · April 21, 2021, 3:20pm

Hi,

Running an upgrade (ZMI > portal_setup > Upgrades) behind a Varnish load balancer that is configured with more than one client fails silently.

Plone 5.2.1
ZEO installation
4 Clients
Varnish VCL 4.0
Varnish load balancer in 'round robin' mode

Varnish VCL snippet:

vcl 4.0;
...
sub vcl_init {
    new balancer_0 = directors.round_robin();
    balancer_0.add_backend(backend_000);
    balancer_0.add_backend(backend_001);
    balancer_0.add_backend(backend_002);
    balancer_0.add_backend(backend_003);
}
...

Upgrades work when the load balancer uses only a single client.

Is there a known kind of 'sticky' issue?

Thank you!

fredvd · April 22, 2021, 1:58pm

If you look at the zeoclient.log files of all backends, does none of them log the start or the output of the upgrade step?

If you configure your backends like this in Varnish with a round robin director, you don't have any stickiness: the director will not keep sending your requests to the same backend.
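To make the missing stickiness concrete, here is a minimal Python sketch of round-robin routing. It is not how Varnish is implemented; the `route` helper and the request labels are made up for illustration, and only the backend names come from the VCL snippet above. In a real browser session other requests (CSS, images) are interleaved as well, so the backend that receives the form submission is effectively arbitrary.

```python
from itertools import cycle

# Backend names taken from the VCL snippet in the first post.
backends = ["backend_000", "backend_001", "backend_002", "backend_003"]

# A round-robin director simply hands out backends in turn.
director = cycle(backends)


def route(request):
    """Hypothetical helper: pair a request with the next backend in line."""
    return request, next(director)


# Loading the upgrades listing and submitting the form are two separate requests.
listing = route("GET upgrades listing")
submit = route("POST selected upgrades")

print(listing)
print(submit)
```

The two requests land on different backends, so anything the first backend generated only in its own process memory is unknown to the second one.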

But OTOH this shouldn't be an issue: you send the command to start the upgrade for a profile in the ZMI to one of the backends, it performs the actions required and returns a response.

One caveat: if the upgrade step takes longer than any of the timeout values configured in Varnish or in your reverse proxy web server, the connection is closed before your browser receives the resulting HTML. The upgrade should still have been executed, though. If it was not, there is another issue, which you should be able to find in the zeoclient.log and then solve.
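The timeout reasoning can be sketched in a few lines of Python. This is only an analogy under stated assumptions: a thread stands in for the zeoclient, `worker.join(timeout=...)` stands in for the proxy giving up on the response, and the names `run_upgrade`, `request_via_proxy` and the durations are invented for the example.

```python
import threading
import time

# Stand-in for the site state that the upgrade step changes.
state = {"upgraded": False}


def run_upgrade(duration):
    """Hypothetical upgrade step that takes `duration` seconds on the backend."""
    time.sleep(duration)
    state["upgraded"] = True


def request_via_proxy(duration, proxy_timeout):
    """Start the upgrade and wait for it only as long as the proxy timeout allows."""
    worker = threading.Thread(target=run_upgrade, args=(duration,))
    worker.start()
    worker.join(timeout=proxy_timeout)  # the proxy stops waiting here
    timed_out = worker.is_alive()       # the backend is still busy
    return worker, timed_out


worker, timed_out = request_via_proxy(duration=0.2, proxy_timeout=0.05)
worker.join()  # the backend finishes regardless of the closed connection

print("proxy timed out:", timed_out)
print("upgrade executed:", state["upgraded"])
```

The browser never sees a result page in this scenario, yet the work is done, which is why the zeoclient.log is the place to check.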

For this reason I mostly use a separate SSH tunnel to connect to the ZMI of one zeoclient in such larger environments, to avoid any timeouts.

If you happen to use plone.recipe.varnish, its configuration parameters for the timeouts are first-byte-timeout and between-bytes-timeout.

These correspond to .first_byte_timeout and .between_bytes_timeout in the backend settings of your varnish.vcl:

backend backend_000 {
    .host = "127.0.0.1";
    .port = "1234";
    .connect_timeout = 0.4s;
    .first_byte_timeout = 300s;
    .between_bytes_timeout = 60s;
    .probe = backend_probe;
}

jensens · April 27, 2021, 11:13pm

OK, I just had this effect on a Plone 6 live site with Traefik as load balancer in front. After I reduced the number of replicas to one (at night, and only to work around the problem for the upgrade), the upgrade steps ran successfully.

This is really something to investigate, and at first glance I have no idea what's going on there. No errors have been logged in my Sentry.

wesleybl · September 8, 2021, 6:58pm

Any news on this subject? I came across this issue in Plone 5.2.4.

fredvd · September 8, 2021, 7:24pm

Now that I see this issue pop up again on the forum... Could it be that when the request takes too long, Varnish decides the first backend is down and retries the request on the second zeoclient, which then causes a ZODB ConflictError and aborts the attempts on both zeoclients? Although that wouldn't be 'silent'; I'd expect to see something in the logs.

wesleybl · September 8, 2021, 8:09pm

The upgrade step doesn't take long. I don't see anything in the log.

wesleybl · September 8, 2021, 8:21pm

I'm assuming it's something related to CSRF. When I load the upgrades list screen, one instance generates a validation token. When I save, another instance responds and does not validate the token generated by the first instance.

That's all an assumption. I didn't investigate this.

wesleybl · September 8, 2021, 8:26pm

The form has an input

<input name="_authenticator" type="hidden" value="dc8d377f3038018417010d02ad4f24e7c301f558">

This _authenticator must have some relationship to the instance, and one instance cannot validate that of another instance.

jensens · September 8, 2021, 8:29pm

No, this effect is there even with very simple and short upgrades.

jensens · September 8, 2021, 8:36pm

The token (for details, see the package plone.protect) is stored using a persistent utility in the ZODB (using plone.keyring). And it is used everywhere in the Plone UI (on add/edit/...). Why should it fail in a ZMI form with multiple instances?

Btw.: I still have this effect in a 5.2-latest site (load balanced with Varnish/Traefik) and in a 6 site, also load balanced (Traefik only). It is annoying, but once I configure the cluster down to one instance, upgrades are running.
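The argument about the shared storage can be illustrated with a small Python sketch. It is not the plone.protect code: it assumes, for illustration only, that the token is an HMAC over the user id computed with a secret that every instance reads from the same ZODB. `SHARED_KEY`, `make_token` and `check_token` are hypothetical names.

```python
import hashlib
import hmac

# Assumption: one secret shared by all instances, standing in for the
# keyring that lives in the ZODB. The value is made up.
SHARED_KEY = b"hypothetical-shared-secret"


def make_token(user_id, key=SHARED_KEY):
    """Derive a form token for a user from the shared secret."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha1).hexdigest()


def check_token(user_id, token, key=SHARED_KEY):
    """Validate a submitted token; any instance holding the key can do this."""
    return hmac.compare_digest(make_token(user_id, key), token)


# "Instance 1" renders the form, "instance 2" receives the submission.
token_from_instance_1 = make_token("admin")
valid_on_instance_2 = check_token("admin", token_from_instance_1)

# A token that was changed in the browser no longer validates.
tampered_is_valid = check_token("admin", "0" * 40)

print(valid_on_instance_2, tampered_is_valid)
```

Under that assumption a token issued by one instance validates on every other instance, so the token alone would not explain a failure that only shows up with several clients.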

wesleybl · September 8, 2021, 9:47pm

You're right. The problem is not the _authenticator. I changed it in the browser and submitted the form. Then this appeared in the log:

zExceptions.Forbidden: Form authenticator is invalid.

That doesn't happen when we leave the value as it comes.

yurj · September 9, 2021, 8:37am

Maybe local caching? Stale entries somewhere?

wesleybl · September 9, 2021, 12:51pm

I noticed that at each start of an instance, the value attribute of each step's checkbox receives a new value:

<input type="checkbox" name="upgrades:list" value="6066838800268338841" checked="checked" id="6066838800268338841">

I ran the following test. I started the instance, went to the steps listing page and noted the checkbox value. I restarted the instance, reloaded the listing and saw that the value attribute was different from the first time.

So each instance has a different value. When Varnish serves the steps listing from one instance and then sends the save request to another instance, the values don't match.
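One mechanism that would produce exactly this observation is an id derived from Python's built-in `hash()` of a string: on Python 3 string hashes are salted per process unless `PYTHONHASHSEED` is fixed. Whether the upgrade step ids are really generated that way is an assumption here; the thread does not confirm it. The sketch below only demonstrates the Python behaviour. `STEP_KEY` and both helper functions are hypothetical, and the two seeds are pinned merely to simulate two separately started instances in a repeatable way.

```python
import hashlib
import os
import subprocess
import sys

# Hypothetical description of an upgrade step (title and versions are made up).
STEP_KEY = "Example upgrade step|1000|1001"


def builtin_hash_in_new_process(text, seed):
    """Return abs(hash(text)) as computed by a freshly started interpreter."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    code = "import sys; print(abs(hash(sys.argv[1])))"
    result = subprocess.run(
        [sys.executable, "-c", code, text],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def stable_id(text):
    """Process-independent id based on a digest of the same text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# Two interpreters with different hash seeds stand in for two instances.
id_on_instance_a = builtin_hash_in_new_process(STEP_KEY, seed=1)
id_on_instance_b = builtin_hash_in_new_process(STEP_KEY, seed=2)

print(id_on_instance_a, id_on_instance_b)
print(stable_id(STEP_KEY))
```

If the ids do come from `hash()`, a digest-based id like `stable_id` would be identical on every instance, and starting all instances with the same `PYTHONHASHSEED` might serve as a workaround; both points are untested assumptions.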