If you look at the zeoclient.log files from all backends, does none of them show the upgrade step starting or logging its output?
If you configure your backends in Varnish with a round-robin director, you don't have any stickiness: the director won't keep sending your requests to the same backend.
But OTOH this shouldn't be an issue: you send the command to start the upgrade for a profile in the ZMI to one of the backends, it performs the actions required and returns a response.
The only caveat: if the upgrade step takes longer than any of the timeout values configured in your Varnish or reverse-proxy webserver, the connection is closed before your web browser receives the result HTML. But the upgrade should still have been executed. If not, there is another issue; you can find it in the zeoclient.log and try to solve it.
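To illustrate why the upgrade still completes even though the browser sees a timeout: a minimal stdlib sketch (not Plone-specific) where a slow handler keeps running after the client gives up, mimicking a long upgrade step behind an impatient proxy.

```python
# Sketch: server-side work finishes even though the client timed out.
import threading, time, urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

work_done = threading.Event()

class SlowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(1.0)          # stands in for a long upgrade step
        work_done.set()          # the "upgrade" finished server-side
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"done")
        except (BrokenPipeError, ConnectionResetError):
            pass                 # client is already gone
    def log_message(self, *args):
        pass                     # silence request logging

server = HTTPServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

try:
    # The client (think: Varnish/browser) aborts long before the reply.
    urllib.request.urlopen(f"http://127.0.0.1:{port}/", timeout=0.2)
except Exception as exc:
    print("client gave up:", type(exc).__name__)

work_done.wait(5)                # ...but the work still completed
print("server finished work:", work_done.is_set())
server.shutdown()
```

The point is that the aborted connection only loses the result page, not the work itself, which is why checking zeoclient.log is the reliable way to see what happened.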
For this reason I mostly use a separate ssh tunnel to connect to the ZMI of one zeoclient in such larger environments, to avoid any timeouts.
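Such a tunnel can look like this (hostname, user, and port 8080 are placeholders; use your zeoclient's actual HTTP port):

```shell
ssh -L 8080:127.0.0.1:8080 user@your-plone-server
```

Then open http://localhost:8080/manage in your browser; the request goes straight to that one zeoclient, bypassing Varnish and its timeouts.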
If you happen to use plone.recipe.varnish, its configuration parameters for the timeouts are first-byte-timeout and between-bytes-timeout, which correspond to the .first_byte_timeout and .between_bytes_timeout backend settings in your varnish.vcl.
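In a hand-written varnish.vcl, those backend settings look roughly like this (host, port, and the timeout values are examples, not recommendations):

```vcl
backend plone1 {
    .host = "127.0.0.1";
    .port = "8081";
    # Raise these if upgrade steps run longer than Varnish's defaults:
    .first_byte_timeout = 1800s;
    .between_bytes_timeout = 600s;
}
```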
Ok, I just had this effect on a Plone 6 live site with Traefik as load balancer in front. After I reduced the number of replicas to one (at night time, and only to work around the problem for the upgrade), the upgrade steps ran successfully.
This is really something to investigate; at first look I have no idea what's going on there. No errors have been logged in my Sentry.
Now that I see this issue pop up again on the forum... Could it be that when the request takes too long, Varnish decides the first backend is down and retries the request on the second zeoclient, which then causes a ZODB ConflictError and aborts both attempts on both zeoclients? Although that wouldn't be 'silent'; I'd expect to see something in the logs.
I'm assuming it's something related to CSRF. When I load the upgrades list screen, one instance generates a validation code. When I save, another instance responds and does not validate the code generated by the previous instance.
That's all an assumption. I didn't investigate this.
The token (for details see the plone.protect package) is stored as a persistent utility in the ZODB (using plone.keyring), and it is used everywhere in the Plone UI (on add/edit/...). Why should it fail in a ZMI form with multiple instances?
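That's the crux: an HMAC-style token derived from a shared secret validates on any instance that shares the secret, which is exactly what a keyring persisted in the ZODB provides. The following is an illustration only, not plone.protect's actual code; the function names and the secret are made up.

```python
# Sketch: with a shared secret, a token issued by instance A
# validates on instance B -- the plone.keyring scenario.
import hmac, hashlib

def create_token(secret: bytes, user: str) -> str:
    return hmac.new(secret, user.encode(), hashlib.sha256).hexdigest()

def validate_token(secret: bytes, user: str, token: str) -> bool:
    return hmac.compare_digest(create_token(secret, user), token)

shared_secret = b"keyring-secret-from-zodb"   # same for all instances

# "Instance A" issues the token, "instance B" validates it:
token = create_token(shared_secret, "admin")
print(validate_token(shared_secret, "admin", token))
```

So as long as all zeoclients read the same keyring from the same ZODB, the instance that renders the form and the instance that handles the save should agree on the token.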
Btw.: I still have this effect in a 5.2-latest site (load balanced (Varnish/Traefik) and in a 6 site, also load balanced (only Traefik). It is annoying, but once I configure the cluster down to one instance, upgrades are running.
I ran the following test. I started the instance, went to the steps listing page, and noted the checkbox value. I restarted the instance, reloaded the listing, and saw that the value attribute differed from the first time.
So each instance has a different value. When Varnish serves the step listing from one instance and, on save, sends the request to another instance, the values don't match.
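The observed behaviour matches the hypothetical failure mode where each instance derives tokens from its own (e.g. freshly generated) secret instead of a shared one; sketched in stdlib Python, with all names invented for illustration:

```python
# Sketch: per-instance secrets -> a token issued by instance A
# does NOT validate on instance B.
import hmac, hashlib, os

def create_token(secret: bytes, user: str) -> str:
    return hmac.new(secret, user.encode(), hashlib.sha256).hexdigest()

secret_a = os.urandom(32)   # instance A's private secret
secret_b = os.urandom(32)   # instance B's private secret

token = create_token(secret_a, "admin")   # listing rendered by instance A
valid_on_b = hmac.compare_digest(create_token(secret_b, "admin"), token)
print("validates on B:", valid_on_b)
```

If that is what happens, the real question becomes why the instances end up with diverging secrets (or diverging token values per restart) despite the keyring living in the shared ZODB.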