On every occurrence of `corepack enable` (in CI too), replace it with `npm i -g corepack@latest && corepack enable`. Also, images updated after that issue are required (18.8.2 at least).
How to fix the corepack issue in the CI workflow
The `corepack enable` command is present e.g. in the project's .github/workflows/frontend.yml around line 50 and needs to be temporarily adjusted as described.
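A hedged sketch of what the adjusted step could look like, assuming the command lives in a plain `run:` step (the step name and surrounding lines in your generated workflow will differ):

```yaml
# Sketch only: replace the existing corepack step in .github/workflows/frontend.yml.
- name: Enable corepack
  run: npm i -g corepack@latest && corepack enable
```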
`corepack enable` was not found anywhere else in the workflows.
I tried it out and can confirm: frontend CI image creation and Manual deployment went smoothly afterwards.
Run Plone locally with a Docker stack
Remark: There is also an occurrence of `corepack enable pnpm` in frontend/Dockerfile near the end. I did not need to modify it, since I run Plone locally without a Docker stack and my global run of `npm i -g corepack@latest && corepack enable` did the job for now.
Guess: you may need to fix it in frontend/Dockerfile as well if you use Docker locally!
@davisagli A big hug for the move! I had overlooked some other locations as well. I include my error logs because I have not found a solution elsewhere so far.
I hit this while trying to get the deployment to work on ARM and was not sure whether it was platform related. The images build fine, but the deployment was not finishing on the server. After rolling the same code out on AMD again, the issue remained. There must still have been some bomb ticking in the frontend.
My footgun was forgetting to change line 32 in frontend/Dockerfile from `corepack enable pnpm` to `npm i -g corepack@latest && corepack enable pnpm`.
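For illustration, a sketch of that change, assuming the command sits in its own RUN instruction (the exact line number and surrounding instructions in your generated Dockerfile may differ):

```dockerfile
# Before (around line 32 in my frontend/Dockerfile):
#   RUN corepack enable pnpm
# After: install the pinned corepack first, then enable pnpm
RUN npm i -g corepack@latest && corepack enable pnpm
```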
In my case the image creation went fine, but the Manual deployment workflow did not work.
You get constantly restarting frontend containers with this signature when you check the logs with `make stack-logs-frontend` from the devops folder:
==> Stack my-plone-volto-project-com: Logs for frontend in context prod
...
my-volto-project-com_frontend.1.3u0sqs69nmwh@kwk | Command failed with signal "SIGTERM"
my-volto-project-com_frontend.1.4e3ajerpdxpm@kwk |
my-volto-project-com_frontend.1.4e3ajerpdxpm@kwk | > project-dev@1.0.0-alpha.0 start:prod /app
my-volto-project-com_frontend.1.4e3ajerpdxpm@kwk | > pnpm --filter @plone/volto start:prod
my-volto-project-com_frontend.1.4e3ajerpdxpm@kwk |
my-volto-project-com_frontend.1.4e3ajerpdxpm@kwk |
my-volto-project-com_frontend.1.4e3ajerpdxpm@kwk | > @plone/volto@18.8.2 start:prod /app/core/packages/volto
my-volto-project-com_frontend.1.4e3ajerpdxpm@kwk | > NODE_ENV=production node build/server.js
my-volto-project-com_frontend.1.4e3ajerpdxpm@kwk |
my-volto-project-com_frontend.1.4e3ajerpdxpm@kwk | API server (API_PATH) is set to: https://my.plone.volto.project.com
my-volto-project-com_frontend.1.4e3ajerpdxpm@kwk | Proxying API requests from https://my.plone.volto.project.com/++api++ to http://backend:8080/Plone
my-volto-project-com_frontend.1.4e3ajerpdxpm@kwk | 🎭 Volto started at 0.0.0.0:3000 🚀
my-volto-project-com_frontend.1.4e3ajerpdxpm@kwk | ELIFECYCLE Command failed.
my-volto-project-com_frontend.1.4e3ajerpdxpm@kwk | /app/core/packages/volto:
my-volto-project-com_frontend.1.4e3ajerpdxpm@kwk | ERR_PNPM_RECURSIVE_RUN_FIRST_FAIL @plone/volto@18.8.2 start:prod: `NODE_ENV=production node build/server.js`
my-volto-project-com_frontend.1.4e3ajerpdxpm@kwk | Command failed with signal "SIGTERM"
my-volto-project-com_frontend.2.v5wziasy1xyn@kwk |
...
Note: this repeats forever, for every reincarnation of the stalling container, after ~60 seconds.
See all changes in the PR in detail (not all may have affected your resulting project):
@acsr I'm hitting the same issue with a new website on the foundation Docker Swarm cluster, with a very recent cookieplone and Volto 18.9.0. I've updated the corepack calls, but another suspect is the Docker Swarm healthcheck. The default for the container image comes from the prod-config Dockerfile in plone-frontend.
I'm now overriding it in my project Dockerfile and increasing the timeouts. Maybe pnpm is taking more than 30 seconds for something and then Docker sees an unhealthy container.
@fredvd Can you spell out the path of the defaults more explicitly? I am not sure I get this. Do you mean project-title/frontend/Dockerfile, or is the prod-config Dockerfile a default deeper in the packages?
I am asking because I want to understand the settings to override.
Can you please specify how you "increase the timeouts"?
This clearly seems to be project-title/frontend/Dockerfile.
Other timeouts?
In my frontend/Dockerfile there is no timeout.
My failure occurs after running the Manual deployment workflow from the GitHub Actions web UI.
Line 49 in my project-title/.github/workflows/manual_deploy.yml sets `deploy_timeout: 480`, which is in minutes, resulting in 8 hours as far as I understand.
This is my Manual Deployment log from GitHub around the deployment failure:
Update: more than 8h later:
The end of the GitHub Actions log for Deploy to Cluster in Manual Deployment was eventually updated, finally replacing the state "Error: This deployment will not complete" with:
...
Deploy: Checking status
Service kwk-dev-acsr-de_backend state: replicating 0/2
Service kwk-dev-acsr-de_db state: deployed
Service kwk-dev-acsr-de_frontend state: deployed
Service kwk-dev-acsr-de_purger state: deployed
Service kwk-dev-acsr-de_traefik state: deployed
Service kwk-dev-acsr-de_varnish state: deployed
Service kwk-dev-acsr-de_backend state: deployed
Service kwk-dev-acsr-de_frontend state: replicating 0/2
Error: Timeout exceeded
Deploy: Failed
> Can you please specify how you "increase the timeouts"?
> Other timeouts?
> In my frontend/Dockerfile there is no timeout.
There is some 'inheritance' in play here. When CI/CD builds the frontend image from the Dockerfile in your project definition, it 'inherits' or builds upon an already generated standard image. In the project where I'm struggling with the same issue, that is done on this line:
Because of this `FROM ... AS`, it inherits the Dockerfile HEALTHCHECK statement from the built plone/server-prod-config:
Unless you override it again in your 'final' project Dockerfile, which I did to test whether it was a HEALTHCHECK timeout:
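A hedged sketch of such an override at the end of the project's frontend/Dockerfile; the probe command and the timing values are assumptions for illustration, not the exact statement inherited from the base image:

```dockerfile
# Hypothetical HEALTHCHECK override; adjust the probe to whatever the base image uses
# and make sure the probe binary (wget here) actually exists in the image.
# A longer start-period/timeout gives the pnpm/Node SSR process more time before
# Docker Swarm declares the container unhealthy and kills it.
HEALTHCHECK --interval=30s --timeout=60s --start-period=120s --retries=3 \
  CMD wget -q -O /dev/null http://127.0.0.1:3000/ || exit 1
```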
Now that I've increased the HEALTHCHECK timeout to one minute, I have a bit of time to enter the container and try to inspect what is going on. There are a number of processes consuming CPU. But the healthcheck indeed fails, both the programmed one and a simple telnet:
@fredvd I am also looking at the same project, since I am preparing the content for that site. I found your HEALTHCHECK addition in the Dockerfile there just a few minutes ago, but had no time to dig deeper because we have a related meeting this afternoon.
Is it worth digging in deeper here, or should I wait until you get a grip on it? I guess the origin of the pitfall is beyond my scope for now, unless I end up being the fool who dug the trap myself.
Does it make sense to reproduce your experimental change and see if I get the same effects? I am not sure where exactly to look.
No, please wait. I'm on it now; this is sysadmin poking and prodding to find out what is happening. I just realised that the localhost <> 127.0.0.1 difference may be an issue as well. I'll investigate.
I might set the healthcheck to 2-5 minutes so I have longer to peek around in the system.
So the problem with tagung.plone.de was a really silly one. I thought about it when checking possible causes and things I changed last Friday, and then forgot to actually check it.
The backend needs to be up and running, but it also needs to have a valid Plone site configured. When the frontend starts up, the Volto SSR server requests the main (portal) root, and that request HAS to respond with a 200.
If anything else is returned, a connection is refused, or a 404 Not Found comes back (when no Plone site exists yet), you get the current behavior with the latest images: the pnpm process 'spins' and the container gets killed by the (standard) container healthcheck settings after X seconds. Rinse and repeat.
I kept telling myself this afternoon that there used to be more debug output, because I remember making the same mistake before and then recognising and fixing it... Now there is no indication whatsoever. We also switched from yarn to pnpm; maybe yarn was more chatty. But I'm not so sure, perhaps I'm just getting old.
I would love to add a preflight check to the frontend startup that reports in the default container output whether the backend is available and whether there is a site that returns 200. It would save future users a lot of searching.
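A minimal sketch of such a preflight, assuming a POSIX shell and curl are available in the image and that the internal API root is http://backend:8080/Plone; the PREFLIGHT_API_ROOT variable and the wiring into the entrypoint are hypothetical, not part of the official images:

```sh
#!/bin/sh
# Hypothetical preflight: verify that the backend answers 200 on the site root
# before handing over to the Volto SSR server.
API_ROOT="${PREFLIGHT_API_ROOT:-http://backend:8080/Plone}"
STATUS=$(curl -s -o /dev/null -w '%{http_code}' "$API_ROOT")
if [ "$STATUS" != "200" ]; then
  echo "Preflight failed: $API_ROOT answered HTTP $STATUS (expected 200)."
  echo "Is the backend up, and has the Plone site been created (make stack-create-site)?"
  exit 1
fi
exec pnpm --filter @plone/volto start:prod
```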
@acsr one more thing: the basic auth has an issue; I'm on it now.
@fredvd Gotcha! I also ran into this missing step: once I killed the swarm and redeployed, the existing content including the site was gone (or not).
Why the Manual Deployment skips the create-site step when starting from scratch is worth a closer look.
Without any other effort, a `make stack-create-site` from the devops folder created a new Plone site; the server is up and Plone is working.
Usually I write all the steps down and repeat them for every procedure I reuse.
When starting from the devops folder, I always did:
make stack-deploy
make stack-create-site
But after the pnpm pitfall I started to do
make stack-deploy
make stack-status
and started to focus on the frontend pnpm issues, ignoring the step needed in the backend to create the site.
Later, when the frontend pnpm issue was solved, I was stuck with this procedure. I had seen these restarting frontend containers before, when I missed the create-site step, but forgot to take detailed notes because it seemed obvious that I would never make the mistake again.
Manual Deployment now succeeded at once. I am still wondering why an initial Manual Deployment fails and still needs a `make stack-create-site` before it succeeds. I need to retry that on ARM as well.