Jenkins Node performance

petschki · October 21, 2024, 7:22am

While porting the robotframework tests, I created a new experimental job, pr-6.1-py3.12-robottest-only [Jenkins], which runs "robottests only" to get faster results for our changes.

Here are the runtimes on each Node for this job (ROBOTSUIT_PREFIX=ONLYROBOT results in 160 Scenarios for Plone 6.1):

Node 1: 9min 55sec -> success
Node 3: permanently offline
Node 4: 22min -> failure (timeouts)

I've therefore "pinned" the PR-6.1 jobs to Node 1 for now to get stable, successful results, but I wanted to know whether the people in charge of running Jenkins are aware of this.

/cc @gforcada @tisto @mauritsvanrees @fredvd @alert

EDIT:
I think there might be a problem when more jobs are running in parallel on the same node ... see Plone 6.1 - Python 3.10 - Robot Framework Tests (chrome) #384 Console [Jenkins], where it says: Could not connect to the playwright process at port 46082.

and this one, running in parallel, succeeded: Plone 6.1 - Python 3.12 - Robot Framework Tests (chrome) #622 [Jenkins]

mauritsvanrees · October 21, 2024, 8:15am

I am not aware of differences between the nodes, except that node 3 has indeed been down for a while now.
But I have often noticed that parallel runs on the same node can easily go wrong, so there is no full isolation. I try not to start many jobs at once if I can avoid it.

Moving to running all tests in a Docker container could help, then they would surely be isolated. But that takes effort.
Or move everything over to GitHub, but that takes effort as well, rewiring all the nice things that we currently have with Jenkins and mr.roboto.
Or move all 100 or so plone packages to a mono repo, but this also takes effort.

alert · October 21, 2024, 8:44am

I restarted node 3.
Node 3 and node 4 are VMs hosted on machines that also have other things to do and they are also quite old, but last time I checked they were actually faster than node 1.
The situation might have changed.

petschki · October 21, 2024, 9:06am

Thanks for the info ... I've started the robottest builds once again sequentially and all is green now. I've also unpinned the 6.1 Jobs in order to let them choose on which node they are running ...

petschki · October 23, 2024, 4:49pm

Meanwhile Node3 and Node4 do have severe resource problems

alert · October 23, 2024, 7:47pm

Sorry about that, but I do not really control the status of the VMs, just the host.
Anyway, I rebooted both of them and node3 is back online.

Node 4 has been busy for several minutes with this:

No other VM on that server is having any issue at all.

petschki · October 23, 2024, 9:34pm

IIRC during the jenkins rfbrowser config session with @gforcada last week we saw a full disk on Node3 ... Node4 is maybe full too. Not sure about that but maybe the jobs need much more space on disk because of the multiple rfbrowser init commands.
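A quick way to check the "full disk" suspicion on a node is to look at the free space of the filesystem that holds the job workspaces. The following is only a minimal sketch using the Python standard library; the path to inspect and the 90% warning threshold are assumptions, not values taken from the Jenkins setup.

```python
import shutil


def disk_report(path="/", warn_fraction=0.9):
    """Summarize disk usage for the filesystem containing `path`.

    `warn_fraction` is an arbitrary example threshold, not a Jenkins setting.
    """
    usage = shutil.disk_usage(path)
    # Guard against pseudo-filesystems that report a total size of zero.
    used_fraction = usage.used / usage.total if usage.total else 0.0
    return {
        "path": path,
        "free_gib": usage.free / 2**30,
        "used_fraction": used_fraction,
        "nearly_full": used_fraction >= warn_fraction,
    }


if __name__ == "__main__":
    report = disk_report("/")
    print(
        f"{report['path']}: {report['free_gib']:.1f} GiB free, "
        f"{report['used_fraction']:.0%} used, nearly full: {report['nearly_full']}"
    )
```

Running this on Node3 and Node4 before and after a robot test job would show whether the repeated browser downloads are what fills the disk.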

gforcada · October 25, 2024, 7:55am

I think there is something not configured properly there, as every time a job with robot tests is run browsers get downloaded over and over again.

I remember configuring it to keep the browsers in a global folder, but maybe with the last changes that has to be updated.

I've been having problems ssh'ing into Node3 or Node4 lately, and they are usually rather slow (one notices that right away while doing apt-get update, for example).

1letter · October 25, 2024, 1:34pm

If you configure the browsers globally, they must be kept in sync with the robotframework version currently in use, and an ENV var is needed.

petschki · October 28, 2024, 2:29pm

@gforcada I can confirm that with the PLAYWRIGHT_BROWSERS_PATH environment var the browsers only get installed once. I additionally created a PR which only installs chromium for headlesschrome since we do not use any other browsers in robottests. See Only install chromium browser for robottests by petschki · Pull Request #374 · plone/jenkins.plone.org · GitHub
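As a rough illustration of the idea (not the actual Jenkins job configuration), a job step could point PLAYWRIGHT_BROWSERS_PATH at a shared directory before initializing the browser. The directory below is a hypothetical placeholder, and the sketch assumes that the installed robotframework-browser version accepts a browser name as an argument to `rfbrowser init`.

```python
import os
import subprocess

# Hypothetical shared location; the real path on the Jenkins nodes is not
# stated in this thread.
SHARED_BROWSERS_DIR = "/var/cache/playwright-browsers"


def browser_env(shared_dir=SHARED_BROWSERS_DIR, base_env=None):
    """Return a copy of the environment with PLAYWRIGHT_BROWSERS_PATH set."""
    env = dict(os.environ if base_env is None else base_env)
    env["PLAYWRIGHT_BROWSERS_PATH"] = shared_dir
    return env


def init_command(browser="chromium"):
    """Build the init command for a single browser.

    Assumption: the installed rfbrowser supports `rfbrowser init <browser>`.
    """
    return ["rfbrowser", "init", browser]


def install_browsers(shared_dir=SHARED_BROWSERS_DIR):
    """Create the shared directory and run the browser installation in it."""
    os.makedirs(shared_dir, exist_ok=True)
    subprocess.run(init_command(), env=browser_env(shared_dir), check=True)
```

The robot test run itself needs the same variable in its environment, otherwise Playwright looks for the browsers in its default location.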

gforcada · October 28, 2024, 8:39pm

Thanks! Jenkins jobs are updated, please give them a try!

petschki · October 29, 2024, 10:02am

What's the scenario if we want to contribute an additional Node for Jenkins?

We have an IONOS dedicated server in Germany which has some resources available (500G, 16 cores), so we could set up a virtualized box there (libvirt).

I saw this package GitHub - plone/plone.jenkins_node: Ansible Galaxy Playbook for a jenkins node but I need some guidance there.