CI run speedups for buildout.coredev

I've been fiddling with getting our test runner wrapper to run the buildout.coredev tests. The original wrapper is by @jone and the current rewrite is mostly mine. Usability improvements and print safety of the current one are in large part thanks to feedback and contributions from @lukasgraf.

Our original problem was very large, long-running layers, so we simply split by module, mapped those onto the CPU cores, and made sure each had a unique ZSERVER_PORT (we also used the Jenkins port allocator to allow multiple such builds to run in parallel on the same CI server). The issue with the buildout.coredev tests is different, but the current wrapper does a half-decent job of cutting down their wallclock runtime as well.


@thet can you give the 5.1 branch a timing run on your server again at this point? Install PhantomJS and run the tests niced.

I'm now AFAIK rather close to the truth on:

  1. Discovering all the layers (some layer names have spaces in them!) and tests
    1.1 Dexterity tests
    1.2 Archetype tests
    1.3 Doctests
    1.4 Robot tests
  2. Detecting if PhantomJS is installed
  3. Splitting large layers into multiple test run batches per test class
  4. Assigning a unique ZSERVER_PORT to all batches
    4.1 Already baked in is also support for the Jenkins port allocator plugin
  5. If something fails, giving the user the exact command line to copy paste and run (including single quoting names with spaces in them!)
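Point 5 could be implemented along these lines (a sketch; the helper name is hypothetical and the real wrapper may build the command differently):

```python
import shlex

def repro_command(failed_test_names):
    """Build a copy-pasteable bin/test command for failed tests,
    single quoting any names that contain spaces (some layer and
    test names do)."""
    parts = ["bin/test"]
    for name in failed_test_names:
        parts += ["-t", shlex.quote(name)]
    return " ".join(parts)
```

shlex.quote leaves plain identifiers untouched and single-quotes anything a shell would mangle, which is exactly what a copy-paste reproduction line needs.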

And as a new thing: asserting on an unclean STDERR. This can get annoying if there are mysterious transient failures, like recursion limits being hit and flaring up, as we've had on our opengever.core (there we had to turn this off). It has picked up a few curiosities on 5.0 and 5.1, so I'd like people to give this a spin and check whether those are meaningful or not.
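A minimal sketch of that STDERR assertion, assuming batches run as subprocesses (the real wrapper streams output live rather than buffering it like this):

```python
import subprocess

def run_checked(cmd):
    """Run one test batch and treat any STDERR output as a failure,
    even if the process itself exits cleanly."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    clean = proc.returncode == 0 and not proc.stderr.strip()
    return clean, proc.stderr
```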

It also commits swift and total infanticide upon ^C. Thank you @jone for that one.

What is still missing is figuring out the required XML output for Jenkins to parse. @gforcada would you know where that stuff is defined (flushed to where and picked up on what condition)?

For giving it a spin, I recommend the branch based on 5.1, as STDERR on 5.2 is noisy with deprecation warnings (as is to be expected at this point). Check out the roto-testrunner-wrapper-51 branch and run bin/mtest. There is also a -d flag for more verbose output (including runtime stats per batch) and a -j flag for controlling the parallelism.

Current strategy:
time (export ROBOT_BROWSER='phantomjs' && bin/alltests --all; bin/alltests-at --all)

Parallel via zope.testrunner:
time (export ROBOT_BROWSER='phantomjs' && export NPROC="$(getconf _NPROCESSORS_ONLN)" && bin/alltests --all -j"$NPROC"; bin/alltests-at --all -j"$NPROC")

Normal mtest:
bin/mtest

Debug mtest with monotonic logging and streaming debug output:
bin/mtest -d

So how, why, what? The bin/alltests runners run things in groups, and the -j flag on zope.testrunner only splits by layer, not within layers. This means most of the time we'd be running 1..3 layers in parallel while a group runs, and most groups only have one layer in them. So this works by trading that idle CPU time for wasted, duplicated CPU time, gaining a wallclock runtime improvement in return. The only blocker to running all of the layers in parallel with each other seemed to be the ZSERVER_PORT allocation.
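As a sketch of that trade (helper names, the base port, and the flat port scheme are illustrative; the real wrapper can also take its ports from the Jenkins port allocator):

```python
import os
import subprocess
from multiprocessing.pool import ThreadPool

BASE_PORT = 10000  # illustrative; not the wrapper's actual scheme

def batch_command(index, test_names):
    """Build the bin/test invocation and environment for one batch;
    a unique ZSERVER_PORT per batch lets layers run side by side."""
    env = dict(os.environ)
    env["ZSERVER_PORT"] = str(BASE_PORT + index)
    cmd = ["bin/test"]
    for name in test_names:
        cmd += ["-t", name]
    return cmd, env

def run_batch(args):
    index, test_names = args
    cmd, env = batch_command(index, test_names)
    return subprocess.call(cmd, env=env)

def run_all(batches, jobs=4):
    # duplicated layer setups cost CPU time, but the overlap
    # buys wallclock time
    with ThreadPool(jobs) as pool:
        return pool.map(run_batch, enumerate(batches))
```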

If one runs all 4 variants from above for comparison, the ratio of wallclock to CPU time visualizes the core concept at play.

I'd eventually like to bring parallelization improvements in some form upstream to zope.testrunner, but that's a rather steep uphill at this point still.

Pain points:

  1. Layer setup duplication is a lot of unnecessary CPU time
    1.1 There are many ways around this, but all require further exploration
  2. The test discovery parsing from subprocess output is an ugly idea, but got me started and this far
  3. The log timestamps on the normal run are not monotonic as per how the logger flushes

It'd seem even as it is now, having CI runs complete in under 10 minutes should be rather possible, but there is still some work to be done to get to even that point. Feedback on the current status most welcome.


I'm starting to think I'm not yet done with the test discovery. My wrapper discovers 132 unique test layers. I was also overwriting the test classes list on duplicate layer names; that is now fixed on the 5.1 branch.

time (bin/alltests --all --list-tests; bin/alltests-at --all --list-tests) | grep -E '^Listing' | sort -u | wc -l

real	2m9,161s
user	1m9,805s
sys	0m58,832s

Which seems superficially ok, as running the test discoveries separately and filtering for unique layers yields the same count (and layers, for that matter).

time bin/alltests --all --list-tests | grep -E '^Listing' | sort -u | wc -l

real	0m57,124s
user	0m29,550s
sys	0m27,366s
time bin/alltests-at --all --list-tests | grep -E '^Listing' | sort -u | wc -l

real	1m8,522s
user	0m38,143s
sys	0m30,154s

Some of these layers are split between DX and AT tests in the way we currently run the tests. None of them seem to fail when run together, but is there a historical reason for having run them separately? Layer leakage issues?

time bin/test --all --list-tests | grep -E '^Listing' | sort -u | wc -l

real	0m12,826s
user	0m8,843s
sys	0m3,950s

What really catches my eye is bin/test finding 5 more layers in ~1/5 of the runtime.

bin/test --all --list-tests | grep -Po '(?<=^Listing ).*(?= tests:)' | sort -u > tests.txt
(bin/alltests --all --list-tests; bin/alltests-at --all --list-tests) | grep -Po '(?<=^Listing ).*(?= tests:)' | sort -u > alltests.txt
diff -u alltests.txt tests.txt
--- alltests.txt	2018-03-31 14:23:06.000000000 +0200
+++ tests.txt	2018-03-31 14:20:42.000000000 +0200
@@ -1,5 +1,9 @@
@@ -129,4 +133,5 @@

These seem legit, and taking just the first one as an example, they do seem to contain tests which, as far as I can see, we're currently not running.

Listing Products.CMFCore.testing.EventZCMLLayer tests:
  test_object_indexed_after_adding (Products.CMFCore.tests.test_CMFCatalogAware.CMFCatalogAware_CopySupport_Tests)
  test_object_indexed_after_copy_and_pasting (Products.CMFCore.tests.test_CMFCatalogAware.CMFCatalogAware_CopySupport_Tests)
  test_object_reindexed_after_cut_and_paste (Products.CMFCore.tests.test_CMFCatalogAware.CMFCatalogAware_CopySupport_Tests)
  test_object_reindexed_after_moving (Products.CMFCore.tests.test_CMFCatalogAware.CMFCatalogAware_CopySupport_Tests)
  test_object_unindexed_after_removing (Products.CMFCore.tests.test_CMFCatalogAware.CMFCatalogAware_CopySupport_Tests)

Currently I'm seeing if a simple ROBOT_BROWSER='phantomjs' bin/test --all -j"$(getconf _NPROCESSORS_ONLN)" would do the trick without the need for a more complex wrapper.

Rebased onto current 5.1, and it seems some recent changes have broken something - I cannot get tests running successfully by any method. Will thus resume this attack on the test runtimes most likely in July.

Thanks for your efforts!!

I will try to actually understand it if time allows, I'm quite swamped right now :worried:

@Rotonen I just wanted to install the buildout on the server and run the tests again, but currently buildout.coredev is broken because some package in the test environment wants to pull in Zope4:

  Installing robot.
Error: There is a version conflict.
We already have: Zope2 2.13.27
but Products.PluginRegistry 1.5 requires 'Zope2>=4.0a3'.
but five.globalrequest 99.1 requires 'Zope2>=4.0a3'.

I don't feel like fixing this ATM.
I have to postpone the test run.

We should probably schedule a call at some point in any case as I'd like to help with the whole CI stack more.

That's music to my ears :musical_note: Let's talk by mail :email:

Buildout issues were fixed magically by time.

These are my test results with phantomjs installed and used:

real	11m58.971s
user	143m55.820s
sys	11m53.728s

There are a number of tests failing due to Archetypes not being installed, screenshots not being made, and so forth. But not many.

Thank you, nice to have data.

There is one large batch of Robot tests which is limiting the runtime gain.

Any Archetypes issues are new to me. Do you have example test output?

You can run with Chrome (or Firefox) by installing the respective driver:

It will automatically pick up and use, in order: chromedriver, geckodriver, phantomjs. You should also install xvfb.

I keep force pushing on the 5.1 based branch, so which revision did you run on? I've gotten a sub-8min runtime on a 12 core machine on the latest one yesterday. If it splits large layers into numbered batches and runs those largest-to-smallest by test count, you're up to date.

PhantomJS might leave behind a lot of processes; double-check and killall phantomjs before you run into issues with that.

I'm also considering an --all flag for it so people could run it without the slow and flaky Robot tests as well.


So, I think I've managed to get to a point of having a useful PoC. Next up would be cleanup and actually trying to use it on the 5.2 branch.

Summary of what and how:

  • Discovery of tests via a subprocess
  • Chunking into testsuite - group - layer - testclasses (testcounts)
  • Recursively zipper-unweaving sorted-by-testcount large layers
    • Aiming for a chunk size of 256 tests per chunk
      • Robot tests into chunks of 16
  • Running these via bin/test in a multiprocessing.pool
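The splitting step could be sketched roughly like this (`unweave` is a hypothetical helper; the real batching logic, including the later rejoining of small chunks, is more involved):

```python
def unweave(classes_with_counts, chunk_size=256):
    """Split one layer's (testclass, testcount) pairs into batches of
    roughly chunk_size tests by dealing the classes, largest first,
    round-robin across just enough chunks ('zipper-unweaving')."""
    total = sum(count for _, count in classes_with_counts)
    n_chunks = max(1, -(-total // chunk_size))  # ceiling division
    chunks = [[] for _ in range(n_chunks)]
    ordered = sorted(classes_with_counts, key=lambda item: -item[1])
    for i, (cls, _count) in enumerate(ordered):
        chunks[i % n_chunks].append(cls)
    return chunks
```

Robot test layers would use chunk_size=16 instead of the default 256.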

Reflecting on how we're currently running our tests, I've made it have --at, --dx and --robot flags out of which you have to pick at least one. Thank you @gforcada for access to the Jenkins to take a peek, this helped me out a lot.

I've done test runs on the new Digital Ocean 'high CPU' instances. It seems they randomly provision you an instance backed by either a 4th-generation or an 8th-generation Xeon. The logs thus also provide a nice demonstration of generational gains (~15% .. ~20%, as expected). Do note a 'vCPU' is actually a potentially shared hyperthread, so I'd say 4 vCPUs match one real CPU core in linear performance, and it's very hard to tell for parallel performance.

Key takeaways:

  • The way the alltests wrappers parallelize only per layer within a group limits the gain
    • This is a gain we could have now, though, and IMO should also go for
      • For DX tests only, as only they have a group with many layers
  • The alltests wrappers do not allocate ZServer ports when running tests in parallel
    • Robot tests cannot be parallelized, they will all fail as they try to allocate the same port
  • I did not manage to get chromedriver to work for me
    • It kept going zombie, after which all successive test runs failed
  • Firefoxdriver makes the tests flaky (as they currently are), but at least runs the tests reliably
  • With 1 .. 2 threads the inherent penalty of redoing the test discovery and layer setup so many times 'unnecessarily' makes the current strategy faster
  • With 4 .. 6 threads the approaches are basically neck to neck
  • My wrapper has essentially linear gains until 32 threads in parallel
    • After this it starts to be limited by the longest running layer chunk

If anyone else wants to give this a spin, the roto-testrunner-wrapper-51 branch has bin/mtest and it'll tell you how to run it.

I also have the benchmark script and a Debian server setup script on that branch for convenience for now:

For whatever reason the benchmark script (and sometimes mtest itself) stalls on the -j 1 runs on macOS - ideas and feedback welcome on that one.

@thet can you run the benchmark script for me on your hardware and provide the log? It takes some 3h .. 5h to run, so remember to nice the run. It's also not very nice to kill, as it does not propagate ^C through all too well.

@tisto you seem to have the knowhow and a sprint coming up soon - could you take a stab at sanitizing how the Robot tests run? Currently it wastes a lot of time by shutting down the browser all the time.


You can control the ports ZServer runs on and what port Robot connects to. The snippet below is part of my setup using Docker on CircleCI, which is mostly not relevant here (it's not parallel either), but it shows how you can control the ports using env variables. I think setting ROBOT_PLONE_URL, ZSERVER_HOST and ZSERVER_PORT should do what you need?

version: '3'
services:
  # structure and some key names reconstructed; the service names
  # 'plominotest' and 'robot' are inferred from the URLs they use
  selenium:
    image: selenium/standalone-chrome
    volumes:
      - /dev/shm:/dev/shm
    privileged: true

  plominotest:
    image: robot_tests
    environment:
      - ROBOT_REMOTE_URL=http://selenium:4444/wd/hub
      - ROBOT_PLONE_URL=http://plominotest:55001/plone
    volumes:
      - /dev/shm:/dev/shm
    command: bin/test --all -vv
    depends_on:
      - "selenium"

  robot-server:
    image: robot_tests
    depends_on:
      - selenium
    command: bin/robot-server --no-reload -v Products.CMFPlomino.testing.PRODUCTS_CMFPLOMINO_ACCEPTANCE_TESTING
    volumes:
      - /dev/shm:/dev/shm
    privileged: true
    environment:
      - ZSERVER_HOST=robot-server
      - ZSERVER_PORT=55001
      - LISTENER_PORT=49999
      - LISTENER_HOST=robot-server

  robot:
    image: robot_tests
    depends_on:
      - selenium
      - robot-server
    environment:
      - LISTENER_PORT=49999
      - LISTENER_HOST=robot-server
    command: bin/robot --outputdir=/buildout/parts/test --variable=REMOTE_URL:http://selenium:4444/wd/hub --variable=PLONE_URL:http://robot-server:55001/plone /buildout/src/Products/CMFPlomino/tests/robot/test_*.robot
    volumes:
      - /dev/shm:/dev/shm
      # doesn't work with remote docker
      - ./test/testoutput:/buildout/parts/test
      - ./src/Products/CMFPlomino/tests/:/buildout/src/Products/CMFPlomino/tests
    privileged: true

The Selenium grid stuff might be of interest too, or at least using Docker for the Selenium part. It's way easier than getting drivers etc. working. The Selenium grid, I think, can create a bunch of browsers to be used as a shared resource when running parallel tests. I'm not running anything in parallel, so I didn't need the added memory and switched to the standalone image above, but the config below worked fine.

#  hub:
#    image: elgalu/selenium:2
#    ports:
#      - 4444:4444
#      # We need a fixed port range to expose VNC
#      # due to a bug in Docker for Mac beta (1.12)
#      #
#      - ${VNC_FROM_PORT-40650}-${VNC_TO_PORT-40700}:${VNC_FROM_PORT-40650}-${VNC_TO_PORT-40700}
#      # e.g. (hard-coded)
#      # - 40650-40700:40650-40700
#    volumes:
#      - /dev/shm:/dev/shm
#    privileged: true
#    environment:
#      - SELENIUM_HUB_HOST=hub
#      - SELENIUM_HUB_PORT=4444
#      - GRID=true
#      - CHROME=false
#      - FIREFOX=false
#      - MOCK_SERVER_HOST=mock
#  chrome:
#    image: elgalu/selenium:2
#    depends_on:
#      - hub
#    volumes:
#      - /dev/shm:/dev/shm
#    privileged: true
#    environment:
#      - SELENIUM_HUB_HOST=hub
#      - SELENIUM_HUB_PORT=4444
#      - VNC_FROM_PORT=${VNC_FROM_PORT-40650}
#      - VNC_TO_PORT=${VNC_TO_PORT-40700}
#      - SCREEN_WIDTH=1300
#      - SCREEN_HEIGHT=999
#      - VIDEO=${VIDEO-false}
#      - GRID=false
#      - CHROME=true
#      - FIREFOX=false
#  firefox:
#    image: elgalu/selenium:2
#    depends_on:
#      - hub
#    volumes:
#      - /dev/shm:/dev/shm
#    privileged: true
#    environment:
#      - SELENIUM_HUB_HOST=hub
#      - SELENIUM_HUB_PORT=4444
#      - VNC_FROM_PORT=${VNC_FROM_PORT-40650}
#      - VNC_TO_PORT=${VNC_TO_PORT-40700}
#      - SCREEN_WIDTH=1300
#      - SCREEN_HEIGHT=999
#      - VIDEO=${VIDEO-false}
#      - GRID=false
#      - CHROME=false
#      - FIREFOX=true

Which is exactly what I'm doing per batch in the mtest wrapper.

Which is why one can successfully run bin/mtest --robot -j 16, while running via the alltests wrapper will fail, as it does not set the port.

ROBOTSUITE_PREFIX=ONLYROBOT ROBOT_BROWSER=firefox xvfb-run bin/alltests --xml --all -t ONLYROBOT -j 16

The Robot-related variables are set based on the ZServer port, so they do not need to be set explicitly. If one needs more than one test run in parallel on the same machine, one can use the Jenkins port allocator to achieve this.
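Per batch it is then enough to set the port, roughly like so (the ALLOCATED_ZSERVER_PORT variable name and the default base are assumptions for illustration, not the wrapper's actual scheme):

```python
import os

def batch_env(batch_index, base=None):
    """Environment for one batch: only ZSERVER_PORT needs setting,
    since the Robot variables are derived from it; the base port can
    come from the Jenkins port allocator so several builds can share
    one CI server."""
    if base is None:
        base = int(os.environ.get("ALLOCATED_ZSERVER_PORT", "10000"))
    env = dict(os.environ)
    env["ZSERVER_PORT"] = str(base + batch_index)
    return env
```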

Or alternatively one could make use of the fact that localhost is an entire /8 subnet and just have the host set per run.

I do not see a benefit in this, as you still have to set up a working version-pinned environment irrespective of whether it is within a container or not. As for the 'grid', we already have multiple Jenkins jobs and might as well just add a new one on top for testing against a different WebDriver if desired, as I've gotten the Robot test runs down to the sub-3min regime (with potential for further gains if the tests get redone properly).

Sorry, I must have misunderstood you then. You seemed to be saying you couldn't run the Robot tests in parallel, and I didn't see where you were setting ROBOT_PLONE_URL, which is the bit you need to tell Robot to use the new ZServer port. But if it's all working, then just ignore me.
In addition, by using ROBOT_REMOTE_URL and the grid stuff below, you can use the same port for Selenium and it just assigns browsers to each test. I think the default standalone uses a fixed port, which could be a problem otherwise.

Not for Plone as such, but using Selenium via Docker is way easier. Plus you can run different browsers really easily, incl. IE. There is no reason why you can't have Plone outside the container and just use the remote connection to Selenium. The idea is that you separate the env running the browser from the one running the tests.

My thought was it could help with the browser startup and shutdown issue you mentioned, but I'm not sure if it does or doesn't. The docs imply it might, but one would have to test it.

"In addition, one can have a grid of all the same browser, type and version. For instance, one could have a grid of 4 machines each running 3 instances of Firefox 12, allowing for a ‘server-farm’ (in a sense) of available Firefox instances. When the suite runs, each test is passed to Selenium-Grid which assigns the test to the next available Firefox instance. In this manner one gets test pass where conceivably 12 tests are all running at the same time in parallel, significantly reducing the time required to complete a test pass."

No worries. I'm trying to convey a complex, nuanced mess in a non-native language, so I try to take this with patience.

As far as I can see, ZSERVER_PORT is used in constructing this before it gets set, so minimally I only have to set that (or the host) to have uniques per batch. I've opted to use ports rather than hosts, as the former is better supported by CI tools (and also works trivially as an approach across a wider range of systems).

The tests seem to explicitly close the browser at the end of each test.

And as for changing the testing infrastructure in regards to Selenium, the feedback and experience you share is most welcome, thank you. It'll not be a short road to change any of that. For now I'm tackling the lowest hanging fruit (from my point of view). I will get back to this later in the year.

I'll try to get this way of running the tests into a mergeable state just before or during Sauna Sprint in July.

Before that I'll try to figure out how to get the alltests wrapper parallelization gain for current DX runs. Likewise someone better versed in the art of Selenium can tackle the Robot tests. These alone could take 20min to 30min off the test runtimes.


I did a stupid

And suddenly

Key takeaways:

  1. There is a measurable impact from ordering the test layers for parallelisation
  2. We need to split the slow layers
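The ordering gain in point 1 is essentially the classic longest-processing-time-first heuristic; a sketch, assuming per-batch runtimes are known from an earlier run:

```python
import heapq

def lpt_schedule(batch_times, workers):
    """Assign each (name, seconds) batch, longest first, to the
    currently least loaded worker; returns the assignment and the
    resulting wallclock estimate (makespan)."""
    heap = [(0.0, w) for w in range(workers)]
    heapq.heapify(heap)
    assignment = {w: [] for w in range(workers)}
    for name, secs in sorted(batch_times, key=lambda b: -b[1]):
        load, w = heapq.heappop(heap)
        assignment[w].append(name)
        heapq.heappush(heap, (load + secs, w))
    makespan = max(load for load, _ in heap)
    return assignment, makespan
```

Point 2 then matters because the makespan can never drop below the single largest batch, no matter how many workers there are.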

So, I guess I'll have to polish up the mtest script(s) to get us to the 10min builds.

Preemptive feedback welcome on sore points.

And after adding another layer of complexity, namely rejoining and then resplitting the split fluff layers into larger sequential runs to deduplicate the layer setups:

2018-10-14 13:16:38,753 - INFO - No failed tests.
2018-10-14 13:16:38,753 - INFO - Ran 8521 tests.
2018-10-14 13:16:38,753 - INFO - Aggregate runtime 1 hour 22 minutes 9 seconds.
2018-10-14 13:16:38,753 - INFO - Wallclock runtime 8 minutes 52 seconds.

This is on Python 3.7.0 on the current 5.2 branch, and has run all the tests, including the Robot tests. With a bit more gardening of the exact order and batch splitting, one could probably golf that down to about 6 minutes, but this should be good enough as a proof of concept at this point. With redoing the end-to-end tests in a sane way that allows for not closing the browser all the time, we could actually get into the sub-5-minute ballpark, but that's another nuanced battle on its own.

Now, the hard bit is cleaning up the spaghetti into a state where we could actually use this. It'll work with Python 2 builds as well, since it just subprocesses bin/test, but it does need Python 3 to run.

Download to your bin/ and give it a spin.


And many thanks to everyone at the Halle sprint who killed enough test, test class and layer leaks to liberate the test run order to the point where I can pull this off. Especially to @pbauer, who put up with my demand of doing just bin/test --all for the Python 3 tests and leaving the isolation failure grouping of the bin/alltests wrappers behind.

Before the sprint, everything majorly fell apart when run even slightly out of sequence of those hand-gardened groupings.

But do not worry: after I get the builds quick, I plan to resume shaking things hard with nightly and weekly builds to tease out more of the isolation issues I know are still about.

