On Wed, Feb 26, 2025, at 1:57 AM, Sławek Kapłoński wrote:
Hi,
In one of the recent weekly TC meetings we discussed the usage of our CI resources by projects. After that discussion I wrote a small script [1] to get some more data about it.
Snip
This email is of course not sent just to tell teams to cut their testing coverage, but if you can take a closer look at your project's CI job configuration, maybe there is some way to improve it easily. For example, if you have jobs that have been running as non-voting for a long time, maybe you can consider moving them to the experimental or periodic queue instead of running them in the check queue for every patch (or make them voting). This may be one of those small steps to optimize things a bit and make our own lives easier, as less load on the infra means more stable jobs in general.
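To make that concrete, here is a rough sketch of what that kind of move could look like in a project's .zuul.yaml. The job names below are made up and your project's pipeline layout may differ; the idea is simply to stop attaching the long-standing non-voting job to check and attach it to periodic (or experimental) instead:

  - project:
      check:
        jobs:
          # the long-standing non-voting job was removed from here, so it
          # no longer consumes nodes on every revision of every patch
          - my-project-unit-tests
      gate:
        jobs:
          - my-project-unit-tests
      periodic:
        jobs:
          # still runs on a schedule (daily for the standard periodic
          # pipeline), so the team keeps some signal from it
          - my-project-ft-long-nonvoting

If I remember the OpenDev pipeline definitions correctly, jobs in the experimental pipeline only run when someone leaves a "check experimental" comment on a change, so that queue is a good fit for jobs that are mainly useful on demand.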
To expand a bit on this I'm going to use an example from Tacker, but I think these issues aren't unique to that project. If we look at https://review.opendev.org/c/openstack/tacker/+/942337 we can see there are ~33 tacker-ft-* jobs that ran. Each of these uses 3 or 4 nodes. The jobs all failed with a RETRY_LIMIT error, which means they were each attempted 3 times before Zuul gave up. This happens because there is a consistent failure in the jobs' pre-run playbooks. In this case the issue appears to be Tacker using an undefined Singleton object from oslo.service.

I think there are three different things we can do to improve the situation for Tacker.

1) Stop running Devstack setup within pre-run. Pre-run playbooks should be used to set up the test environment in ways that aren't directly affected by the code under test. Zuul retries a job when its pre-run fails, so keeping project-dependent setup out of pre-run means the job is attempted once rather than three times when the project itself is broken. Some projects (like Nodepool) do run Devstack in pre-run. This is ok because I can make any change to Nodepool and it will not break Devstack. That isn't the case with Tacker (and probably others). A rough sketch of what this could look like is included at the end of this email.

2) Consider combining some of these similar tacker-ft-* jobs into fewer jobs. If you look at successful runs of these jobs, some of them appear to run with very similar configs and then simply run a different set of test cases at the very end of the job. In those cases, even if we ran the test cases from multiple jobs in a single job, the total test case runtime would still be much shorter than the setup cost we pay for each job.

3) Consider whether each job really needs 4 nodes. Remember that multinode testing is effectively a multiplier on the total cost of the job. We should use the bare minimum we can get away with. This has other upsides, including making it easier for people to reproduce failures locally should they need to.

I think any one of these improvements would be a great benefit, but together the impact would be quite large. To illustrate this, we currently use somewhere between 3 * 33 * 3 = 297 and 4 * 33 * 3 = 396 nodes for each run on this one change. If we implement 1) we get between 3 * 33 = 99 and 4 * 33 = 132 nodes. With 2), if we halve the number of jobs, we get 3 * 17 * 3 = 153 to 4 * 17 * 3 = 204. With 3), if we can get away with 3 nodes in each job, we get the floor of each of these ranges. Finally, if we do some combination of 1), 2), and 3) we get 3 * 17 = 51 nodes. That is roughly 1/6th (or less) of the previous total resource consumption.
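To sketch what 1) and 3) might look like (with a nod to 2) via a combined test selection), here is a rough, hypothetical job definition. I haven't dug into how the tacker-ft-* jobs are actually structured, so the job name, parent, playbook paths, and the tempest_test_regex variable below are illustrative rather than taken from the Tacker repo; the point is just where the Devstack/test work lives and how big the nodeset is:

  - job:
      name: my-project-ft-combined
      parent: my-project-functional-base   # hypothetical base job
      # pre-run only does setup that the project's own code cannot break;
      # a failure here causes Zuul to retry the job (up to 3 attempts)
      pre-run: playbooks/prepare-nodes.yaml
      # Devstack setup and the tests themselves live in run, so a failure
      # caused by the code under test fails once instead of being retried
      run: playbooks/devstack-and-functional-tests.yaml
      post-run: playbooks/collect-logs.yaml
      vars:
        # combine the test selections from several near-identical jobs so
        # the expensive setup happens once per change rather than once per job
        tempest_test_regex: '(test_group_a|test_group_b)'
      # three nodes instead of four; every extra node multiplies the job's cost
      nodeset:
        nodes:
          - name: controller
            label: ubuntu-jammy
          - name: compute1
            label: ubuntu-jammy
          - name: compute2
            label: ubuntu-jammy

Even if the exact shape ends up different for Tacker, those are the three knobs to look at: what runs in pre-run vs run, how many jobs repeat the same setup, and how many nodes each nodeset requests.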