On 2/4/21 12:28 PM, Dan Smith wrote:
Hi all,
I have become increasingly concerned with CI performance lately, and have been raising those concerns with various people. Most specifically, I'm worried about our turnaround time or "time to get a result", which has been creeping up lately. Right after the beginning of the year, we had a really bad week where the turnaround time was well over 24 hours. That means if you submit a patch on Tuesday afternoon, you might not get a test result until Thursday. That is, IMHO, a real problem and massively hurts our ability to quickly merge priority fixes as well as just general velocity and morale. If people won't review my code until they see a +1 from Zuul, and that is two days after I submitted it, that's bad.
Thanks for raising this, Dan. I've definitely been hit by it myself.
Now, obviously nobody wants to run fewer tests on patches before they land, and I'm not really suggesting that we take that approach necessarily. However, I think there are probably a lot of places that we can cut down the amount of *work* we do. Some ways to do this are:
1. Evaluate whether or not you need to run all of tempest on two configurations of a devstack on each patch. Maybe run a stripped-down tempest (like just smoke) on unique configs, or even just specific tests.
2. Revisit your "irrelevant_files" lists to see where you might be able to avoid running heavy jobs on patches that only touch something small.
3. Consider moving some jobs to the experimental queue and running them on demand for patches that touch particular subsystems or affect particular configurations.
4. Consider some periodic testing for things that maybe don't need to run on every single patch.
5. Re-examine tests that take a long time to run to see if something can be done to make them more efficient.
6. Consider performance improvements in the actual server projects, which also benefit the users.
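As a rough illustration of items 2 and 3, here is a minimal Zuul sketch. The job names, the parent job, and the file patterns are all assumptions made up for the example, not taken from any real repo. Note that the Zuul attribute is spelled `irrelevant-files`, and that the experimental pipeline is normally triggered on demand by a review comment (commonly "check experimental", though that depends on the deployment).

```yaml
# Sketch only: job names, parent, and patterns are hypothetical.
- job:
    name: example-tempest-full
    parent: tempest-full-py3        # assumed parent; use your project's base job
    # Skip this heavy job when a patch touches only these files.
    irrelevant-files:
      - ^.*\.rst$
      - ^doc/.*$
      - ^releasenotes/.*$

- job:
    name: example-tempest-alt-config
    parent: example-tempest-full    # hypothetical second configuration

- project:
    check:
      jobs:
        - example-tempest-full
    # Not run on every patch; only when someone asks for it.
    experimental:
      jobs:
        - example-tempest-alt-config
```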
There's another little-used Zuul feature called "fail fast". It's something used in our gate jobs in the Octavia* repos:

    project:
      gate:
        fail-fast: true

The description is:

    Zuul now supports :attr:`project.<pipeline>.fail-fast` to immediately report and cancel builds on the first failure in a buildset.

I feel it's useful for gate jobs since they've already gone through the check queue and typically shouldn't fail. For example, a mirror failure should stop things quickly, since the next action will most likely be a 'recheck' anyway.

Thinking along those lines, I remember a discussion years ago about having a 'canary' job [0] (credit to Gmann and Jeremy). Is a multi-stage pipeline, where the 'low impact' jobs (pep8, unit, functional, docs) run first and things like Tempest run only if they pass, more palatable now? I realize there are some downsides, but it mostly penalizes those who have failed to run the simple checks locally before pushing out a review.

Just wanted to throw it out there.

-Brian

[0] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000755....
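For reference, one way the staged ('canary') idea above might be expressed is with Zuul's per-job `dependencies` attribute, which holds a job back until the jobs it lists have succeeded. This is only a sketch: the job list is an assumption chosen for illustration, not the configuration of any particular project.

```yaml
# Sketch only: the job list is illustrative.
- project:
    check:
      jobs:
        # Cheap jobs start immediately.
        - openstack-tox-pep8
        - openstack-tox-py38
        # The expensive job is not started until both cheap jobs succeed.
        - tempest-full-py3:
            dependencies:
              - openstack-tox-pep8
              - openstack-tox-py38
```

The trade-off is the one already mentioned: a patch that passes everything takes longer end to end, because the heavy job waits for the cheap ones to finish first.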