[all] Gate resources and performance

Brian Haley haleyb.dev at gmail.com
Fri Feb 5 16:02:46 UTC 2021


On 2/4/21 12:28 PM, Dan Smith wrote:
> Hi all,
> 
> I have become increasingly concerned with CI performance lately, and
> have been raising those concerns with various people. Most specifically,
> I'm worried about our turnaround time or "time to get a result", which
> has been creeping up lately. Right after the beginning of the year, we
> had a really bad week where the turnaround time was well over 24
> hours. That means if you submit a patch on Tuesday afternoon, you might
> not get a test result until Thursday. That is, IMHO, a real problem and
> massively hurts our ability to quickly merge priority fixes as well as
> just general velocity and morale. If people won't review my code until
> they see a +1 from Zuul, and that is two days after I submitted it,
> that's bad.

Thanks for raising this, Dan. I've definitely been hit by it myself.

> Now, obviously nobody wants to run fewer tests on patches before they
> land, and I'm not really suggesting that we take that approach
> necessarily. However, I think there are probably a lot of places that we
> can cut down the amount of *work* we do. Some ways to do this are:
> 
> 1. Evaluate whether or not you need to run all of tempest on two
>     configurations of a devstack on each patch. Maybe having a
>     stripped-down tempest (like just smoke) to run on unique configs, or
>     even specific tests.
> 2. Revisit your "irrelevant_files" lists to see where you might be able
>     to avoid running heavy jobs on patches that only touch something
>     small.
> 3. Consider moving some jobs to the experimental queue and run them
>     on-demand for patches that touch particular subsystems or affect
>     particular configurations.
> 4. Consider some periodic testing for things that maybe don't need to
>     run on every single patch.
> 5. Re-examine tests that take a long time to run to see if something can
>     be done to make them more efficient.
> 6. Consider performance improvements in the actual server projects,
>     which also benefits the users.
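
For what it's worth, items 2 and 3 might look roughly like the sketch 
below in a project's Zuul configuration. The job name 
"example-tempest-full" and the file patterns are made up for 
illustration; only the `irrelevant-files` job attribute and the 
`experimental` pipeline come from the existing setup.

```yaml
# Hypothetical job: skip the heavy run when a patch only touches docs.
- job:
    name: example-tempest-full
    parent: tempest-full-py3
    irrelevant-files:
      - ^.*\.rst$
      - ^doc/.*$
      - ^releasenotes/.*$

# Hypothetical project stanza: run the job only on demand.
- project:
    experimental:
      jobs:
        - example-tempest-full
```

Jobs in the experimental pipeline only run when someone leaves a 
"check experimental" comment on the review.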

There's another little-used Zuul feature called "fail fast". We use it 
in the gate jobs of the Octavia* repos:

- project:
    gate:
      fail-fast: true

Zuul describes it as:

   Zuul now supports :attr:`project.<pipeline>.fail-fast` to immediately
   report and cancel builds on the first failure in a buildset.

I feel it's useful for gate jobs, since they've already gone through the 
check queue and typically shouldn't fail.  For example, a mirror failure 
should stop the whole buildset quickly, since the next action will most 
likely be a 'recheck' anyway.

Thinking along those lines, I remember a discussion years ago about 
having a 'canary' job [0] (credit to Gmann and Jeremy).  Is a 
multi-stage pipeline more palatable now, where the 'low impact' jobs 
(pep8, unit, functional, docs) run first and things like Tempest run 
only if they pass?  I realize there are some downsides, but it mostly 
penalizes those who have not run the simple checks locally before 
pushing out a review.  Just wanted to throw it out there.
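
One way such staging could be expressed is with Zuul's job 
`dependencies`. This is only a sketch, and the job names are picked for 
illustration, not a proposal for any particular repo:

```yaml
- project:
    check:
      jobs:
        # Cheap jobs start right away.
        - openstack-tox-pep8
        - openstack-tox-py38
        # The heavy job starts only after both cheap jobs succeed;
        # if either one fails, it is skipped.
        - tempest-full-py3:
            dependencies:
              - openstack-tox-pep8
              - openstack-tox-py38
```

The trade-off is that a passing patch takes longer overall, since the 
heavy job can't start until the cheap ones finish.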

-Brian

[0] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000755.html


