[all] Gate resources and performance
Brian Haley
haleyb.dev at gmail.com
Fri Feb 5 16:02:46 UTC 2021
On 2/4/21 12:28 PM, Dan Smith wrote:
> Hi all,
>
> I have become increasingly concerned with CI performance lately, and
> have been raising those concerns with various people. Most specifically,
> I'm worried about our turnaround time or "time to get a result", which
> has been creeping up lately. Right after the beginning of the year, we
> had a really bad week where the turnaround time was well over 24
> hours. That means if you submit a patch on Tuesday afternoon, you might
> not get a test result until Thursday. That is, IMHO, a real problem and
> massively hurts our ability to quickly merge priority fixes as well as
> just general velocity and morale. If people won't review my code until
> they see a +1 from Zuul, and that is two days after I submitted it,
> that's bad.
Thanks for raising the issue, Dan. I've definitely been hit by this
myself.
> Now, obviously nobody wants to run fewer tests on patches before they
> land, and I'm not really suggesting that we take that approach
> necessarily. However, I think there are probably a lot of places that we
> can cut down the amount of *work* we do. Some ways to do this are:
>
> 1. Evaluate whether or not you need to run all of tempest on two
> configurations of a devstack on each patch. Maybe having a
> stripped-down tempest (like just smoke) to run on unique configs, or
> even specific tests.
> 2. Revisit your "irrelevant_files" lists to see where you might be able
> to avoid running heavy jobs on patches that only touch something
> small.
> 3. Consider moving some jobs to the experimental queue and run them
> on-demand for patches that touch particular subsystems or affect
> particular configurations.
> 4. Consider some periodic testing for things that maybe don't need to
> run on every single patch.
> 5. Re-examine tests that take a long time to run to see if something can
> be done to make them more efficient.
> 6. Consider performance improvements in the actual server projects,
> which also benefits the users.
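For item 2, a rough sketch of what such a list can look like in a job
definition. In Zuul the attribute is spelled "irrelevant-files". The job
name and the patterns here are illustrative assumptions, not taken from
any particular repo:

```yaml
# Hypothetical job variant: skip the heavy job when a change only
# touches documentation or release notes.
- job:
    name: example-tempest-full        # assumed name, for illustration only
    parent: tempest-full-py3          # assumed parent job
    irrelevant-files:
      - ^doc/.*$
      - ^releasenotes/.*$
      - ^.*\.rst$
```

Each entry is a regular expression matched against the files a change
modifies. The job is skipped only when every modified file matches one
of them.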
There's another little-used feature of Zuul called "fail fast". We use
it for the gate jobs in the Octavia* repos:
    project:
      gate:
        fail-fast: true
The description is:
    Zuul now supports :attr:`project.<pipeline>.fail-fast` to immediately
    report and cancel builds on the first failure in a buildset.
I feel it's useful for gate jobs since they've already gone through the
check queue and typically shouldn't fail. For example, a mirror failure
should stop things quickly, since the next action will most likely be a
'recheck' anyways.
And thinking along those lines, I remember a discussion years ago about
having a 'canary' job [0] (credit to Gmann and Jeremy). Is a multi-stage
pipeline, where the 'low impact' jobs (pep8, unit, functional, docs) run
first and things like Tempest run only if they pass, more palatable
now? I realize there are some downsides, but it mostly penalizes those
who have not run the simple checks locally before pushing out a review.
Just wanted to throw it out there.
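One way this could be expressed, as a rough sketch only, is with Zuul's
job "dependencies" attribute, so the heavy job starts only after the
cheap ones succeed. The job names below are assumptions for illustration
and not a proposal for any specific project:

```yaml
# Hypothetical two-stage check pipeline: Tempest runs only if the
# cheap jobs it depends on have all passed.
- project:
    check:
      jobs:
        - openstack-tox-pep8
        - openstack-tox-py38
        - openstack-tox-docs
        - tempest-full-py3:
            dependencies:
              - openstack-tox-pep8
              - openstack-tox-py38
              - openstack-tox-docs
```

The trade-off is the one mentioned above: a patch that passes the cheap
jobs waits for them to finish before Tempest even starts, so its total
turnaround gets longer.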
-Brian
[0] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000755.html