On 2021-02-05 11:02:46 -0500 (-0500), Brian Haley wrote: [...]
There's another little-used feature of Zuul called "fail-fast"; it's something used in the Octavia* repos in our gate jobs:
  project:
    gate:
      fail-fast: true
Description is:
Zuul now supports :attr:`project.<pipeline>.fail-fast` to immediately report and cancel builds on the first failure in a buildset.
I feel it's useful for gate jobs since they've already gone through the check queue and typically shouldn't fail. For example, a mirror failure should stop the rest of the buildset quickly, since the next action will most likely be a 'recheck' anyway.
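For reference, a project stanza along these lines (untested; the job names here are just illustrative placeholders) is roughly how it would look in an in-repo .zuul.yaml:

  - project:
      gate:
        fail-fast: true
        jobs:
          # Placeholder job names; any gate jobs behave the same way,
          # the whole buildset is reported and cancelled on the first failure.
          - openstack-tox-pep8
          - octavia-v2-dsvm-scenario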
And thinking along those lines, I remember a discussion years ago about having a 'canary' job [0] (credit to Gmann and Jeremy). Is a multi-stage pipeline, where the 'low impact' jobs (pep8, unit, functional, docs) run first and heavier jobs like Tempest only start once those pass, more palatable now? I realize there are some downsides, but it mostly penalizes those who have failed to run the simple checks locally before pushing out a review. Just wanted to throw it out there.
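Something like the following (untested; job names are just placeholders) could express that ordering within a single pipeline using Zuul's job dependencies:

  - project:
      check:
        jobs:
          # Fast, 'low impact' jobs run first.
          - openstack-tox-pep8
          - openstack-tox-py38
          - openstack-tox-docs
          # The heavier Tempest job only starts once the fast jobs succeed.
          - tempest-full-py3:
              dependencies:
                - openstack-tox-pep8
                - openstack-tox-py38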
-Brian
[0] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000755....
The fundamental downside to these sorts of defensive approaches is that they make it easier to avoid solving the underlying issues. We've designed Zuul to perform most efficiently when it's running tests which are deterministic and mostly free of "false negative" failures. Making your tests and the software being tested efficient and predictable maximizes CI throughput under such an optimistic model.

Sinking engineering effort into workarounds for unstable tests and buggy software is time which could have been invested in improving things instead, and to a great extent it also removes the incentive to bother. Sure, it could be seen as a pragmatic approach, accepting that in a large software ecosystem such seemingly pathological problems are inevitable, but that strikes me as a bit defeatist.

There will of course always be temporary problems resulting from outages/incidents in donated resources or regressions in external dependencies outside our control. But if our background failure rate were significantly reduced, a new problem would show up as an order-of-magnitude jump in failures which is easy to spot and mitigate quickly, rather than a sudden 25% uptick whose cause is hard to track down.
--
Jeremy Stanley