On 2021-02-05 11:02:46 -0500 (-0500), Brian Haley wrote: [...]
There's another little-used feature of Zuul called "fail-fast"; it's something used in the Octavia* repos in our gate jobs:
  project:
    gate:
      fail-fast: true
Description is:
Zuul now supports :attr:`project.<pipeline>.fail-fast` to immediately report and cancel builds on the first failure in a buildset.
I feel it's useful for gate jobs since they've already gone through the check queue and typically shouldn't fail. For example, a mirror failure should stop the rest of the buildset quickly, since the next action will most likely be a 'recheck' anyway.
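For reference, a project stanza along these lines (untested; the job names here are just illustrative placeholders) is roughly how it would look in an in-repo .zuul.yaml:

  - project:
      gate:
        fail-fast: true
        jobs:
          # Placeholder job names; any gate jobs behave the same way,
          # the whole buildset is reported and cancelled on the first failure.
          - openstack-tox-pep8
          - octavia-v2-dsvm-scenario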
And thinking along those lines, I remember a discussion years ago about having a 'canary' job [0] (credit to Gmann and Jeremy). Is a multi-stage pipeline, where the 'low impact' jobs (pep8, unit, functional, docs) run first and heavier jobs like Tempest only start once those pass, more palatable now? I realize there are some downsides, but it mostly penalizes those who have failed to run the simple checks locally before pushing out a review. Just wanted to throw it out there.
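Something like the following (untested; job names are just placeholders) could express that ordering within a single pipeline using Zuul's job dependencies:

  - project:
      check:
        jobs:
          # Fast, 'low impact' jobs run first.
          - openstack-tox-pep8
          - openstack-tox-py38
          - openstack-tox-docs
          # The heavier Tempest job only starts once the fast jobs succeed.
          - tempest-full-py3:
              dependencies:
                - openstack-tox-pep8
                - openstack-tox-py38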
-Brian
[0] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000755....
The fundamental downside to these sorts of defensive approaches is that they make it easier to avoid solving the underlying issues. We've designed Zuul to perform most efficiently when it's running tests which are deterministic and mostly free of "false negative" failures. Making your tests and the software being tested efficient and predictable maximizes CI throughput under such an optimistic model.

Sinking engineering effort into workarounds for unstable tests and buggy software is time which could have been invested in improving things instead, and to a great extent it also removes the incentive to bother. Sure, it could be seen as a pragmatic approach, accepting that in a large software ecosystem such seemingly pathological problems are inevitable, but that strikes me as a bit defeatist.

There will of course always be temporary problems resulting from outages/incidents in donated resources or regressions in external dependencies outside our control. But if our background failure rate were significantly reduced, a new problem would show up as an order-of-magnitude jump in failures which is easy to spot and mitigate quickly, rather than a sudden 25% uptick whose cause is hard to track down.
--
Jeremy Stanley