Re: [infra] A change to Zuul's queuing behavior

9 Dec 2018


On 2018-12-09 23:14:37 +0900 (+0900), Ghanshyam Mann wrote:
[...]
> We can optimize node usage by removing the job from the running
> queue on the first failure instead of doing a full run, and then
> releasing the node. This is a trade-off against getting all the
> failures at once and fixing them together, but I am not sure that
> is the case all the time. For example, if a change has a pep8 error
> then there is no need to run integration test jobs on it. This can
> at least save nodes to some extent.

I can recall plenty of times where I've pushed a change which failed
pep8 on some non-semantic whitespace complaint and also had unit
test or integration test failures. In those cases it's quite obvious
that the pep8 failure couldn't have been the cause of the other
failed jobs, so seeing them all saved me from wasting time on
additional patches and waiting for more rounds of results. For that
matter, a lot of my time as a developer (or even as a reviewer) is
saved by seeing which clusters of jobs fail for a given change. For
example, if I see all unit test jobs fail but integration test jobs
pass I can quickly infer that there may be issues with a unit test
that's being modified and spend less time fumbling around in the
dark with various logs.

It's possible we can save some CI resource consumption with such a
trade-off, but doing so comes at the expense of developer and
reviewer time so we have to make sure it's worthwhile. There was a
point in the past where we did something similar (only run other
jobs if a canary linter job passed), and there are good reasons why
we didn't continue it.
-- 
Jeremy Stanley