[openstack-dev] [infra] redo gate jobs only

Jeremy Stanley fungi at yuggoth.org
Thu Aug 3 17:49:06 UTC 2017

On 2017-08-03 08:15:36 +0200 (+0200), Andreas Jaeger wrote:
> "A patchset has to be approved to run tests in the gate pipeline. If the
> patchset has failed in the gate pipeline (it will have been approved to
> get into the gate pipeline) a recheck will first run the check jobs and
> if those pass, it will again run the gate jobs. There is no way to only
> run the gate jobs, the check jobs will first be run again."

The reasons being:

1. There's no good way to decide how long is too long to wait
between passing jobs in check and running jobs in the gate. We used
to not enforce this "clean check" policy, and developers would
reverify broken changes back into the gate pipeline over and over
because their change had passed check jobs once (perhaps many months
earlier), creating a significant amount of additional disruption.
Now they at least only get to disrupt the gate once, when that
change gets approved; after that it won't be able to make it back
into the gate until a fixed revision is uploaded, and so doesn't
further slow down the merging of unrelated changes.

2. If a change passes jobs once (in check) and then fails later (in
the gate) then there's a fair chance it's introducing a
nondeterministic bug (one which only manifests sometimes but not on
every run). Back when we used to allow reverification directly in
the gate pipeline for changes which passed check, we had people
rechecking flaky changes until they passed and then reverifying them
over and over after approval until they made it through the gate.
Under these conditions, on average, a recheck followed by a reverify
could merge a change which failed jobs 50% of the time; 9 rechecks
and 9 reverifies could merge a change which failed jobs 90% of the
time. With the current requirement to pass both check and gate in
series, it takes on average 3 rechecks to merge a 50% failing change
and 99 rechecks to merge a 90% failing change.
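
The averages in point 2 can be reproduced with a small sketch. It
assumes a simplified model which is not spelled out above: every run
of a pipeline passes independently with the same probability, so the
number of attempts needed follows a geometric distribution. The
function names are made up for illustration and are not part of Zuul
or any infra tooling.

```python
from fractions import Fraction


def expected_retries_independent(pass_rate):
    """Old behaviour (assumed model): check and gate are retried
    separately, so each pipeline needs 1/p attempts on average.
    Returns the expected retries per pipeline, i.e. the number of
    rechecks and, separately, the number of reverifies."""
    return 1 / pass_rate - 1


def expected_retries_clean_check(pass_rate):
    """Clean check (assumed model): any failure sends the change back
    to check, so one attempt succeeds only if check and gate pass
    back to back, with probability p * p."""
    return 1 / (pass_rate * pass_rate) - 1


# Fractions keep the arithmetic exact instead of using floats.
for fail_rate in (Fraction(1, 2), Fraction(9, 10)):
    p = 1 - fail_rate
    print(
        f"fails {float(fail_rate):.0%} of the time: "
        f"old policy {expected_retries_independent(p)} recheck(s) "
        f"plus {expected_retries_independent(p)} reverify(s), "
        f"clean check {expected_retries_clean_check(p)} recheck(s)"
    )
```

Under that model a 50% failing change needs 1 recheck plus 1 reverify
without clean check versus 3 rechecks with it, and a 90% failing
change needs 9 plus 9 versus 99.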

So basically if a change fails in the gate pipeline, there's good
reason for it to get increased scrutiny at least in the form of
trying the jobs again in the check pipeline before going back to the
gate once more.

Jeremy Stanley
