[neutron][CI] How to reduce number of rechecks - brainstorming

Slawek Kaplonski skaplons at redhat.com
Thu Nov 18 07:46:11 UTC 2021


Hi,

Thanks, Clark, for the detailed explanation :)

On Wednesday, 17 November 2021 16:51:57 CET you wrote:
> On Wed, Nov 17, 2021, at 2:18 AM, Balazs Gibizer wrote:
> 
> Snip. I want to respond to a specific suggestion:
> > 3) there was informal discussion before about a possibility to re-run
> > only some jobs with a recheck instead of re-running the whole set. I
> > don't know if this is feasible with Zuul and I think this only treats
> > the symptom, not the root cause. But still this could be a direction if
> > all else fails.
> 
> OpenStack has configured its check and gate queues with something we've
> called "clean check". This refers to the requirement that before an
> OpenStack project can be gated it must pass check tests first. This policy
> was instituted because a number of these infrequent but problematic issues
> were traced back to recheck spamming. Basically, changes would show up
> broken and fail some percentage of the time. They got rechecked until they
> finally merged, and now their failure rate is added to the whole. This rule
> was introduced to make it more difficult to get this flakiness into the
> gate.
> 
> Locking in test results is in direct opposition to the existing policy and
> goals. Locking results would make it far more trivial to land such flakiness,
> as you wouldn't need entire sets of jobs to pass before you could land.
> Instead you could rerun individual jobs until each one passed and then land
> the result, potentially introducing significant flakiness with a single
> merge.
> 
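To put rough numbers on the paragraph above, here is a minimal sketch. It assumes every job fails independently with the same probability, which real jobs will not do exactly. The function names and the 10-job, 20% figures are made up for illustration.

```python
def expected_builds_clean_check(p_fail: float, n_jobs: int) -> float:
    """Expected job builds until one full run passes.

    Any failure means the whole set of jobs is run again.
    """
    p_all_pass = (1.0 - p_fail) ** n_jobs
    return n_jobs / p_all_pass


def expected_builds_locked(p_fail: float, n_jobs: int) -> float:
    """Expected job builds if passing results are locked in.

    Only the jobs that failed are rerun, each until it passes once.
    """
    return n_jobs / (1.0 - p_fail)


if __name__ == "__main__":
    # Hypothetical change that makes each of 10 jobs fail 20% of the time.
    print(round(expected_builds_clean_check(0.2, 10), 1))
    print(round(expected_builds_locked(0.2, 10), 1))
```

Under these assumptions the flaky change needs roughly 93 job builds to get one fully clean run, but only 12.5 if results are locked per job, so locking makes it much cheaper to land.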
> Locking results is also not really something that fits well with the
> speculative gate queues that Zuul runs. Remember that Zuul constructs a
> future git state and tests that in parallel. Currently the state for
> OpenStack looks like:
> 
>   A - Nova
>   ^
>   B - Glance
>   ^
>   C - Neutron
>   ^
>   D - Neutron
>   ^
>   F - Neutron
> 
> The B Glance change is tested as if the A Nova change has already merged and
> so on down the queue. If we want to keep these speculative states we can't
> really have humans manually verify a failure can be ignored and retry it,
> because we'd be enqueuing job builds at different stages of speculative
> state. Each job build would be testing a different version of the software.
> 
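A toy model of the queue above may help show why a late retry tests something different. This is only an illustration, not how Zuul is implemented. The function name is hypothetical, and the removal of B is just an assumed example of the queue changing between two builds.

```python
def speculative_states(queue):
    """Map each change to the tuple of changes its jobs are tested against.

    Each change is tested on top of everything ahead of it in the queue.
    """
    return {change: tuple(queue[: i + 1]) for i, change in enumerate(queue)}


if __name__ == "__main__":
    before = speculative_states(["A", "B", "C", "D", "F"])
    # Assumed scenario: B drops out of the queue before a job for C is retried.
    after = speculative_states(["A", "C", "D", "F"])
    print(before["C"])  # ('A', 'B', 'C')
    print(after["C"])   # ('A', 'C')
```

A result for C recorded against the first state and a retried job run against the second would be mixed into one verdict, even though they tested different code.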
> What we could do is implement a retry limit for failing jobs. Zuul could
> rerun failing jobs X times before giving up and reporting failure (this
> would require updates to Zuul). The problem with this approach is that
> without some oversight it becomes very easy to land changes that make things
> worse. As a side note, Zuul does do retries, but only for detected network
> errors or when a pre-run playbook fails. The assumption is that network
> failures are due to the dangers of the Internet, and that pre-run playbooks
> are small, self-contained, unlikely to fail, and when they do fail the
> failure should be independent of what is being tested.
> 
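As a rough sketch of what a retry limit of X attempts would mean, assuming independent failures: the code below is hypothetical and is not an existing Zuul feature or API. `run_with_retry_limit` and `pass_probability` are names invented for this example.

```python
from typing import Callable, Tuple


def run_with_retry_limit(job: Callable[[], bool], max_attempts: int) -> Tuple[bool, int]:
    """Run job until it passes or the limit is hit; return (passed, attempts)."""
    for attempt in range(1, max_attempts + 1):
        if job():
            return True, attempt
    return False, max_attempts


def pass_probability(p_fail: float, max_attempts: int) -> float:
    """Chance that a job failing with probability p_fail passes within the limit."""
    return 1.0 - p_fail ** max_attempts


if __name__ == "__main__":
    outcomes = iter([False, False, True])  # scripted job: fails twice, then passes
    print(run_with_retry_limit(lambda: next(outcomes), 3))  # (True, 3)
    print(pass_probability(0.5, 3))  # 0.875
```

With a limit of 3, a job that a change breaks half of the time would still report success 87.5% of the time, which is the "easy to land changes that make things worse" problem.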
> Where does that leave us?
> 
> I think it is worth considering the original goals of "clean check". We know
> that rechecking/rerunning only makes these problems worse in the long term.
> They represent technical debt. One of the reasons we run these tests is to
> show us when our software is broken. In the case of flaky results we are
> exposing this technical debt where it impacts the functionality of our
> software. The longer we avoid fixing these issues the worse it gets, and
> this is true even with "clean check".

I agree with you on that, and I would really like to find a better solution
for the Neutron problem than rechecking only the broken jobs, as I'm pretty
sure that would quickly make things much worse.

> 
> Do we as developers find value in knowing the software needs attention
> before it gets released to users? Do the users find value in running reliable
> software? In the past we have asserted that "yes, there is value in this",
> and have invested in tracking, investigating, and fixing these problems even
> if they happen infrequently. But that does require investment, and active
> maintenance.
> 
> Clark


-- 
Slawek Kaplonski
Principal Software Engineer
Red Hat