[neutron][CI] How to reduce number of rechecks - brainstorming
katonalala at gmail.com
Mon Nov 29 10:11:10 UTC 2021
I am not sure what the current status of elastic is, but we should use
elastic-recheck again, keep the bug definitions up-to-date, and dedicate
time to keeping it alive.
From the Zuul status page at least it seems it has fresh data:
It could help reviewers to see feedback from elastic-recheck on whether the
failure in a given patch is an already known bug.
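As an illustration of the kind of feedback meant here, below is a much-simplified sketch of what an elastic-recheck style classifier does: match console log lines against signatures of known gate bugs. The real tool stores Elasticsearch queries in per-bug YAML files; the bug numbers and regex patterns here are invented for the example.

```python
import re

# Hypothetical signatures: bug number -> pattern seen in failed job logs.
# The real elastic-recheck uses Elasticsearch queries, not plain regexes.
KNOWN_BUG_SIGNATURES = {
    "1234567": re.compile(r"Timed out waiting for a reply to message ID"),
    "7654321": re.compile(r"MessagingTimeout: Timed out waiting"),
}

def classify_failure(log_lines):
    """Return the set of known bug numbers whose signature appears."""
    hits = set()
    for line in log_lines:
        for bug, pattern in KNOWN_BUG_SIGNATURES.items():
            if pattern.search(line):
                hits.add(bug)
    return hits
```

A reviewer seeing a non-empty result would know the failure is already tracked and a plain recheck (plus a bump to the known bug) is appropriate.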
Lajos Katona (lajoskatona)
Oleg Bondarev <oleg.bondarev at huawei.com> wrote (on 2021. nov. 29.,
> A few thoughts from my side in scope of brainstorm:
> 1) Recheck actual bugs (“recheck bug 123456”)
> - not a new idea to better keep track of all failures
>      - force a developer to investigate the reason for each CI failure and
> increase the corresponding bug rating, or file a new bug (or go and fix the
> bug finally!)
> - I think we should have some gate failure bugs dashboard with
> hottest bugs on top (maybe there is one that I’m not aware of) so everyone
> could go and check if their CI failure is known or new
> - simple “recheck” could be forbidden, at least during “crisis
> management” window
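A minimal sketch of how point 1 could be enforced, assuming a bot watching Gerrit comments: a bare "recheck" is rejected so the developer has to point at a tracked gate-failure bug. The comment syntax and reply texts are illustrative, not an existing Zuul/Gerrit feature.

```python
import re

# Accepts "recheck" or "recheck bug <number>" (assumed syntax).
RECHECK_RE = re.compile(r"^recheck(?:\s+bug\s+(?P<bug>\d+))?\s*$")

def handle_recheck_comment(comment, allow_bare_recheck=False):
    """Return (should_recheck, bug_number_or_None, reply_message)."""
    m = RECHECK_RE.match(comment.strip().lower())
    if m is None:
        return (False, None, "not a recheck comment")
    bug = m.group("bug")
    if bug is None and not allow_bare_recheck:
        # The "crisis management" mode: bare rechecks forbidden.
        return (False, None,
                "bare 'recheck' is disabled; use 'recheck bug <number>'")
    return (True, bug, "recheck accepted")
```

The `allow_bare_recheck` flag corresponds to whether the team is inside the proposed "crisis management" window.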
> 2) Allow recheck TIMEOUT/POST_FAILURE jobs
>      - while I agree that re-running particular jobs is evil,
> TIMEOUT/POST_FAILURE results are not related to the patch in the majority of cases
> - performance issues are usually caught by Rally jobs
> - of course core team should monitor if timeouts become a rule for
> some jobs
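Point 2 amounts to a whitelist policy over Zuul build results: only jobs that ended in TIMEOUT or POST_FAILURE would qualify for an individual re-run, while ordinary FAILUREs would still require a full recheck. The result strings follow what Zuul reports for builds; the policy function itself is hypothetical.

```python
# Results considered unrelated to the patch in the majority of cases.
RETRYABLE_RESULTS = {"TIMEOUT", "POST_FAILURE"}

def jobs_eligible_for_rerun(build_results):
    """build_results: mapping of job name -> Zuul result string.

    Returns the sorted list of jobs the policy would allow to re-run
    individually; everything else needs a full recheck.
    """
    return sorted(job for job, result in build_results.items()
                  if result in RETRYABLE_RESULTS)
```
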
> 3) Ability to block rechecks in some cases, like known gate blocker
> - not everyone is always aware that gates are blocked with some issue
> - PTL (or any core team member) can turn off rechecks during that
> time (with a message from Zuul)
>      - this doesn't happen often, but it can still save some CI resources
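Point 3 could be as simple as a PTL-settable "gate blocker" switch that makes the recheck handler refuse rechecks, with an explanatory message from the bot, while a known gate-wide issue is being fixed. The data shape and messages below are invented for illustration.

```python
def recheck_allowed(gate_blocker=None):
    """gate_blocker: None when gates are healthy, or a dict such as
    {"bug": "1999999", "set_by": "ptl"} while a gate blocker is active.

    Returns (allowed, message_for_the_bot_to_post).
    """
    if gate_blocker is None:
        return (True, "")
    return (False,
            "rechecks are temporarily disabled: known gate blocker, "
            "see bug %s" % gate_blocker["bug"])
```
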
> Advanced Software Technology Lab
> -----Original Message-----
> From: Slawek Kaplonski [mailto:skaplons at redhat.com]
> Sent: Thursday, November 18, 2021 10:46 AM
> To: Clark Boylan <cboylan at sapwetik.org>
> Cc: openstack-discuss at lists.openstack.org
> Subject: Re: [neutron][CI] How to reduce number of rechecks - brainstorming
> Thx Clark for detailed explanation about that :)
> On Wednesday, 17 November 2021 16:51:57 CET you wrote:
> > On Wed, Nov 17, 2021, at 2:18 AM, Balazs Gibizer wrote:
> > Snip. I want to respond to a specific suggestion:
> > > 3) there was informal discussion before about a possibility to
> > > re-run only some jobs with a recheck instead for re-running the
> > > whole set. I don't know if this is feasible with Zuul and I think
> > > this only treat the symptom not the root case. But still this could
> > > be a direction if all else fails.
> > OpenStack has configured its check and gate queues with something
> > we've called "clean check". This refers to the requirement that before an OpenStack
> > project can be gated it must pass check tests first. This policy was
> > instituted because a number of these infrequent but problematic issues
> > were traced back to recheck spamming. Basically, changes would show up
> > broken. They would fail some percentage of the time. They got rechecked
> > until they finally merged, and then their failure rate was added to the
> > whole.
> > This rule was introduced to make it more difficult to get this
> > flakiness into the gate.
> > Locking in test results is in direct opposition to the existing policy
> > and goals. Locking results would make it far more trivial to land such
> > flakiness as you wouldn't need entire sets of jobs to pass before you
> > could land. Instead you could rerun individual jobs until each one
> > passed and then land the result, potentially introducing significant
> > flakiness with a single merge.
> > Locking results is also not really something that fits well with the
> > speculative gate queues that Zuul runs. Remember that Zuul constructs
> > a future git state and tests that in parallel. Currently the state for
> > OpenStack looks like:
> > A - Nova
> > ^
> > B - Glance
> > ^
> > C - Neutron
> > ^
> > D - Neutron
> > ^
> > F - Neutron
> > The B glance change is tested as if the A Nova change has already
> > merged and so on down the queue. If we want to keep these speculative
> > states we can't really have humans manually verify that a failure can be
> > ignored and retry it, because we'd be enqueuing job builds at different
> > stages of speculative state. Each job build would be testing a different
> > version of the software.
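A toy illustration of the speculative gate state from the queue example above: each enqueued change is tested on a future git state that assumes every change ahead of it in the queue has already merged. The queue contents mirror the A-F diagram; the function is only a model of the idea, not Zuul code.

```python
# The example gate queue: (change id, project), head of queue first.
queue = [("A", "Nova"), ("B", "Glance"), ("C", "Neutron"),
         ("D", "Neutron"), ("F", "Neutron")]

def speculative_state(queue, change_id):
    """Return the changes assumed merged when testing change_id,
    i.e. everything ahead of it in the queue plus itself."""
    state = []
    for cid, project in queue:
        state.append((cid, project))
        if cid == change_id:
            return state
    raise KeyError(change_id)
```

This is why a manual retry of one job breaks the model: a later re-run of C's job would no longer be testing the same speculative state the rest of the queue was built on.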
> > What we could do is implement a retry limit for failing jobs. Zuul
> > could retry failing jobs X times before giving up and reporting failure (this
> > would require updates to Zuul). The problem with this approach is
> > without some oversight it becomes very easy to land changes that make
> > things worse. As a side note Zuul does do retries, but only for
> > detected network errors or when a pre-run playbook fails. The
> > assumption is that network failures are due to the dangers of the
> > Internet, and that pre-run playbooks are small, self-contained,
> > unlikely to fail, and when they do fail the failure should be
> > independent of what is being tested.
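The contrast Clark describes can be sketched as a single decision function: what Zuul does today (retrying only infrastructure-level failures, up to its default of 3 attempts per job) versus the hypothetical "retry any failing job up to X times" extension. The failure-kind labels and the function are illustrative only.

```python
def should_retry(failure_kind, attempt, retry_any_limit=0):
    """failure_kind: 'network', 'pre_run', or 'job' (a real test failure).
    attempt: how many attempts have already been made.
    retry_any_limit: the hypothetical X from the proposal (0 = today's
    behaviour, where real job failures are never retried)."""
    if failure_kind in ("network", "pre_run"):
        # What Zuul does today; 3 is Zuul's default attempts limit.
        return attempt < 3
    # The proposed, and risky, extension: retry genuine job failures too.
    return attempt < retry_any_limit
```

With `retry_any_limit > 0` a flaky job can be hammered until it passes, which is exactly the oversight problem described above.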
> > Where does that leave us?
> > I think it is worth considering the original goals of "clean check".
> > We know that rechecking/rerunning only makes these problems worse in the
> > long term.
> > They represent technical debt. One of the reasons we run these tests
> > is to show us when our software is broken. In the case of flaky
> > results we are exposing this technical debt where it impacts the
> > functionality of our software. The longer we avoid fixing these issues
> > the worse it gets, and
> > is true even with "clean check".
> I agree with you on that, and I would really like to find a better/other
> solution for the Neutron problem than rechecking only broken jobs, as I'm
> pretty sure that this would quickly make things much worse.
> > Do we as developers find value in knowing the software needs attention
> > before it gets released to users? Do the users find value in running reliable
> > software? In the past we have asserted that "yes, there is value in
> > this", and have invested in tracking, investigating, and fixing these
> > problems even if they happen infrequently. But that does require
> > investment, and active maintenance.
> > Clark
> Slawek Kaplonski
> Principal Software Engineer
> Red Hat