[neutron][CI] How to reduce number of rechecks - brainstorming

Lajos Katona katonalala at gmail.com
Mon Nov 29 10:11:10 UTC 2021


Hi,
I am not sure what the current status of Elasticsearch is, but we should
start using elastic-recheck again, keep the bug definitions up-to-date and
dedicate time to keep it alive.
From the Zuul status page it seems, at least, that it has fresh data:
http://status.openstack.org/elastic-recheck/data/integrated_gate.html

It could help reviewers to see feedback from elastic-recheck indicating
whether the failure in a given patch is an already known bug.
https://docs.openstack.org/infra/elastic-recheck/readme.html
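
For illustration, a bug definition in elastic-recheck is just a small YAML
file under queries/ in the repo, named after the Launchpad bug it tracks.
A minimal sketch (the bug number and log signature below are made up):

  # queries/1234567.yaml
  query: >-
    message:"Timed out waiting for the agent to respond" AND
    tags:"screen-q-svc.txt"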

regards
Lajos Katona (lajoskatona)


Oleg Bondarev <oleg.bondarev at huawei.com> wrote (on Mon, 29 Nov 2021, 8:35):

> Hello,
>
> A few thoughts from my side in scope of brainstorm:
>
> 1)      Recheck actual bugs (“recheck bug 123456”)
> -       not a new idea; it would help keep better track of all failures
> -       force a developer to investigate the reason for each CI failure
> and increase the corresponding bug's rating, or file a new bug (or go and
> fix this bug finally!)
> -       I think we should have a gate-failure bug dashboard with the
> hottest bugs on top (maybe there is one that I'm not aware of) so everyone
> can check whether their CI failure is known or new
> -       a plain “recheck” could be forbidden, at least during a “crisis
> management” window (see the sketch after this list)
>
> 2)      Allow rechecking TIMEOUT/POST_FAILURE jobs
> -       while I agree that re-running particular jobs is evil,
> TIMEOUT/POST_FAILURE are not related to the patch in the majority of cases
> -       performance issues are usually caught by Rally jobs
> -       of course the core team should monitor whether timeouts become the
> rule for some jobs
>
> 3)      Ability to block rechecks in some cases, like a known gate blocker
> -       not everyone is always aware that the gates are blocked by some
> issue
> -       the PTL (or any core team member) can turn off rechecks during
> that time (with a message from Zuul)
> -       this doesn't happen often, but it can still save some CI resources
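>
> A rough sketch of how 1) and 3) could look on the Zuul side, assuming the
> Gerrit comment trigger of the check pipeline (the regex and layout here
> are illustrative, not the exact project-config):
>
>   - pipeline:
>       name: check
>       trigger:
>         gerrit:
>           - event: comment-added
>             # only accept rechecks that reference a Launchpad bug;
>             # removing or tightening this trigger would block rechecks
>             # entirely during a gate blocker
>             comment: (?i)^.*\brecheck bug [0-9]+\b.*$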
>
> Thanks,
> Oleg
> ---
> Advanced Software Technology Lab
> Huawei
>
> -----Original Message-----
> From: Slawek Kaplonski [mailto:skaplons at redhat.com]
> Sent: Thursday, November 18, 2021 10:46 AM
> To: Clark Boylan <cboylan at sapwetik.org>
> Cc: openstack-discuss at lists.openstack.org
> Subject: Re: [neutron][CI] How to reduce number of rechecks - brainstorming
>
> Hi,
>
> Thanks Clark for the detailed explanation of that :)
>
> On Wednesday, 17 November 2021 16:51:57 CET you wrote:
> > On Wed, Nov 17, 2021, at 2:18 AM, Balazs Gibizer wrote:
> >
> > Snip. I want to respond to a specific suggestion:
> > > 3) there was informal discussion before about the possibility of
> > > re-running only some jobs with a recheck instead of re-running the
> > > whole set. I don't know if this is feasible with Zuul, and I think
> > > this only treats the symptom, not the root cause. But still, this
> > > could be a direction if all else fails.
> >
> > OpenStack has configured its check and gate queues with something we've
> > called "clean check". This refers to the requirement that before an
> > OpenStack project can be gated it must pass check tests first. This
> > policy was instituted because a number of these infrequent but
> > problematic issues were traced back to recheck spamming. Basically,
> > changes would show up and be broken. They would fail some percentage of
> > the time. They got rechecked until they finally merged, and now their
> > failure rate is added to the whole. This rule was introduced to make it
> > more difficult to get this flakiness into the gate.
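> >
> > As a rough sketch (the exact project-config differs), "clean check" is
> > enforced by having the gate pipeline require a Verified vote from Zuul
> > itself, which a change can only get by passing the check pipeline:
> >
> >   - pipeline:
> >       name: gate
> >       manager: dependent
> >       require:
> >         gerrit:
> >           open: True
> >           current-patchset: True
> >           approval:
> >             # Zuul's own +1/+2 Verified vote, i.e. a passing check run
> >             - Verified: [1, 2]
> >               username: zuul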
> >
> > Locking in test results is in direct opposition to the existing policy
> > and goals. Locking results would make it far easier to land such
> > flakiness, as you wouldn't need entire sets of jobs to pass before you
> > could land. Instead you could rerun individual jobs until each one
> > passed and then land the result, potentially introducing significant
> > flakiness with a single merge.
> >
> > Locking results is also not really something that fits well with the
> > speculative gate queues that Zuul runs. Remember that Zuul constructs
> > a future git state and tests that in parallel. Currently the state for
> > OpenStack looks like:
> >
> >   A - Nova
> >   ^
> >   B - Glance
> >   ^
> >   C - Neutron
> >   ^
> >   D - Neutron
> >   ^
> >   F - Neutron
> >
> > The B Glance change is tested as if the A Nova change has already
> > merged, and so on down the queue. If we want to keep these speculative
> > states we can't really have humans manually verify that a failure can be
> > ignored and retry it, because we'd be enqueuing job builds at different
> > stages of speculative state. Each job build would be testing a different
> > version of the software.
> >
> > What we could do is implement a retry limit for failing jobs. Zuul could
> > rerun failing jobs X times before giving up and reporting failure (this
> > would require updates to Zuul). The problem with this approach is that
> > without some oversight it becomes very easy to land changes that make
> > things worse. As a side note, Zuul does do retries, but only for
> > detected network errors or when a pre-run playbook fails. The assumption
> > is that network failures are due to the dangers of the Internet, and
> > that pre-run playbooks are small, self-contained, unlikely to fail, and
> > when they do fail the failure should be independent of what is being
> > tested.
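> >
> > For reference, that existing retry behaviour is tunable per job via the
> > "attempts" attribute (a sketch; the job name is hypothetical):
> >
> >   - job:
> >       name: neutron-tempest-example
> >       # upper bound on how many times Zuul will restart the job for
> >       # the retryable errors described above before reporting an error
> >       attempts: 3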
> >
> > Where does that leave us?
> >
> > I think it is worth considering the original goals of "clean check".
> > We know that rechecking/rerunning only makes these problems worse in the
> > long term. They represent technical debt. One of the reasons we run
> > these tests is to show us when our software is broken. In the case of
> > flaky results we are exposing this technical debt where it impacts the
> > functionality of our software. The longer we avoid fixing these issues
> > the worse it gets, and this is true even with "clean check".
>
> I agree with you on that, and I would really like to find a better/other
> solution for the Neutron problem than rechecking only the broken jobs, as
> I'm pretty sure that this would quickly make things much worse.
>
> >
> > Do we as developers find value in knowing the software needs attention
> > before it gets released to users? Do the users find value in running
> > reliable software? In the past we have asserted that "yes, there is
> > value in this", and have invested in tracking, investigating, and fixing
> > these problems even if they happen infrequently. But that does require
> > investment, and active maintenance.
> >
> > Clark
>
>
> --
> Slawek Kaplonski
> Principal Software Engineer
> Red Hat
>