[neutron][CI] How to reduce number of rechecks - brainstorming

Oleg Bondarev oleg.bondarev at huawei.com
Mon Nov 29 07:22:42 UTC 2021


A few thoughts from my side in scope of brainstorm:

1)	Recheck actual bugs (“recheck bug 123456”)
-	not a new idea, but it helps keep better track of all failures
-	forces a developer to investigate the reason for each CI failure and either bump the rating of the corresponding bug or file a new one (or go and finally fix the bug!)
-	I think we should have a gate failure bug dashboard with the hottest bugs on top (maybe there is one that I’m not aware of) so everyone can check whether their CI failure is known or new
-	a simple “recheck” could be forbidden, at least during a “crisis management” window
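
To make point 1 concrete, here is a rough sketch of how a comment filter might tell a bare "recheck" from one that references a bug. It is only an illustration: the function name, the `crisis_mode` flag and the exact comment syntax are assumptions made up for this example, not something Zuul provides today.

```python
import re

# Hypothetical comment syntax "recheck bug <id>"; not an existing Zuul feature.
RECHECK_BUG = re.compile(r"^\s*recheck\s+bug\s+#?(\d+)\s*$", re.IGNORECASE)
BARE_RECHECK = re.compile(r"^\s*recheck\s*$", re.IGNORECASE)


def classify_recheck(comment, crisis_mode=False):
    """Return (allowed, bug_id) for a review comment.

    allowed is None when the comment is not a recheck request at all.
    """
    match = RECHECK_BUG.match(comment)
    if match:
        # A recheck that names a bug is always accepted and can be counted
        # against that bug on a dashboard.
        return True, int(match.group(1))
    if BARE_RECHECK.match(comment):
        # A bare recheck is rejected only during a "crisis management" window.
        return (not crisis_mode), None
    return None, None
```

With this sketch, `classify_recheck("recheck bug 123456")` yields `(True, 123456)`, while a bare `recheck` is refused only when `crisis_mode=True`. The bug ids collected this way are what a "hottest bugs" dashboard would count.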

2)	Allow rechecking of TIMEOUT/POST_FAILURE jobs
-	while I agree that re-running particular jobs is evil, TIMEOUT/POST_FAILURE results are not related to the patch in the majority of cases
-	performance issues are usually caught by Rally jobs
-	of course the core team should monitor whether timeouts become the rule for some jobs
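
The rule in point 2 could be sketched roughly as follows. The function and the shape of the `results` mapping are hypothetical; only the result strings (SUCCESS, FAILURE, TIMEOUT, POST_FAILURE) are the ones Zuul reports.

```python
# Results that, per the suggestion above, are usually unrelated to the patch.
INFRA_RESULTS = {"TIMEOUT", "POST_FAILURE"}


def jobs_eligible_for_rerun(results):
    """Return the non-successful jobs if all of them are TIMEOUT/POST_FAILURE.

    results maps a job name to its result string. An empty list means either
    nothing failed or at least one job failed for another reason, in which
    case a partial rerun is not offered.
    """
    failed = {job: res for job, res in results.items() if res != "SUCCESS"}
    if failed and all(res in INFRA_RESULTS for res in failed.values()):
        return sorted(failed)
    return []
```

A single genuine FAILURE in the set disables the partial rerun, so this shortcut cannot be used to retry individual test failures until they pass.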

3)	Ability to block rechecks in some cases, such as a known gate blocker
-	not everyone is always aware that the gate is blocked by some issue
-	the PTL (or any core team member) can turn off rechecks during that time (with a message from Zuul)
-	does not happen often, but it can still save some CI resources

Advanced Software Technology Lab

-----Original Message-----
From: Slawek Kaplonski [mailto:skaplons at redhat.com] 
Sent: Thursday, November 18, 2021 10:46 AM
To: Clark Boylan <cboylan at sapwetik.org>
Cc: openstack-discuss at lists.openstack.org
Subject: Re: [neutron][CI] How to reduce number of rechecks - brainstorming


Thx Clark for the detailed explanation about that :)

On Wednesday, 17 November 2021 16:51:57 CET you wrote:
> On Wed, Nov 17, 2021, at 2:18 AM, Balazs Gibizer wrote:
> Snip. I want to respond to a specific suggestion:
> > 3) there was informal discussion before about a possibility to 
> > re-run only some jobs with a recheck instead of re-running the 
> > whole set. I don't know if this is feasible with Zuul and I think 
> > this only treats the symptom, not the root cause. But still this could 
> > be a direction if all else fails.
> OpenStack has configured its check and gate queues with something 
> we've called "clean check". This refers to the requirement that before an OpenStack 
> project can be gated it must pass check tests first. This policy was 
> instituted because a number of these infrequent but problematic issues 
> were traced back to recheck spamming. Basically changes would show up 
> and were broken. They would fail some percentage of the time. They got 
> rechecked until they finally merged and now their failure rate is added to the whole. 
> This rule was introduced to make it more difficult to get this 
> flakiness into the gate.
> Locking in test results is in direct opposition to the existing policy 
> and goals. Locking results would make it far more trivial to land such 
> flakiness as you wouldn't need entire sets of jobs to pass before you could land.
> Instead you could rerun individual jobs until each one passed and then 
> land the result, potentially introducing significant flakiness with a 
> single merge.
> Locking results is also not really something that fits well with the 
> speculative gate queues that Zuul runs. Remember that Zuul constructs 
> a future git state and tests that in parallel. Currently the state for 
> OpenStack looks like:
>   A - Nova
>   ^
>   B - Glance
>   ^
>   C - Neutron
>   ^
>   D - Neutron
>   ^
>   F - Neutron
> The B glance change is tested as if the A Nova change has already 
> merged and so on down the queue. If we want to keep these speculative 
> states we can't really have humans manually verify a failure can be ignored and retry it.
> Because we'd be enqueuing job builds at different stages of 
> speculative state. Each job build would be testing a different version of the software.
> What we could do is implement a retry limit for failing jobs. Zuul 
> could rerun failing jobs X times before giving up and reporting failure (this 
> would require updates to Zuul). The problem with this approach is 
> without some oversight it becomes very easy to land changes that make 
> things worse. As a side note Zuul does do retries, but only for 
> detected network errors or when a pre-run playbook fails. The 
> assumption is that network failures are due to the dangers of the 
> Internet, and that pre-run playbooks are small, self contained, 
> unlikely to fail, and when they do fail the failure should be independent of what is being tested.
> Where does that leave us?
> I think it is worth considering the original goals of "clean check". 
> We know that rechecking/rerunning only makes these problems worse in the long term.
> They represent technical debt. One of the reasons we run these tests 
> is to show us when our software is broken. In the case of flaky 
> results we are exposing this technical debt where it impacts the 
> functionality of our software. The longer we avoid fixing these issues 
> the worse it gets, and this is true even with "clean check".

I agree with you on that, and I would really like to find a better solution for the Neutron problem than rechecking only broken jobs, as I'm pretty sure that would quickly make things much worse.
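
A back-of-the-envelope sketch of why rerunning only the broken jobs makes it easier to land flakiness. It assumes every job passes independently with the same probability, which is a simplification (real failures are often correlated), and the function names are illustrative only.

```python
def pass_probability_full_set(per_job_pass, jobs):
    """Chance that every job passes in a single run of the whole set."""
    return per_job_pass ** jobs


def expected_runs_full_set(per_job_pass, jobs):
    """Expected number of whole-set runs until one run passes everything."""
    # Number of attempts until first success follows a geometric distribution.
    return 1 / pass_probability_full_set(per_job_pass, jobs)


def expected_builds_per_job_retry(per_job_pass, jobs):
    """Expected total job builds if each job is retried alone until it passes."""
    return jobs / per_job_pass
```

For a change that makes each of 10 jobs pass only 80% of the time, a full run succeeds roughly 11% of the time, so about 9 complete reruns (around 93 job builds) are expected before it can land. With per-job retries the same change lands after about 12.5 builds, while its contribution to the overall failure rate stays the same.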

> Do we as developers find value in knowing the software needs attention
> before it gets released to users? Do the users find value in running reliable 
> software? In the past we have asserted that "yes, there is value in 
> this", and have invested in tracking, investigating, and fixing these 
> problems even if they happen infrequently. But that does require 
> investment, and active maintenance.
> Clark

Slawek Kaplonski
Principal Software Engineer
Red Hat
