[neutron][CI] How to reduce number of rechecks - brainstorming

Slawek Kaplonski skaplons at redhat.com
Thu Nov 18 07:42:22 UTC 2021


Hi,

On Wednesday, 17 November 2021 11:18:03 CET Balazs Gibizer wrote:
> On Wed, Nov 17 2021 at 09:13:34 AM +0100, Slawek Kaplonski
> <skaplons at redhat.com> wrote:
> > Hi,
> > 
> > Recently I spent some time checking how many rechecks we need in
> > Neutron to get a patch merged, and I compared it to some other
> > OpenStack projects (see [1] for details).
> > TL;DR - the results aren't good for us and I think we really need to
> > do something about that.
> 
> I really like the idea of collecting such stats. Thank you for doing
> it. I can even imagine making a public dashboard somewhere with this
> information, as it is a good indication of the health of our projects
> / testing.

Thx. So far it's just a simple script which I run from my terminal to get that
data. Nothing else. If you want to use it, it's here:
https://github.com/slawqo/tools/tree/master/rechecks
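
For anyone who wants the general idea without reading the script, below is a
minimal sketch of the counting logic. It is not the code from the repository
linked above. The function names, the sample data and the rule that a recheck
is any review comment with a line starting with "recheck" are all assumptions
made for illustration, and a real version would still need to fetch the
comments from Gerrit.

```python
import re
from statistics import mean

# Assumption: a recheck is any review comment that has a line starting
# with the word "recheck" (optionally followed by a reason).
RECHECK_RE = re.compile(r"^\s*recheck\b", re.IGNORECASE | re.MULTILINE)


def count_rechecks(comments):
    """Return how many of the given comment strings request a recheck."""
    return sum(1 for comment in comments if RECHECK_RE.search(comment))


def average_rechecks(changes):
    """Average number of rechecks per change.

    `changes` maps a change identifier to the list of review comments
    left on that change. Returns 0.0 when there are no changes.
    """
    if not changes:
        return 0.0
    return mean(count_rechecks(comments) for comments in changes.values())


# Hypothetical sample data, not real Neutron changes.
sample = {
    "change-1": ["recheck", "Looks good", "recheck unrelated failure"],
    "change-2": ["Patch Set 2: Code-Review+2"],
    "change-3": ["Patch Set 1:\n\nrecheck"],
}

if __name__ == "__main__":
    print("average rechecks per change:", average_rechecks(sample))
```

Restricting `changes` to merged patches only would give a number comparable
to "rechecks needed to get a patch merged".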

> 
> > Of course the "easiest" thing to say is that we should fix the issues
> > which we are hitting in the CI to make the jobs more stable. But it's
> > not that easy. We have been struggling with those jobs for a very long
> > time. We have a CI-related meeting every week and we are fixing what
> > we can there.
> > Unfortunately there is still a bunch of issues which we can't fix so
> > far because they are intermittent and hard to reproduce locally, or in
> > some cases the issues aren't really related to Neutron, or there are
> > new bugs which we need to investigate and fix :)
> 
> I have a couple of suggestions based on my experience working with CI
> in nova.
> 
> 1) we try to open bug reports for intermittent gate failures too and
> keep them tagged in a list [1] so when a job fails it is easy to check
> if the bug is known.

Thx. We are trying more or less to do that, but TBH I think that in many cases 
we didn't open LPs for such issues.
I added it to the list of ideas :)

> 
> 2) I offer my help here now that if you see something in neutron runs
> that feels non neutron specific then ping me with it. Maybe we are
> struggling with the same problem too.

Thanks a lot. I will for sure ping you when I see something like that.

> 
> 3) there was an informal discussion before about the possibility of
> re-running only some jobs with a recheck instead of re-running the
> whole set. I don't know if this is feasible with Zuul and I think this
> only treats the symptom, not the root cause. But still this could be a
> direction if all else fails.

Yes, I remember that discussion and I totally understand the pros and cons of
such a solution, but I added it to the list as well.

> 
> Cheers,
> gibi
> 
> > So this is a never-ending battle for us. The problem is that we have
> > to test various backends, drivers, etc., so as a result we have many
> > jobs running on each patch - excluding UT, pep8 and docs jobs we have
> > around 19 jobs in the check queue and 14 jobs in the gate queue.
> > 
> > In the past we made a lot of improvements: e.g. we improved the
> > irrelevant files lists for jobs to run fewer jobs on some of the
> > patches, together with the QA team we created the
> > "integrated-networking" template to run only Neutron and Nova related
> > scenario tests in the Neutron queues, and we removed and consolidated
> > some of the jobs (there is still one patch in progress for that but it
> > should just remove around 2 jobs from the check queue). All of those
> > are good improvements but still not enough to make our CI really
> > stable :/
> > 
> > Because of all of that, I would like to ask the community for any
> > other ideas on how we can improve that. If you have any ideas, please
> > send them in this email thread or reach out to me directly on irc.
> > We want to discuss them in the next video CI meeting, which will be on
> > November 30th. If you have an idea and would like to join that
> > discussion, you are more than welcome in that meeting of course :)
> > 
> > [1]
> > http://lists.openstack.org/pipermail/openstack-discuss/2021-November/025759.html
> 
> [1]
> https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure&orderby=-date_last_updated&start=0
> > --
> > Slawek Kaplonski
> > Principal Software Engineer
> > Red Hat


-- 
Slawek Kaplonski
Principal Software Engineer
Red Hat