---- On Thu, 18 Nov 2021 01:42:22 -0600 Slawek Kaplonski <skaplons@redhat.com> wrote ----
Hi,
On Wednesday, 17 November 2021 11:18:03 CET Balazs Gibizer wrote:
On Wed, Nov 17 2021 at 09:13:34 AM +0100, Slawek Kaplonski
<skaplons@redhat.com> wrote:
Hi,
Recently I spent some time checking how many rechecks we need in Neutron to get a patch merged, and I compared it to some other OpenStack projects (see [1] for details). TL;DR - the results aren't good for us and I think we really need to do something about that.
I really like the idea of collecting such stats. Thank you for doing it. I can even imagine making a public dashboard somewhere with this information, as it is a good indication of the health of our projects / testing.
Thx. So far it's just a simple script which I run from my terminal to get that data. Nothing else. If You want to use it, it's here: https://github.com/slawqo/tools/tree/master/rechecks
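For readers who want a feel for what such a count involves, here is a minimal sketch. It is not the script from the repository linked above. It assumes the review comments for one patch have already been fetched as plain strings, and that a recheck is any comment with a line starting with the word "recheck"; the function name and the sample comments are hypothetical.

```python
import re

# Assumption: a recheck request is a comment containing a line that starts
# with "recheck" (case-insensitive), optionally followed by a reason.
RECHECK_RE = re.compile(r"^\s*recheck\b", re.IGNORECASE | re.MULTILINE)


def count_rechecks(comments):
    """Return how many of the given review comments are recheck requests."""
    return sum(1 for text in comments if RECHECK_RE.search(text))


# Hypothetical comments left on a single patch.
sample_comments = [
    "Patch Set 1: Code-Review+1",
    "recheck",
    "Patch Set 2:\n\nrecheck functional job timed out",
    "Please do not recheck blindly, see the linked bug.",
]

print(count_rechecks(sample_comments))
```

Averaging this count over recently merged patches would give a per-project number comparable to the one discussed in this thread.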
Of course the "easiest" thing to say is that we should fix the issues which we are hitting in the CI to make the jobs more stable. But it's not that easy. We have been struggling with those jobs for a very long time. We have a CI related meeting every week and we are fixing what we can there. Unfortunately there is still a bunch of issues which we haven't been able to fix so far because they are intermittent and hard to reproduce locally, or in some cases the issues aren't really related to Neutron, or there are new bugs which we need to investigate and fix :)
I have a couple of suggestions based on my experience working with CI in nova.
1) we try to open bug reports for intermittent gate failures too and keep them tagged in a list [1], so when a job fails it is easy to check if the bug is known.
Thx. We are trying more or less to do that, but TBH I think that in many cases we didn't open LPs for such issues. I added it to the list of ideas :)
+1, I think opening bugs is the best way to get the project notified and also to track the issue. I like Slawek's script for collecting the rechecks per project; that is something we can use in the TC to track gate health in the weekly meeting and see which projects have more rechecks. A recheck does not necessarily mean that the project itself has the issue, but at least we will encourage members to open bugs on the corresponding projects. -gmann
2) I offer my help here: if you see something in neutron runs that feels non-neutron-specific, then ping me with it. Maybe we are struggling with the same problem too.
Thanks a lot. I will for sure ping You when I see something like that.
3) there was an informal discussion before about the possibility of re-running only some jobs with a recheck instead of re-running the whole set. I don't know if this is feasible with Zuul, and I think this only treats the symptom, not the root cause. But still, this could be a direction if all else fails.
Yes, I remember that discussion and I totally understand the pros and cons of such a solution, but I added it to the list as well.
Cheers, gibi
So this is a never-ending battle for us. The problem is that we have to test various backends, drivers, etc., so as a result we have many jobs running on each patch - excluding UT, pep8 and docs jobs we have around 19 jobs in the check queue and 14 jobs in the gate queue.
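To illustrate why the number of jobs matters, here is a rough back-of-the-envelope sketch. It assumes every job fails spuriously with the same probability and independently of the others, which is a simplification; the 2% rate is purely illustrative and not a measured value.

```python
def pass_probability(jobs, flaky_rate):
    """Chance that all jobs pass in one run, assuming independent failures."""
    return (1.0 - flaky_rate) ** jobs


def expected_runs(jobs, flaky_rate):
    """Expected number of runs until one fully passes (geometric distribution)."""
    return 1.0 / pass_probability(jobs, flaky_rate)


CHECK_JOBS = 19    # approximate count from the message above
GATE_JOBS = 14     # approximate count from the message above
FLAKY_RATE = 0.02  # hypothetical per-job spurious failure rate

for name, jobs in (("check", CHECK_JOBS), ("gate", GATE_JOBS)):
    p = pass_probability(jobs, FLAKY_RATE)
    runs = expected_runs(jobs, FLAKY_RATE)
    print(f"{name}: {jobs} jobs, P(all pass) = {p:.2f}, expected runs = {runs:.2f}")
```

Under these assumptions, even a small per-job failure rate compounds quickly as jobs are added, which is one reason reducing the number of jobs per patch helps.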
In the past we made a lot of improvements: e.g. we improved the irrelevant-files lists for jobs so that fewer jobs run on some of the patches, together with the QA team we created the "integrated-networking" template to run only Neutron and Nova related scenario tests in the Neutron queues, and we removed and consolidated some of the jobs (there is still one patch in progress for that, but it should just remove around 2 jobs from the check queue). All of those are good improvements but still not enough to make our CI really stable :/
Because of all of that, I would like to ask the community for any other ideas on how we can improve that. If You have any ideas, please send them in this email thread or reach out to me directly on irc. We want to discuss them in the next video CI meeting, which will be on November 30th. If You have any ideas and would like to join that discussion, You are more than welcome in that meeting of course :)
[1] http://lists.openstack.org/pipermail/openstack-discuss/2021-November/025759.html
[1] https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure&orderby=-date_last_updated&start=0
-- Slawek Kaplonski Principal Software Engineer Red Hat