<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Nov 17, 2021 at 5:22 AM Balazs Gibizer &lt;balazs.gibizer@est.tech&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

<br>

On Wed, Nov 17 2021 at 09:13:34 AM +0100, Slawek Kaplonski <br>

&lt;<a href="mailto:skaplons@redhat.com" target="_blank">skaplons@redhat.com</a>&gt; wrote:<br>

> Hi,<br>

> <br>

> Recently I spent some time checking how many rechecks we need in <br>

> Neutron to<br>

> get a patch merged, and I compared it to some other OpenStack projects <br>

> (see [1]<br>

> for details).<br>

> TL;DR - results aren't good for us and I think we really need to do <br>

> something<br>

> about that.<br>

<br>

I really like the idea of collecting such stats. Thank you for doing <br>

it. I can even imagine making a public dashboard somewhere with this <br>

information, as it is a good indication of the health of our projects <br>

/ testing.<br>

<br>

> <br>

> Of course the "easiest" thing to say is that we should fix the issues which <br>

> we are<br>

> hitting in the CI to make jobs more stable. But it's not that easy. <br>

> We are<br>

> struggling with those jobs for a very long time. We have a CI-related <br>

> meeting<br>

> every week and we are fixing what we can there.<br>

> Unfortunately, there is still a bunch of issues which we can't fix so <br>

> far because<br>

> they are intermittent and hard to reproduce locally or in some cases <br>

> the<br>

> issues aren't really related to Neutron, or there are new bugs <br>

> which we need<br>

> to investigate and fix :)<br>

<br>

<br>

I have a couple of suggestions based on my experience working with CI in <br>

nova.<br></blockquote><div><br></div><div>We've struggled with unstable tests in TripleO as well. Here are some things we tried and implemented:</div><div><br></div><div>1. Created job dependencies so that check tests only ran once we knew we had the resources we needed (for example, after we had pulled containers successfully)</div><div><br></div><div>2. Moved some testing to third-party CI, where we have easier control of the environment (note that third-party CI cannot stop a change from merging)</div><div><br></div><div>3. Used dependency pipelines to pre-qualify some dependencies before letting them run wild on our check jobs</div><div><br></div><div>4. Requested testproject runs of changes in a less busy environment before running a full set of tests in a public Zuul</div><div><br></div><div>5. Used a skiplist to keep track of tech debt and skip known failures that we could temporarily ignore, to keep CI moving along while waiting on an external fix.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

1) we try to open bug reports for intermittent gate failures too and <br>

keep them tagged in a list [1] so when a job fails it is easy to check <br>

if the bug is known.<br>

<br>

2) I offer my help here: if you see something in neutron runs <br>

that feels non-neutron-specific, then ping me with it. Maybe we are <br>

struggling with the same problem too.<br>

<br>

3) there was an informal discussion before about the possibility of re-running <br>

only some jobs with a recheck instead of re-running the whole set. I <br>

don't know if this is feasible with Zuul, and I think this only treats <br>

the symptom, not the root cause. But still, this could be a direction if <br>

all else fails.<br>

<br>

Cheers,<br>

gibi<br>

<br>

> So this is a never-ending battle for us. The problem is that we have <br>

> to test<br>

> various backends, drivers, etc. so as a result we have many jobs <br>

> running on<br>

> each patch - excluding UT, pep8 and docs jobs we have around 19 jobs <br>

> in check<br>

> and 14 jobs in gate queue.<br>

> <br>

> In the past we made a lot of improvements: for example, we improved the <br>

> irrelevant<br>

> files lists for jobs to run fewer jobs on some of the patches, <br>

> together with the QA<br>

> team we created the "integrated-networking" template to run only Neutron and <br>

> Nova<br>

> related scenario tests in the Neutron queues, we removed and <br>

> consolidated some<br>

> of the jobs (there is still one patch in progress for that but it <br>

> should just<br>

> remove around 2 jobs from the check queue). All of those are good <br>

> improvements<br>

> but still not enough to make our CI really stable :/<br>

> <br>

> Because of all of that, I would like to ask the community for any other <br>

> ideas<br>

> on how we can improve that. If you have any ideas, please send them in <br>

> this email<br>

> thread or reach out to me directly on IRC.<br>

> We want to discuss them in the next video CI meeting, which will <br>

> be on<br>

> November 30th. If you have any ideas and would like to join that<br>

> discussion, you are more than welcome at that meeting, of course :)<br>

> <br>

> [1] <br>

> <a href="http://lists.openstack.org/pipermail/openstack-discuss/2021-November/025759.html" rel="noreferrer" target="_blank">http://lists.openstack.org/pipermail/openstack-discuss/2021-November/025759.html</a><br>

<br>

<br>

[1] <br>

<a href="https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure&orderby=-date_last_updated&start=0" rel="noreferrer" target="_blank">https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure&orderby=-date_last_updated&start=0</a><br>

<br>

<br>

> <br>

> --<br>

> Slawek Kaplonski<br>

> Principal Software Engineer<br>

> Red Hat<br>

<br>

<br>

<br>

</blockquote></div></div>