<div dir="ltr">The neutron full job is finally voting, and the first patch [1] has already passed it in gate checks!<div>I've collected a few data points before it was switched to voting, and we should probably expect a failure rate around 4%. This is not bad, but neither great, and everybody's contribution will be appreciated in reporting and assessing the nature gate failures, which, needless to say, are mostly races.</div>
<div><div><br></div><div>Note: we've also added the postgresql version of the same job, but that is not voting yet as we never executed it before.</div><div><br></div><div>Salvatore<br><div><br></div><div>[1] <a href="https://review.openstack.org/#/c/105694/">https://review.openstack.org/#/c/105694/</a></div>
On 12 August 2014 20:14, Salvatore Orlando <sorlando@nicira.com> wrote:

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">And just when the patch was only missing a +A, another bug slipped in!<div>The nova patch to fix it is available at [1] <br>
</div><div><br></div><div>And while we're there, it won't be a bad idea to also push the neutron full job, as non-voting, into the integrated gate [2]</div>
<div><br></div><div>Thanks in advance,</div><div>(especially to the nova and infra cores who'll review these patches!)</div><div>Salvatore</div><div><br></div><div>[1] <a href="https://review.openstack.org/#/c/113554/" target="_blank">https://review.openstack.org/#/c/113554/</a></div>
<div>[2] <a href="https://review.openstack.org/#/c/113562/" target="_blank">https://review.openstack.org/#/c/113562/</a></div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><br><div class="gmail_quote">
On 7 August 2014 17:51, Salvatore Orlando <sorlando@nicira.com> wrote:

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Thanks Armando,<div><br></div><div>The fix for the bug you pointed out was the reason of the failure we've been seeing.</div>
<div>The follow-up patch merged and I've removed the wip status from the patch for the full job [1]</div>
<div><br></div><div>Salvatore </div><div><br></div><div>[1] <a href="https://review.openstack.org/#/c/88289/" target="_blank">https://review.openstack.org/#/c/88289/</a></div></div><div><div><div class="gmail_extra">
<br><br><div class="gmail_quote">
On 7 August 2014 16:50, Armando M. <armamig@gmail.com> wrote:

<div dir="ltr">Hi Salvatore,<div><br></div><div>I did notice the issue and I flagged this bug report:</div><div><br></div><div><a href="https://bugs.launchpad.net/nova/+bug/1352141" target="_blank">https://bugs.launchpad.net/nova/+bug/1352141</a><br>
</div><div><br></div><div>I'll follow up.</div><div><br></div><div>Cheers,</div><div>Armando</div></div><div><div><div class="gmail_extra"><br><br><div class="gmail_quote">On 7 August 2014 01:34, Salvatore Orlando <span dir="ltr"><<a href="mailto:sorlando@nicira.com" target="_blank">sorlando@nicira.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">I had to put the patch back on WIP because yesterday a bug causing a 100% failure rate slipped in.<div>
It should be an easy fix, and I'm already working on it.</div>
<div>Situations like this, exemplified by [1] are a bit frustrating for all the people working on improving neutron quality.<br>
</div><div>Now, if you allow me a little rant, as Neutron is receiving a lot of attention for all the ongoing discussion regarding this group policy stuff, would it be possible for us to receive a bit of attention to ensure both the full job and the grenade one are switched to voting before the juno-3 review crunch.</div>
<div><br></div><div>We've already had the attention of the QA team, it would probably good if we could get the attention of the infra core team to ensure:</div><div>1) the jobs are also deemed by them stable enough to be switched to voting</div>
<div>2) the relevant patches for openstack-infra/config are reviewed</div><div><br></div><div>Regards,</div><div>Salvatore</div><div><br></div><div>[1] <a href="http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwie3UnbWVzc2FnZSc6IHUnRmxvYXRpbmcgaXAgcG9vbCBub3QgZm91bmQuJywgdSdjb2RlJzogNDAwfVwiIEFORCBidWlsZF9uYW1lOlwiY2hlY2stdGVtcGVzdC1kc3ZtLW5ldXRyb24tZnVsbFwiIEFORCBidWlsZF9icmFuY2g6XCJtYXN0ZXJcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiMTcyODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTQwNzQwMDExMDIwNywibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==" target="_blank">http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwie3UnbWVzc2FnZSc6IHUnRmxvYXRpbmcgaXAgcG9vbCBub3QgZm91bmQuJywgdSdjb2RlJzogNDAwfVwiIEFORCBidWlsZF9uYW1lOlwiY2hlY2stdGVtcGVzdC1kc3ZtLW5ldXRyb24tZnVsbFwiIEFORCBidWlsZF9icmFuY2g6XCJtYXN0ZXJcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiMTcyODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTQwNzQwMDExMDIwNywibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==</a></div>
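
Incidentally, the logstash link in [1] is just a JSON query blob, base64-encoded and appended as the URL fragment. Below is a minimal sketch of how such a link can be built; the field names come from decoding the link above, while whether the dashboard accepts a blob with a different key order and whitespace is an assumption:

    import base64
    import json

    # The query from [1]: "Floating ip pool not found" failures in the
    # check-tempest-dsvm-neutron-full job on master.
    query = ('message:"{u\'message\': u\'Floating ip pool not found.\', '
             'u\'code\': 400}" AND build_name:"check-tempest-dsvm-neutron-full" '
             'AND build_branch:"master"')

    params = {
        "search": query,
        "fields": [],
        "offset": 0,
        "timeframe": "172800",          # last 48 hours, in seconds
        "graphmode": "count",
        "time": {"user_interval": 0},
        "stamp": 1407400110207,         # epoch milliseconds, as in the link above
        "mode": "",
        "analyze_field": "",
    }

    fragment = base64.b64encode(json.dumps(params).encode("utf-8")).decode("ascii")
    print("http://logstash.openstack.org/#" + fragment)
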
On 23 July 2014 14:59, Matthew Treinish <mtreinish@kortar.org> wrote:

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>On Wed, Jul 23, 2014 at 02:40:02PM +0200, Salvatore Orlando wrote:<br>
> Here I am again bothering you with the state of the full job for Neutron.
>
> The patch fixing an issue in nova's server external events extension
> merged yesterday [1].
> We do not yet have enough data points to make a reliable assessment, but
> out of 37 runs since the patch merged we had "only" 5 failures, which puts
> the failure rate at about 13%.
>
> This is ugly compared with the current failure rate of the smoke test (3%).
> However, I think it is good enough to start making the full job voting, at
> least for neutron patches.
> Once we are able to bring the failure rate down to around 5%, we can then
> enable the job everywhere.

I think that sounds like a good plan. I'm also curious how the failure rates
compare to those of the other non-neutron jobs; that might be a useful
comparison for deciding when to flip the switch everywhere.
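
As a rough illustration of that comparison, here is a toy tally of per-job failure rates from (job, result) records; the sample records are made up, and real data would come from the check-queue results (e.g. via logstash):

    from collections import Counter

    # Made-up sample records; job names other than the neutron full job are
    # placeholders for whichever non-neutron jobs get compared.
    runs = [
        ("check-tempest-dsvm-neutron-full", "FAILURE"),
        ("check-tempest-dsvm-neutron-full", "SUCCESS"),
        ("check-tempest-dsvm-full", "SUCCESS"),
        ("check-tempest-dsvm-full", "SUCCESS"),
    ]

    totals = Counter(job for job, _ in runs)
    failures = Counter(job for job, result in runs if result == "FAILURE")

    for job in sorted(totals):
        rate = 100.0 * failures[job] / totals[job]
        print("%-35s %5.1f%% failures over %d runs" % (job, rate, totals[job]))
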

>
> As much as I hate asymmetric gating, I think this is a good compromise to
> avoid developers working on other projects being badly affected by the
> higher failure rate in the neutron full job.

So we discussed this during the project meeting a couple of weeks ago [3] and
there was general agreement that doing it asymmetrically at first would be
better. Everyone should be wary of the potential harms of doing it
asymmetrically, and I think priority will be given to fixing issues that block
the neutron gate should they arise.

> I will therefore resume work on [2] and remove the WIP status as soon as I
> can confirm a failure rate below 15% with more data points.
>

Thanks for keeping on top of this, Salvatore. It'll be good to finally be at
least partially gating with a parallel job.

-Matt Treinish

>
> [1] https://review.openstack.org/#/c/103865/
> [2] https://review.openstack.org/#/c/88289/
[3] http://eavesdrop.openstack.org/meetings/project/2014/project.2014-07-08-21.03.log.html#l-28

>
> On 10 July 2014 11:49, Salvatore Orlando <sorlando@nicira.com> wrote:
>
> >
> > On 10 July 2014 11:27, Ihar Hrachyshka <ihrachys@redhat.com> wrote:
> >
> >>
> >> On 10/07/14 11:07, Salvatore Orlando wrote:
> >> > The patch for bug 1329564 [1] merged about 11 hours ago. From [2]
> >> > it seems there has been an improvement in the failure rate, which
> >> > seems to have dropped to 25% from over 40%. Still, since the patch
> >> > merged there have already been 11 failures in the full job out of
> >> > 42 jobs executed in total. Of these 11 failures:
> >> > - 3 were due to problems in the patches being tested
> >> > - 1 had the same root cause as bug 1329564. Indeed the related job
> >> >   started before the patch merged but finished after, so this
> >> >   failure "doesn't count".
> >> > - 1 was for an issue introduced about a week ago which is actually
> >> >   causing a lot of failures in the full job [3]. The fix should be
> >> >   easy; however, given the nature of the test, we might even skip
> >> >   it while it's being fixed.
> >> > - 3 were for bug 1333654 [4]; for this bug a discussion is going
> >> >   on in gerrit regarding the most suitable approach.
> >> > - 3 were for lock wait timeout errors. Several people in the
> >> >   community are already working on them. I hope this will raise
> >> >   the profile of this issue (some might think it's just a corner
> >> >   case, as it rarely causes failures in smoke jobs, whereas the
> >> >   truth is that the error occurs but does not cause a job failure
> >> >   because the job isn't parallel).
> >>
> >> Can you give directions on where to find those lock timeout failures?
> >> I'd like to check logs to see whether they have the same nature as
> >> most other failures (e.g. improper yield under transaction).
> >>
> >
> > This logstash query will give you all occurrences of lock wait timeout
> > issues: message:"(OperationalError) (1205, 'Lock wait timeout exceeded; try
> > restarting transaction')" AND tags:"screen-q-svc.txt"
> >
> > The fact that in most cases the build succeeds anyway is misleading,
> > because in many cases these errors occur in RPC handling between agents
> > and servers, and are therefore not detected by tempest. The neutron full
> > job increases their occurrence because of its parallelism - and since
> > API requests also occur concurrently, it yields a higher tempest build
> > failure rate as well.
> >
> > However, as I argued in the past, the "lock wait timeout" error should
> > always be treated as an error condition.
> > Eugene has already classified the lock wait timeout failures and filed
> > bugs for them a few weeks ago.
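
As a side note, here is a minimal sketch (not Neutron code; the table and the notify_agents() helper are hypothetical) of the "improper yield under transaction" pattern Ihar mentions above, which is one way these lock wait timeouts arise once API workers run concurrently:

    import sqlalchemy as sa

    # SQLite stands in for MySQL/InnoDB just so the sketch runs; the lock
    # wait timeout itself (MySQL error 1205) only appears on a real MySQL
    # backend under concurrent access.
    engine = sa.create_engine("sqlite://")
    metadata = sa.MetaData()
    ports = sa.Table("ports", metadata,
                     sa.Column("id", sa.String, primary_key=True),
                     sa.Column("status", sa.String))
    metadata.create_all(engine)

    def notify_agents(port_id):
        """Hypothetical RPC notification; under eventlet this call yields."""
        pass

    def update_port_status_bad(port_id):
        # Anti-pattern: the row lock taken by the UPDATE stays held while
        # notify_agents() yields to other greenthreads; a concurrent worker
        # touching the same row can then hit "(OperationalError) (1205,
        # 'Lock wait timeout exceeded; try restarting transaction')".
        with engine.begin() as conn:
            conn.execute(ports.update()
                         .where(ports.c.id == port_id)
                         .values(status="ACTIVE"))
            notify_agents(port_id)

    def update_port_status_better(port_id):
        # Keep the transaction short: commit first, notify afterwards.
        with engine.begin() as conn:
            conn.execute(ports.update()
                         .where(ports.c.id == port_id)
                         .values(status="ACTIVE"))
        notify_agents(port_id)
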
> >
> >
> >> > Summarizing, I think the time is not yet ripe to enable the full job;
> >> > once bug 1333654 is fixed, we should go for it. AFAIK there is no way
> >> > to work around it in gate tests other than disabling nova/neutron
> >> > event reporting, which I guess we don't want to do.
> >> >
> >> > Salvatore
> >> >
> >> > [1] https://review.openstack.org/#/c/105239
> >> > [2] http://logstash.openstack.org/#eyJzZWFyY2giOiJidWlsZF9zdGF0dXM6RkFJTFVSRSBBTkQgbWVzc2FnZTpcIkZpbmlzaGVkOiBGQUlMVVJFXCIgQU5EIGJ1aWxkX25hbWU6XCJjaGVjay10ZW1wZXN0LWRzdm0tbmV1dHJvbi1mdWxsXCIgQU5EIGJ1aWxkX2JyYW5jaDpcIm1hc3RlclwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiIxNzI4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsiZnJvbSI6IjIwMTQtMDctMTBUMDA6MjQ6NTcrMDA6MDAiLCJ0byI6IjIwMTQtMDctMTBUMDg6MjQ6NTMrMDA6MDAiLCJ1c2VyX2ludGVydmFsIjoiMCJ9LCJzdGFtcCI6MTQwNDk4MjU2MjM2OCwibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==
> >> > [3] http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiSFRUUEJhZFJlcXVlc3Q6IFVucmVjb2duaXplZCBhdHRyaWJ1dGUocykgJ21lbWJlciwgdmlwLCBwb29sLCBoZWFsdGhfbW9uaXRvcidcIiBBTkQgdGFnczpcInNjcmVlbi1xLXN2Yy50eHRcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiY3VzdG9tIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7ImZyb20iOiIyMDE0LTA3LTAxVDA4OjU5OjAxKzAwOjAwIiwidG8iOiIyMDE0LTA3LTEwVDA4OjU5OjAxKzAwOjAwIiwidXNlcl9pbnRlcnZhbCI6IjAifSwic3RhbXAiOjE0MDQ5ODI3OTc3ODAsIm1vZGUiOiIiLCJhbmFseXplX2ZpZWxkIjoiIn0=
> >> > [4] https://bugs.launchpad.net/nova/+bug/1333654
> >> >
> >> > On 2 July 2014 17:57, Salvatore Orlando <sorlando@nicira.com> wrote:
> >> >
> >> >> Hi again,
> >> >>
> >> >> From my analysis, most of the failures affecting the neutron full
> >> >> job are because of bugs [1] and [2], for which patches [3] and [4]
> >> >> have been proposed. Both patches address the nova side of the
> >> >> neutron/nova notification system for vif plugging. It is worth
> >> >> noting that these bugs manifested only in the neutron full job not
> >> >> because of its "full" nature, but because of its "parallel" nature.
> >> >>
> >> >> Openstackers with a good memory will probably remember we fixed
> >> >> the parallel job back in January, before the massive "kernel bug"
> >> >> gate outage [5]. However, since parallel testing was unfortunately
> >> >> never enabled on the smoke job we run in the gate, we allowed new
> >> >> bugs to slip in. For this reason I would recommend the following:
> >> >> - once patches [3] and [4] have been reviewed and merged, re-assess
> >> >>   the neutron full job failure rate over a period of 48 hours (72
> >> >>   if the period includes at least 24 hours within a weekend - GMT
> >> >>   time)
> >> >> - turn the neutron full job to voting if the previous step reveals
> >> >>   a failure rate below 10%, otherwise go back to the drawing board
> >> >>
> >> >> In my opinion, whether the full job should be enabled in an
> >> >> asymmetric fashion or not should be a decision for the QA and
> >> >> Infra teams. Once the full job is made voting there will inevitably
> >> >> be a higher failure rate. An asymmetric gate will not cause
> >> >> backlogs on other projects, and hence fewer angry people, but as
> >> >> Matt said it will still allow other bugs to slip in. Personally
> >> >> I'm ok either way.
> >> >>
> >> >> The reason why we're expecting a higher failure rate on the full
> >> >> job is that we have already observed that some "known" bugs, such
> >> >> as the various lock timeout issues affecting neutron, tend to show
> >> >> up with a higher frequency on the full job because of its parallel
> >> >> nature.
> >> >>
> >> >> Salvatore
> >> >>
> >> >> [1] https://launchpad.net/bugs/1329546
> >> >> [2] https://launchpad.net/bugs/1333654
> >> >> [3] https://review.openstack.org/#/c/99182/
> >> >> [4] https://review.openstack.org/#/c/103865/
> >> >> [5] https://bugs.launchpad.net/neutron/+bug/1273386
> >> >>
> >> >> On 25 June 2014 23:38, Matthew Treinish <mtreinish@kortar.org> wrote:
> >> >>
> >> >>> On Tue, Jun 24, 2014 at 02:14:16PM +0200, Salvatore Orlando wrote:
> >> >>>> There is a long-standing patch [1] for enabling the neutron full
> >> >>>> job. A little before the Icehouse release date, when we first
> >> >>>> pushed this, the neutron full job had a failure rate of less
> >> >>>> than 10%. However, time has gone by, and since perceived failure
> >> >>>> rates were higher, we ran this analysis again.
> >> >>>
> >> >>> So I'm not exactly a fan of having the gates be asymmetrical.
> >> >>> It's very easy for breaks that block the neutron gate to slip in
> >> >>> if the job is not voting everywhere, especially because I think
> >> >>> most people have been trained to ignore the full job since it's
> >> >>> been non-voting for so long. Is there a particular reason we
> >> >>> don't just switch everything all at once? I think having a little
> >> >>> bit of friction everywhere during the migration is fine,
> >> >>> especially if we do it well before a milestone (as opposed to the
> >> >>> original parallel switch, which was right before H-3).
> >> >>>
> >> >>>>
> >> >>>> Here are the findings in a nutshell:
> >> >>>> 1) If we were to enable the job today we might expect about a
> >> >>>> 3-fold increase in neutron job failures compared with the smoke
> >> >>>> test. This is unfortunately not acceptable, and we therefore
> >> >>>> need to identify and fix the issues causing the additional
> >> >>>> failure rate.
> >> >>>> 2) However, this also puts us in a position where, if we wait
> >> >>>> until the failure rate drops under a given threshold, we might
> >> >>>> end up chasing a moving target, as new issues might be
> >> >>>> introduced at any time since the job is not voting.
> >> >>>> 3) When it comes to evaluating failure rates for a non-voting
> >> >>>> job, taking the rough numbers does not mean anything, as that
> >> >>>> would take into account patches 'in progress' which end up
> >> >>>> failing the tests because of problems in the patches themselves.
> >> >>>>
> >> >>>> Well, that was pretty much a lot for a "nutshell"; however, if
> >> >>>> you're not yet bored to death, please go on reading.
> >> >>>>
> >> >>>> The data in this post are a bit skewed because of a rise in
> >> >>>> neutron job failures in the past 36 hours. However, this rise
> >> >>>> affects both the full and the smoke job, so it does not
> >> >>>> invalidate what we say here. The results shown below are
> >> >>>> representative of the gate status 12 hours ago.
> >> >>>>
> >> >>>> - Neutron smoke job failure rates (all queues):
> >> >>>>   24 hours: 22.4%; 48 hours: 19.3%; 7 days: 8.96%
> >> >>>> - Neutron smoke job failure rates (gate queue only):
> >> >>>>   24 hours: 10.41%; 48 hours: 10.20%; 7 days: 3.53%
> >> >>>> - Neutron full job failure rate (check queue only, as it's non-voting):
> >> >>>>   24 hours: 31.54%; 48 hours: 28.87%; 7 days: 25.73%
> >> >>>>
> >> >>>> Check/gate ratio between neutron smoke failures:
> >> >>>>   24 hours: 2.15; 48 hours: 1.89; 7 days: 2.53
> >> >>>>
> >> >>>> Estimated failure rate for the neutron full job if it were to
> >> >>>> run in the gate:
> >> >>>>   24 hours: 14.67%; 48 hours: 15.27%; 7 days: 10.16%
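
For what it's worth, the estimate above appears to be the full job's check-queue failure rate divided by the smoke job's check/gate ratio (an inference from the figures, not a formula stated in the thread); a small sketch of that calculation with the numbers quoted above:

    # Project a gate failure rate for the neutron full job from check-queue
    # data, using the smoke job's check/gate failure ratio as the scaling
    # factor. Figures are the 24h / 48h / 7d numbers quoted above, in percent.
    smoke_check = {"24h": 22.4, "48h": 19.3, "7d": 8.96}    # smoke, all queues
    smoke_gate = {"24h": 10.41, "48h": 10.20, "7d": 3.53}   # smoke, gate queue
    full_check = {"24h": 31.54, "48h": 28.87, "7d": 25.73}  # full, check queue

    for window in ("24h", "48h", "7d"):
        ratio = smoke_check[window] / smoke_gate[window]    # check/gate ratio
        estimate = full_check[window] / ratio               # projected gate rate
        print("%s: ratio %.2f -> estimated full-job gate failure rate %.2f%%"
              % (window, ratio, estimate))
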
> >> >>>>
> >> >>>> The numbers are therefore not terrible, but definitely not good
> >> >>>> enough; looking at the last 7 days, the full job would have a
> >> >>>> failure rate about 3 times higher than the smoke job.
> >> >>>>
> >> >>>> We then took, as is usual for us when we do this kind of
> >> >>>> evaluation, a window with a reasonable number of failures (41 in
> >> >>>> our case), and analysed them in detail.
> >> >>>>
> >> >>>> Of these 41 failures, 17 were excluded because of infra problems,
> >> >>>> patches 'in progress', or other transient failures; considering
> >> >>>> that over the same period of time 160 full job runs succeeded,
> >> >>>> this leaves us with 24 failures out of 184 runs, and therefore a
> >> >>>> failure rate of 13.04%, which is not far from the estimate.
> >> >>>>
> >> >>>> Let's now consider these 24 'real' failures:
> >> >>>> A) 2 were for the SSH timeout (8.33% of failures, 1.08% of total
> >> >>>>    full job runs). This specific failure is being analyzed to
> >> >>>>    see if a specific fingerprint can be found.
> >> >>>> B) 2 (8.33% of failures, 1.08% of total full job runs) were for
> >> >>>>    a failure in the load balancer basic test, which is actually
> >> >>>>    a test design issue and is already being addressed [2].
> >> >>>> C) 7 (29.16% of failures, 3.81% of total full job runs) were for
> >> >>>>    an issue while resizing a server, which has already been
> >> >>>>    spotted and has a bug in progress [3].
> >> >>>> D) 5 (20.83% of failures, 2.72% of total full job runs)
> >> >>>>    manifested as a failure in test_server_address; however, the
> >> >>>>    actual root cause was being masked by [4]. A bug has been
> >> >>>>    filed [5]; this is the most worrying one in my opinion, as
> >> >>>>    there are many cases where the fault happens but does not
> >> >>>>    trigger a failure because of the way tempest tests are
> >> >>>>    designed.
> >> >>>> E) 6 were because of our friend the lock wait timeout. This was
> >> >>>>    initially filed as [6], but since then we've closed it to
> >> >>>>    file more detailed bug reports, as the lock wait timeout can
> >> >>>>    manifest in various places; Eugene is leading the effort on
> >> >>>>    this problem with Kevin B.
> >> >>>>
> >> >>>>
> >> >>>> Summarizing, the only failure modes specific to the full job
> >> >>>> seem to be C & D. If we were able to fix those, we should
> >> >>>> reasonably expect a failure rate of about 6.5%. That's still
> >> >>>> almost twice that of the smoke job, but I deem it acceptable for
> >> >>>> two reasons:
> >> >>>> 1- by voting, we will prevent new bugs affecting the full job
> >> >>>> from being introduced. It is worth reminding people that any bug
> >> >>>> affecting the full job is likely to affect production
> >> >>>> environments.
> >> >>>
> >> >>> +1, this is a very good point.
> >> >>>
> >> >>>> 2- patches failing in the gate will spur neutron developers to
> >> >>>> quickly find a fix. Patches failing a non-voting job will cause
> >> >>>> some neutron core team members to write long and boring posts to
> >> >>>> the mailing list.
> >> >>>>
> >> >>>
> >> >>> Well, you can always hope. :) But, in my experience, the error is
> >> >>> often fixed quickly but the lesson isn't learned, so it will just
> >> >>> happen again. That's why I think we should just grit our teeth
> >> >>> and turn it on everywhere.
> >> >>>
> >> >>>> Salvatore
> >> >>>>
> >> >>>> [1] https://review.openstack.org/#/c/88289/
> >> >>>> [2] https://review.openstack.org/#/c/98065/
> >> >>>> [3] https://bugs.launchpad.net/nova/+bug/1329546
> >> >>>> [4] https://bugs.launchpad.net/tempest/+bug/1332414
> >> >>>> [5] https://bugs.launchpad.net/nova/+bug/1333654
> >> >>>> [6] https://bugs.launchpad.net/nova/+bug/1283522
> >> >>>
> >> >>> Very cool, thanks for the update Salvatore. I'm very excited to
> >> >>> get this voting.
> >> >>>
> >> >>> -Matt Treinish
> >> >>>