<div dir="ltr">The patch for bug 1329564 [1] merged about 11 hours ago.<div>From [2] it seems the failure rate has improved, dropping to about 25% from over 40%.</div><div>Still, since the patch merged there have already been 11 failures in the full job out of 42 jobs executed in total.</div>
<div>Of these 11 failures:</div><div>- 3 were due to problems in the patches being tested</div><div>- 1 had the same root cause as bug 1329564. Indeed the related job started before the patch merged but finished after. So this failure "doesn't count".</div>
<div>- 1 was for an issue introduced about a week ago which is actually causing a lot of failures in the full job [3]. The fix should be easy; however, given the nature of the test, we might even skip it until it's fixed.</div>
<div>- 3 were for bug 1333654 [4]; discussion on the most suitable approach for this bug is ongoing on gerrit.</div><div>- 3 were for lock wait timeout errors. Several people in the community are already working on them. I hope this will raise the profile of this issue (some might think it's just a corner case as it rarely causes failures in smoke jobs, whereas the truth is that the error does occur there but does not cause a job failure because those jobs aren't parallel).</div>
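<div>As a rough cross-check of the list above, here is a small Python sketch that tallies the 11 failures by cause and recomputes the rate. The category labels are just shorthand of mine, and dropping the non-counting failures from the denominator is an assumption that follows the convention used in the earlier analysis in this thread.</div>
<pre>
```python
# Failure counts for the full job since the patch merged, as listed above.
# The labels are illustrative shorthand, not identifiers from any tool.
failures = {
    "problem in the patch under test": 3,
    "bug 1329564 (job started before the fix merged)": 1,
    "week-old regression [3]": 1,
    "bug 1333654 [4]": 3,
    "lock wait timeout": 3,
}
total_jobs = 42

# Causes that say nothing about the health of the job itself.
not_counted = {
    "problem in the patch under test",
    "bug 1329564 (job started before the fix merged)",
}


def rate(failed, total):
    """Failure rate as a percentage."""
    return 100.0 * failed / total


all_failures = sum(failures.values())
real_failures = sum(n for cause, n in failures.items() if cause not in not_counted)

# Assumption: runs whose failure "doesn't count" are dropped from the total too.
counted_jobs = total_jobs - (all_failures - real_failures)

raw_rate = rate(all_failures, total_jobs)
adjusted_rate = rate(real_failures, counted_jobs)

print(f"raw: {all_failures}/{total_jobs} = {raw_rate:.1f}%")
print(f"adjusted: {real_failures}/{counted_jobs} = {adjusted_rate:.1f}%")
```
</pre>
<div>With these figures the raw rate is about 26%, and leaving out the 4 failures that don't count gives 7 failures out of 38 runs, roughly 18%.</div>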
<div><br></div><div>Summarizing, I think the time is not yet ripe to enable the full job; once bug 1333654 is fixed, we should go for it. AFAIK there is no way to work around it in gate tests other than disabling nova/neutron event reporting, which I guess we don't want to do.</div>
<div><br></div><div>Salvatore</div><div><br></div><div>[1] <a href="https://review.openstack.org/#/c/105239">https://review.openstack.org/#/c/105239</a></div><div>[2] <a href="http://logstash.openstack.org/#eyJzZWFyY2giOiJidWlsZF9zdGF0dXM6RkFJTFVSRSBBTkQgbWVzc2FnZTpcIkZpbmlzaGVkOiBGQUlMVVJFXCIgQU5EIGJ1aWxkX25hbWU6XCJjaGVjay10ZW1wZXN0LWRzdm0tbmV1dHJvbi1mdWxsXCIgQU5EIGJ1aWxkX2JyYW5jaDpcIm1hc3RlclwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiIxNzI4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsiZnJvbSI6IjIwMTQtMDctMTBUMDA6MjQ6NTcrMDA6MDAiLCJ0byI6IjIwMTQtMDctMTBUMDg6MjQ6NTMrMDA6MDAiLCJ1c2VyX2ludGVydmFsIjoiMCJ9LCJzdGFtcCI6MTQwNDk4MjU2MjM2OCwibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==">http://logstash.openstack.org/#eyJzZWFyY2giOiJidWlsZF9zdGF0dXM6RkFJTFVSRSBBTkQgbWVzc2FnZTpcIkZpbmlzaGVkOiBGQUlMVVJFXCIgQU5EIGJ1aWxkX25hbWU6XCJjaGVjay10ZW1wZXN0LWRzdm0tbmV1dHJvbi1mdWxsXCIgQU5EIGJ1aWxkX2JyYW5jaDpcIm1hc3RlclwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiIxNzI4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsiZnJvbSI6IjIwMTQtMDctMTBUMDA6MjQ6NTcrMDA6MDAiLCJ0byI6IjIwMTQtMDctMTBUMDg6MjQ6NTMrMDA6MDAiLCJ1c2VyX2ludGVydmFsIjoiMCJ9LCJzdGFtcCI6MTQwNDk4MjU2MjM2OCwibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==</a></div>
<div>[3] <a href="http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiSFRUUEJhZFJlcXVlc3Q6IFVucmVjb2duaXplZCBhdHRyaWJ1dGUocykgJ21lbWJlciwgdmlwLCBwb29sLCBoZWFsdGhfbW9uaXRvcidcIiBBTkQgdGFnczpcInNjcmVlbi1xLXN2Yy50eHRcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiY3VzdG9tIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7ImZyb20iOiIyMDE0LTA3LTAxVDA4OjU5OjAxKzAwOjAwIiwidG8iOiIyMDE0LTA3LTEwVDA4OjU5OjAxKzAwOjAwIiwidXNlcl9pbnRlcnZhbCI6IjAifSwic3RhbXAiOjE0MDQ5ODI3OTc3ODAsIm1vZGUiOiIiLCJhbmFseXplX2ZpZWxkIjoiIn0=">http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiSFRUUEJhZFJlcXVlc3Q6IFVucmVjb2duaXplZCBhdHRyaWJ1dGUocykgJ21lbWJlciwgdmlwLCBwb29sLCBoZWFsdGhfbW9uaXRvcidcIiBBTkQgdGFnczpcInNjcmVlbi1xLXN2Yy50eHRcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiY3VzdG9tIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7ImZyb20iOiIyMDE0LTA3LTAxVDA4OjU5OjAxKzAwOjAwIiwidG8iOiIyMDE0LTA3LTEwVDA4OjU5OjAxKzAwOjAwIiwidXNlcl9pbnRlcnZhbCI6IjAifSwic3RhbXAiOjE0MDQ5ODI3OTc3ODAsIm1vZGUiOiIiLCJhbmFseXplX2ZpZWxkIjoiIn0=</a></div>
<div>[4] <a href="https://bugs.launchpad.net/nova/+bug/1333654">https://bugs.launchpad.net/nova/+bug/1333654</a></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On 2 July 2014 17:57, Salvatore Orlando <span dir="ltr"><<a href="mailto:sorlando@nicira.com" target="_blank">sorlando@nicira.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi again,<div><br></div><div>From my analysis, most of the failures affecting the neutron full job are caused by bugs [1] and [2], for which patches [3] and [4] have been proposed.</div>
<div>Both patches address the nova side of the neutron/nova notification system for vif plugging.</div>
<div>It is worth noting that these bugs did manifest only in the neutron full job not because of its "full" nature, but because of its "parallel" nature.</div><div><br></div><div>Openstackers with a good memory will probably remember we fixed the parallel job back in January, before the massive "kernel bug" gate outage [5]. However, since parallel testing was unfortunately never enabled on the smoke job we run on the gate, we allowed new bugs to slip in.</div>
<div>For this reason I would recommend the following:</div><div>- once patches [3] and [4] have been reviewed and merged, re-assess the neutron full job failure rate over a period of 48 hours (72 if the period includes at least 24 hours within a weekend, GMT)</div>
<div>- turn neutron full job to voting if the previous step reveals a failure rate below 10%, otherwise go back to the drawing board</div><div><br></div><div>In my opinion whether the full job should be enabled in an asymmetric fashion or not should be a decision for the QA and Infra teams. Once the full job is made voting there will inevitably be a higher failure rate. An asymmetric gate will not cause backlogs on other projects, so less angry people, but as Matt said it will still allow other bugs to slip in. Personally I'm ok either way.</div>
<div><br></div><div>The reason why we're expecting a higher failure rate on the full job is that we have already observed that some "known" bugs, such as the various lock timeout issues affecting neutron, tend to show up with a higher frequency on the full job because of its parallel nature.</div>
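<div><br></div><div>The re-assessment rule recommended above could be sketched roughly as follows. This is only an illustration: the function names are made up, and reading "at least 24 hours within a weekend" as "at least 24 of the first 48 hours fall on a Saturday or Sunday, GMT" is my assumption.</div>
<pre>
```python
from datetime import datetime, timedelta, timezone

VOTING_THRESHOLD = 10.0  # percent, from the recommendation above


def weekend_hours(start, hours):
    """Count the hours of the window starting at `start` that fall on a
    Saturday or Sunday (weekday() returns 5 and 6 for those days)."""
    return sum(
        1
        for h in range(hours)
        if (start + timedelta(hours=h)).weekday() in (5, 6)
    )


def observation_hours(start):
    """48 hours, extended to 72 if at least 24 of them fall in a weekend."""
    return 72 if weekend_hours(start, 48) >= 24 else 48


def should_vote(failed_runs, total_runs):
    """True only if the observed failure rate is strictly below 10%."""
    return VOTING_THRESHOLD > 100.0 * failed_runs / total_runs


start = datetime(2014, 7, 4, tzinfo=timezone.utc)  # a Friday, midnight GMT
print(observation_hours(start))
print(should_vote(24, 184))
```
</pre>
<div>A window starting on Friday 4 July 2014 at midnight GMT has its second day entirely on Saturday, so under this reading it would be extended to 72 hours. With 24 failures on 184 runs, the figures from the earlier analysis in this thread, the rate is about 13% and the job would not be made voting.</div>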
<div><br></div><div>Salvatore</div><div><br></div><div>[1] <a href="https://launchpad.net/bugs/1329546" target="_blank">https://launchpad.net/bugs/1329546</a></div><div>[2] <a href="https://launchpad.net/bugs/1333654" target="_blank">https://launchpad.net/bugs/1333654</a></div>
<div>[3] <a href="https://review.openstack.org/#/c/99182/" target="_blank">https://review.openstack.org/#/c/99182/</a></div><div>[4] <a href="https://review.openstack.org/#/c/103865/" target="_blank">https://review.openstack.org/#/c/103865/</a></div>
<div>
[5] <a href="https://bugs.launchpad.net/neutron/+bug/1273386" target="_blank">https://bugs.launchpad.net/neutron/+bug/1273386</a><br></div><div><br></div><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">
<div><div class="h5">On 25 June 2014 23:38, Matthew Treinish <span dir="ltr"><<a href="mailto:mtreinish@kortar.org" target="_blank">mtreinish@kortar.org</a>></span> wrote:<br>
</div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5"><div>On Tue, Jun 24, 2014 at 02:14:16PM +0200, Salvatore Orlando wrote:<br>
> There is a long standing patch [1] for enabling the neutron full job.<br>
> Shortly before the Icehouse release date, when we first pushed this, the<br>
> neutron full job had a failure rate of less than 10%. However, since time has<br>
> gone by, and perceived failure rates were higher, we ran this analysis<br>
> again.<br>
<br>
</div>So I'm not exactly a fan of having the gates be asymmetrical. It's very easy<br>
for breaks to slip in blocking the neutron gate if it's not voting everywhere.<br>
Especially because I think most people have been trained to ignore the full<br>
job because it's been nonvoting for so long. Is there a particular reason we<br>
just don't switch everything all at once? I think having a little bit of<br>
friction everywhere during the migration is fine. Especially if we do it way<br>
before a milestone. (as opposed to the original parallel switch which was right<br>
before H-3)<br>
<div><div><br>
><br>
> Here are the findings in a nutshell.<br>
> 1) If we were to enable the job today we might expect about a 3-fold<br>
> increase in neutron job failures when compared with the smoke test. This is<br>
> unfortunately not acceptable and we therefore need to identify and fix the<br>
> issues causing the additional failure rate.<br>
> 2) However this also puts us in a position where if we wait until the<br>
> failure rate drops under a given threshold we might end up chasing a moving<br>
> target as new issues might be introduced at any time since the job is not<br>
> voting.<br>
> 3) When it comes to evaluating failure rates for a non voting job, taking<br>
> the rough numbers does not mean anything, as that will take into account<br>
> patches 'in progress' which end up failing the tests because of problems in<br>
> the patches themselves.<br>
><br>
> Well, that was pretty much a lot for a "nutshell"; however if you're not<br>
> yet bored to death please go on reading.<br>
><br>
> The data in this post are a bit skewed because of a rise in neutron job<br>
> failures in the past 36 hours. However, this rise affects both the full and<br>
> the smoke job so it does not invalidate what we say here. The results shown<br>
> below are representative of the gate status 12 hours ago.<br>
><br>
> - Neutron smoke job failure rates (all queues)<br>
> 24 hours: 22.4% 48 hours: 19.3% 7 days: 8.96%<br>
> - Neutron smoke job failure rates (gate queue only):<br>
> 24 hours: 10.41% 48 hours: 10.20% 7 days: 3.53%<br>
> - Neutron full job failure rate (check queue only as it's non voting):<br>
> 24 hours: 31.54% 48 hours: 28.87% 7 days: 25.73%<br>
><br>
> Check/Gate Ratio between neutron smoke failures<br>
> 24 hours: 2.15 48 hours: 1.89 7 days: 2.53<br>
><br>
> Estimated job failure rate for neutron full job if it were to run in the<br>
> gate:<br>
> 24 hours: 14.67% 48 hours: 15.27% 7 days: 10.16%<br>
><br>
> The numbers are therefore not terrible, but definitely not good enough;<br>
> looking at the last 7 days the full job will have a failure rate about 3<br>
> times higher than the smoke job.<br>
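<br>
The estimate quoted above can be reproduced with a short Python sketch. It assumes, as the figures suggest, that the check/gate ratio is the smoke job's all-queues rate divided by its gate-only rate, and that the full job's check-queue rate is divided by that ratio; the variable and function names are illustrative only.<br>
<pre>
```python
# Failure rates (percent) copied from the figures quoted above.
smoke_all_queues = {"24 hours": 22.4, "48 hours": 19.3, "7 days": 8.96}
smoke_gate_only = {"24 hours": 10.41, "48 hours": 10.20, "7 days": 3.53}
full_check_only = {"24 hours": 31.54, "48 hours": 28.87, "7 days": 25.73}


def estimate_gate_rate(window):
    """Scale the full job's check-queue rate by the smoke job's
    check/gate ratio for the same window (an assumed method)."""
    ratio = smoke_all_queues[window] / smoke_gate_only[window]
    return full_check_only[window] / ratio


estimates = {w: estimate_gate_rate(w) for w in full_check_only}
for window, value in estimates.items():
    print(f"{window}: {value:.2f}%")

# How the 7-day estimate compares with the smoke job in the gate.
seven_day_factor = estimates["7 days"] / smoke_gate_only["7 days"]
print(f"7-day full/smoke factor: {seven_day_factor:.1f}")
```
</pre>
The results land within a few hundredths of a percentage point of the quoted estimates; the small gap comes from using the unrounded ratio here.<br>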
><br>
> We then took, as it's usual for us when we do this kind of evaluation, a<br>
> window with a reasonable number of failures (41 in our case), and analysed<br>
> them in detail.<br>
><br>
> Of these 41 failures 17 were excluded because of infra problems, patches<br>
> 'in progress', or other transient failures; considering that over the same<br>
> period of time 160 full job runs succeeded this would leave us with 24<br>
> failures on 184 runs, and therefore a failure rate of 13.04%, which is not far<br>
> from the estimate.<br>
><br>
> Let's consider now these 24 'real' failures:<br>
> A) 2 were for the SSH timeout (8.33% of failures, 1.08% of total full job<br>
> runs). This specific failure is being analyzed to see if a specific<br>
> fingerprint can be found<br>
> B) 2 (8.33% of failures, 1.08% of total full job runs) were for a failure<br>
> in test load balancer basic, which is actually a test design issue and is<br>
> already being addressed [2]<br>
> C) 7 (29.16% of failures, 3.81% of total full job runs) were for an issue<br>
> while resizing a server, which has been already spotted and has a bug in<br>
> progress [3]<br>
> D) 5 (20.83% of failures, 2.72% of total full job runs) manifested as a<br>
> failure in test_server_address; however the actual root cause was being<br>
> masked by [4]. A bug has been filed [5]; this is the most worrying one in<br>
> my opinion as there are many cases where the fault happens but does not<br>
> trigger a failure because of the way tempest tests are designed.<br>
> E) 6 are because of our friend lock wait timeout. This was initially filed<br>
> as [6] but since then we've closed it to file more detailed bug reports as<br>
> the lock wait timeout can manifest in various places; Eugene is leading the<br>
> effort on this problem with Kevin B.<br>
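<br>
The arithmetic behind the breakdown above can be checked with a small Python sketch. The counts come from the quoted analysis; the labels are shorthand, and the last step (treating the C and D runs as successes with the total unchanged) is a hypothetical scenario of mine.<br>
<pre>
```python
# Counts taken from the analysis quoted above; labels are shorthand.
total_failures = 41
excluded = 17  # infra problems, patches in progress, transient failures
successful_runs = 160

real_failures = total_failures - excluded
runs = successful_runs + real_failures

# Note: the listed categories sum to 22; the message does not say
# what the remaining 2 'real' failures were.
by_cause = {
    "A: SSH timeout": 2,
    "B: load balancer basic test": 2,
    "C: server resize": 7,
    "D: test_server_address": 5,
    "E: lock wait timeout": 6,
}


def pct(part, whole):
    """Share of `part` in `whole` as a percentage."""
    return 100.0 * part / whole


overall_rate = pct(real_failures, runs)
print(f"{real_failures} failures on {runs} runs: {overall_rate:.2f}%")
for cause, count in by_cause.items():
    print(
        f"{cause}: {pct(count, real_failures):.2f}% of failures, "
        f"{pct(count, runs):.2f}% of runs"
    )

# Hypothetical: C and D fixed, their runs counted as successes instead.
remaining = real_failures - by_cause["C: server resize"] - by_cause["D: test_server_address"]
remaining_rate = pct(remaining, runs)
print(f"without C and D: {remaining_rate:.2f}%")
```
</pre>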
><br>
><br>
> Summarizing, the only failure modes specific to the full job seem to be C &<br>
> D. If we were able to fix those we should reasonably expect a failure rate<br>
> of about 6.5%. That's still almost twice that of the smoke job, but I deem it<br>
> acceptable for two reasons:<br>
> 1- by voting, we will avoid new bugs affecting the full job from being<br>
> introduced. It is worth reminding people that any bug affecting the full<br>
> job is likely to affect production environments<br>
<br>
</div></div>+1, this is a very good point.<br>
<div><br>
> 2- patches failing in the gate will spur neutron developers to quickly find<br>
> a fix. Patches failing a non voting job will cause some neutron core team<br>
> members to write long and boring posts to the mailing list.<br>
><br>
<br>
</div>Well, you can always hope. :) But, in my experience the error is often fixed<br>
quickly but the lesson isn't learned, so it will just happen again. That's why<br>
I think we should just grit our teeth and turn it on everywhere.<br>
<div><br>
> Salvatore<br>
><br>
><br>
><br>
><br>
> [1] <a href="https://review.openstack.org/#/c/88289/" target="_blank">https://review.openstack.org/#/c/88289/</a><br>
> [2] <a href="https://review.openstack.org/#/c/98065/" target="_blank">https://review.openstack.org/#/c/98065/</a><br>
> [3] <a href="https://bugs.launchpad.net/nova/+bug/1329546" target="_blank">https://bugs.launchpad.net/nova/+bug/1329546</a><br>
> [4] <a href="https://bugs.launchpad.net/tempest/+bug/1332414" target="_blank">https://bugs.launchpad.net/tempest/+bug/1332414</a><br>
> [5] <a href="https://bugs.launchpad.net/nova/+bug/1333654" target="_blank">https://bugs.launchpad.net/nova/+bug/1333654</a><br>
> [6] <a href="https://bugs.launchpad.net/nova/+bug/1283522" target="_blank">https://bugs.launchpad.net/nova/+bug/1283522</a><br>
<br>
</div>Very cool, thanks for the update Salvatore. I'm very excited to get this voting.<br>
<br>
<br>
-Matt Treinish<br>
<br></div></div>
<br></blockquote></div><br></div>
</blockquote></div><br></div>