[openstack-dev] [Neutron][QA] Enabling full neutron Job

Salvatore Orlando sorlando at nicira.com
Thu Jul 10 09:07:26 UTC 2014


The patch for bug 1329564 [1] merged about 11 hours ago.
From [2] it seems there has been an improvement in the failure rate, which
appears to have dropped from over 40% to about 25%.
Still, since the patch merged there have already been 11 failures in the
full job out of 42 runs in total (a rough tally follows the breakdown below).
Of these 11 failures:
- 3 were due to problems in the patches being tested
- 1 had the same root cause as bug 1329564. The related job started before
the patch merged but finished after it, so this failure "doesn't count".
- 1 was for an issue introduced about a week ago which is actually causing a
lot of failures in the full job [3]. The fix should be easy; however, given
the nature of the test, we might even skip it while it's being fixed.
- 3 were for bug 1333654 [4]; for this bug, discussion is ongoing on Gerrit
regarding the most suitable approach.
- 3 were for lock wait timeout errors. Several people in the community are
already working on them. I hope this will raise the profile of this issue
(some might think it's just a corner case, as it rarely causes failures in
smoke jobs, whereas the truth is that the error occurs there too but does
not cause job failures because the job isn't parallel).
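
For what it's worth, here is a minimal Python sketch of the tally above; the
exclusion of failures that "don't count" from both numerator and denominator
mirrors the approach of the analysis quoted further down, and the script
itself is purely illustrative:

    # Rough tally of the 42 full-job runs since the fix merged.
    total_runs = 42
    failures = 11
    # Failures that "don't count": 3 caused by the patches under test and
    # 1 from a job that started before the fix merged.
    excluded = 3 + 1

    raw_rate = 100.0 * failures / total_runs
    adjusted_rate = 100.0 * (failures - excluded) / (total_runs - excluded)
    print("raw failure rate:      %.1f%%" % raw_rate)       # ~26.2%
    print("adjusted failure rate: %.1f%%" % adjusted_rate)  # ~18.4%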

Summarizing, I think the time is not yet ripe to enable the full job; once
bug 1333654 is fixed, we should go for it. AFAIK there is no way of working
around it in gate tests other than disabling nova/neutron event reporting,
which I guess we don't want to do.

Salvatore

[1] https://review.openstack.org/#/c/105239
[2]
http://logstash.openstack.org/#eyJzZWFyY2giOiJidWlsZF9zdGF0dXM6RkFJTFVSRSBBTkQgbWVzc2FnZTpcIkZpbmlzaGVkOiBGQUlMVVJFXCIgQU5EIGJ1aWxkX25hbWU6XCJjaGVjay10ZW1wZXN0LWRzdm0tbmV1dHJvbi1mdWxsXCIgQU5EIGJ1aWxkX2JyYW5jaDpcIm1hc3RlclwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiIxNzI4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsiZnJvbSI6IjIwMTQtMDctMTBUMDA6MjQ6NTcrMDA6MDAiLCJ0byI6IjIwMTQtMDctMTBUMDg6MjQ6NTMrMDA6MDAiLCJ1c2VyX2ludGVydmFsIjoiMCJ9LCJzdGFtcCI6MTQwNDk4MjU2MjM2OCwibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==
[3]
http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiSFRUUEJhZFJlcXVlc3Q6IFVucmVjb2duaXplZCBhdHRyaWJ1dGUocykgJ21lbWJlciwgdmlwLCBwb29sLCBoZWFsdGhfbW9uaXRvcidcIiBBTkQgdGFnczpcInNjcmVlbi1xLXN2Yy50eHRcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiY3VzdG9tIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7ImZyb20iOiIyMDE0LTA3LTAxVDA4OjU5OjAxKzAwOjAwIiwidG8iOiIyMDE0LTA3LTEwVDA4OjU5OjAxKzAwOjAwIiwidXNlcl9pbnRlcnZhbCI6IjAifSwic3RhbXAiOjE0MDQ5ODI3OTc3ODAsIm1vZGUiOiIiLCJhbmFseXplX2ZpZWxkIjoiIn0=
[4] https://bugs.launchpad.net/nova/+bug/1333654


On 2 July 2014 17:57, Salvatore Orlando <sorlando at nicira.com> wrote:

> Hi again,
>
> From my analysis most of the failures affecting the neutron full job are
> because of bugs [1] and [2], for which patches [3] and [4] have been proposed.
> Both patches address the nova side of the neutron/nova notification system
> for vif plugging.
> It is worth noting that these bugs manifested only in the neutron full
> job, not because of its "full" nature, but because of its "parallel" nature.
>
> Openstackers with a good memory will probably remember we fixed the
> parallel job back in January, before the massive "kernel bug" gate outage
> [5]. However, since parallel testing was unfortunately never enabled on the
> smoke job we run on the gate, we allowed new bugs to slip in.
> For this reason I would recommend the following:
> - once patches [3] and [4] have been reviewed and merged, re-assess the
> neutron full job failure rate over a period of 48 hours (72 if the period
> includes at least 24 hours within a weekend - GMT time)
> - turn the neutron full job voting if the previous step reveals a failure
> rate below 10%; otherwise go back to the drawing board
>
> In my opinion whether the full job should be enabled in an asymmetric
> fashion or not should be a decision for the QA and Infra teams. Once the
> full job is made voting there will inevitably be a higher failure rate. An
> asymmetric gate will not cause backlogs on other projects, so fewer angry
> people, but as Matt said it will still allow other bugs to slip in.
> Personally I'm ok either way.
>
> The reason we're expecting a higher failure rate on the full job is
> that we have already observed that some "known" bugs, such as the various
> lock timeout issues affecting neutron, tend to show up with a higher
> frequency on the full job because of its parallel nature.
>
> Salvatore
>
> [1] https://launchpad.net/bugs/1329546
> [2] https://launchpad.net/bugs/1333654
> [3] https://review.openstack.org/#/c/99182/
> [4] https://review.openstack.org/#/c/103865/
> [5] https://bugs.launchpad.net/neutron/+bug/1273386
>
>
>
>
> On 25 June 2014 23:38, Matthew Treinish <mtreinish at kortar.org> wrote:
>
>> On Tue, Jun 24, 2014 at 02:14:16PM +0200, Salvatore Orlando wrote:
>> > There is a long-standing patch [1] for enabling the neutron full job.
>> > A little before the Icehouse release date, when we first pushed this,
>> > the neutron full job had a failure rate of less than 10%. However, time
>> > has gone by since then, and perceived failure rates were higher, so we
>> > ran this analysis again.
>>
>> So I'm not exactly a fan of having the gates be asymmetrical. It's very
>> easy for breaks to slip in blocking the neutron gate if it's not voting
>> everywhere. Especially because I think most people have been trained to
>> ignore the full job because it's been nonvoting for so long. Is there a
>> particular reason we don't just switch everything all at once? I think
>> having a little bit of friction everywhere during the migration is fine,
>> especially if we do it way before a milestone (as opposed to the original
>> parallel switch, which was right before H-3).
>>
>> >
>> > Here are the findings in a nutshell.
>> > 1) If we were to enable the job today we might expect about a 3-fold
>> > increase in neutron job failures when compared with the smoke test. This
>> > is unfortunately not acceptable and we therefore need to identify and fix
>> > the issues causing the additional failure rate.
>> > 2) However, this also puts us in a position where, if we wait until the
>> > failure rate drops under a given threshold, we might end up chasing a
>> > moving target, as new issues might be introduced at any time since the
>> > job is not voting.
>> > 3) When it comes to evaluating failure rates for a non-voting job,
>> > taking the rough numbers does not mean anything, as that will take into
>> > account patches 'in progress' which end up failing the tests because of
>> > problems in the patches themselves.
>> > Well, that was pretty much a lot for a "nutshell"; however if you're not
>> > yet bored to death please go on reading.
>> >
>> > The data in this post are a bit skewed because of a rise in neutron job
>> > failures in the past 36 hours. However, this rise affects both the full
>> > and the smoke job so it does not invalidate what we say here. The results
>> > shown below are representative of the gate status 12 hours ago.
>> >
>> > - Neutron smoke job failure rates (all queues)
>> >   24 hours: 22.4% 48 hours: 19.3% 7 days: 8.96%
>> > - Neutron smoke job failure rates (gate queue only):
>> >   24 hours: 10.41% 48 hours: 10.20% 7 days: 3.53%
>> > - Neutron full job failure rate (check queue only as it's non voting):
>> >   24 hours: 31.54% 48 hours: 28.87% 7 days: 25.73%
>> >
>> > Check/Gate ratio between neutron smoke job failures:
>> > 24 hours: 2.15 48 hours: 1.89 7 days: 2.53
>> >
>> > Estimated job failure rate for neutron full job if it were to run in the
>> > gate:
>> > 24 hours: 14.67% 48 hours: 15.27% 7 days: 10.16%
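
In other words, the gate-queue estimate appears to be the full job's
check-queue failure rate divided by the smoke job's check/gate ratio. A
minimal, purely illustrative Python sketch with the figures quoted above:

    # Estimated gate-queue failure rate for the full job, assuming it would
    # improve over the check queue by the same ratio observed for the smoke
    # job.
    check_rate_full = {'24h': 31.54, '48h': 28.87, '7d': 25.73}      # percent
    check_gate_ratio_smoke = {'24h': 2.15, '48h': 1.89, '7d': 2.53}

    for window in ('24h', '48h', '7d'):
        estimate = check_rate_full[window] / check_gate_ratio_smoke[window]
        # prints roughly 14.67, 15.28 and 10.17; the small differences from
        # the quoted 14.67 / 15.27 / 10.16 come from rounding of the inputs
        print("%s: %.2f%%" % (window, estimate))
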
>> >
>> > The numbers are therefore not terrible, but definitely not good enough;
>> > looking at the last 7 days the full job will have a failure rate about 3
>> > times higher than the smoke job.
>> >
>> > We then took, as is usual for us when we do this kind of evaluation, a
>> > window with a reasonable number of failures (41 in our case), and
>> > analysed them in detail.
>> >
>> > Of these 41 failures, 17 were excluded because of infra problems, patches
>> > 'in progress', or other transient failures; considering that over the
>> > same period of time 160 full job runs succeeded, this would leave us with
>> > 24 failures out of 184 runs, and therefore a failure rate of 13.04%,
>> > which is not far from the estimate.
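
The arithmetic, as a minimal illustrative sketch:

    # 41 failures observed, 17 excluded as not attributable to the job
    # itself; 160 runs succeeded over the same window.
    real_failures = 41 - 17            # 24
    total_runs = 160 + real_failures   # 184
    print("%.2f%%" % (100.0 * real_failures / total_runs))  # 13.04%
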
>> >
>> > Let's now consider these 24 'real' failures:
>> > A) 2 were for the SSH timeout (8.33% of failures, 1.08% of total full
>> > job runs). This specific failure mode is being analyzed to see if a
>> > specific fingerprint can be found.
>> > B) 2 (8.33% of failures, 1.08% of total full job runs) were for a
>> > failure in test load balancer basic, which is actually a test design
>> > issue and is already being addressed [2].
>> > C) 7 (29.16% of failures, 3.81% of total full job runs) were for an
>> > issue while resizing a server, which has already been spotted and has a
>> > bug in progress [3].
>> > D) 5 (20.83% of failures, 2.72% of total full job runs) manifested as a
>> > failure in test_server_address; however, the actual root cause was being
>> > masked by [4]. A bug has been filed [5]; this is the most worrying one in
>> > my opinion, as there are many cases where the fault happens but does not
>> > trigger a failure because of the way tempest tests are designed.
>> > E) 6 were because of our friend the lock wait timeout. This was initially
>> > filed as [6], but since then we've closed it to file more detailed bug
>> > reports, as the lock wait timeout can manifest in various places; Eugene
>> > is leading the effort on this problem with Kevin B.
>> >
>> >
>> > Summarizing, the only failure modes specific to the full job seem to be
>> > C & D. If we were able to fix those, we should reasonably expect a
>> > failure rate of about 6.5%. That's still almost twice that of the smoke
>> > job, but I deem it acceptable for two reasons:
>> > 1- by voting, we will prevent new bugs affecting the full job from being
>> > introduced. It is worth reminding people that any bug affecting the full
>> > job is likely to affect production environments.
>>
>> +1, this is a very good point.
>>
>> > 2- patches failing in the gate will spur neutron developers to quickly
>> > find a fix. Patches failing a non-voting job will cause some neutron
>> > core team members to write long and boring posts to the mailing list.
>> >
>>
>> Well, you can always hope. :) But, in my experience the error is often
>> fixed quickly but the lesson isn't learned, so it will just happen again.
>> That's why I think we should just grit our teeth and turn it on everywhere.
>>
>> > Salvatore
>> >
>> >
>> >
>> >
>> > [1] https://review.openstack.org/#/c/88289/
>> > [2] https://review.openstack.org/#/c/98065/
>> > [3] https://bugs.launchpad.net/nova/+bug/1329546
>> > [4] https://bugs.launchpad.net/tempest/+bug/1332414
>> > [5] https://bugs.launchpad.net/nova/+bug/1333654
>> > [6] https://bugs.launchpad.net/nova/+bug/1283522
>>
>> Very cool, thanks for the update Salvatore. I'm very excited to get this
>> voting.
>>
>>
>> -Matt Treinish
>>
>> _______________________________________________
>> OpenStack-dev mailing list
>> OpenStack-dev at lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>>
>