[openstack-dev] [Neutron][QA] Enabling full neutron Job

Salvatore Orlando sorlando at nicira.com
Wed Jul 23 12:40:02 UTC 2014


Here I am again bothering you with the state of the full job for Neutron.

The patch fixing an issue in nova's server external events extension
merged yesterday [1].
We do not yet have enough data points to make a reliable assessment, but
out of 37 runs since the patch merged, we had "only" 5 failures, which puts
the failure rate at about 13%.

This is ugly compared with the current failure rate of the smoke test (3%).
However, I think it is good enough to start making the full job voting, at
least for neutron patches.
Once we are able to bring the failure rate down to around 5%, we can then
enable the job everywhere.

As much as I hate asymmetric gating, I think this is a good compromise to
avoid developers working on other projects being badly affected by the
higher failure rate of the neutron full job.
I will therefore resume work on [2] and remove the WIP status as soon as I
can confirm a failure rate below 15% with more data points.

Salvatore

[1] https://review.openstack.org/#/c/103865/
[2] https://review.openstack.org/#/c/88289/


On 10 July 2014 11:49, Salvatore Orlando <sorlando at nicira.com> wrote:

>
>
>
> On 10 July 2014 11:27, Ihar Hrachyshka <ihrachys at redhat.com> wrote:
>
>> On 10/07/14 11:07, Salvatore Orlando wrote:
>> > The patch for bug 1329564 [1] merged about 11 hours ago. From [2]
>> > it seems there has been an improvement in the failure rate, which
>> > seems to have dropped to 25% from over 40%. Still, since the patch
>> > merged there have already been 11 failures in the full job out of
>> > 42 jobs executed in total. Of these 11 failures:
>> > - 3 were due to problems in the patches being tested
>> > - 1 had the same root cause as bug 1329564. Indeed the related job
>> >   started before the patch merged but finished after, so this
>> >   failure "doesn't count".
>> > - 1 was for an issue introduced about a week ago which is actually
>> >   causing a lot of failures in the full job [3]. The fix should be
>> >   easy; however, given the nature of the test, we might even skip
>> >   it while it is being fixed.
>> > - 3 were for bug 1333654 [4]; for this bug a discussion is going
>> >   on on gerrit regarding the most suitable approach.
>> > - 3 were for lock wait timeout errors. Several people in the
>> >   community are already working on them. I hope this will raise
>> >   the profile of this issue (some might think it's just a corner
>> >   case, as it rarely causes failures in smoke jobs, whereas the
>> >   truth is that the error occurs but does not cause a job failure
>> >   because the job isn't parallel).
>>
>> Can you give directions on where to find those lock timeout failures?
>> I'd like to check logs to see whether they have the same nature as
>> most other failures (e.g. improper yield under transaction).
>>
>
> This logstash query will give you all occurrences of lock wait timeout
> issues: message:"(OperationalError) (1205, 'Lock wait timeout exceeded; try
> restarting transaction')" AND tags:"screen-q-svc.txt"
>
> The fact that in most cases the build succeeds anyway is misleading,
> because in many cases these errors occur in RPC handling between agents
> and servers, and are therefore not detected by tempest. The neutron full
> job, being parallel, increases their occurrence - and since API requests
> also run concurrently, it yields a higher tempest build failure rate.
>
> However, as I argued in the past, the "lock wait timeout" error should
> always be treated as an error condition.
> Eugene has already classified lock wait timeout failures and filed bugs
> for them a few weeks ago.
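>
> As an illustration of the "improper yield under transaction" pattern
> mentioned above, the sketch below shows the generic shape of the problem
> (plain SQLAlchemy/eventlet code, not actual Neutron code; the Port model
> and the notify_agent_via_rpc helper are made up for the example):
>
>     import eventlet
>     eventlet.monkey_patch()
>
>     from sqlalchemy import create_engine, Column, Integer, String
>     from sqlalchemy.ext.declarative import declarative_base
>     from sqlalchemy.orm import sessionmaker
>
>     Base = declarative_base()
>
>     class Port(Base):
>         __tablename__ = 'ports'
>         id = Column(Integer, primary_key=True)
>         status = Column(String(16))
>
>     # Connection string is only a placeholder for a MySQL/InnoDB database.
>     engine = create_engine('mysql://user:password@localhost/example')
>     Session = sessionmaker(bind=engine)
>
>     def update_port_status(port_id, new_status):
>         session = Session()
>         # SELECT ... FOR UPDATE: the row lock is held until commit/rollback.
>         port = (session.query(Port)
>                 .filter_by(id=port_id)
>                 .with_for_update()
>                 .one())
>         port.status = new_status
>         # Yielding to another greenthread here (an RPC notification, any
>         # eventlet-aware I/O) keeps the transaction open, so a concurrent
>         # worker touching the same row blocks and can eventually fail with
>         # "Lock wait timeout exceeded; try restarting transaction".
>         notify_agent_via_rpc(port)  # hypothetical helper that yields
>         session.commit()
>         # Safer pattern: commit first, send the notification afterwards.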
>
>
>> >
>> > Summarizing, I think the time is not yet ripe to enable the full
>> > job; once bug 1333654 is fixed, we should go for it. AFAIK there is
>> > no way to work around it in gate tests other than disabling
>> > nova/neutron event reporting, which I guess we don't want to do.
>> >
>> > Salvatore
>> >
>> > [1] https://review.openstack.org/#/c/105239
>> > [2] http://logstash.openstack.org/#eyJzZWFyY2giOiJidWlsZF9zdGF0dXM6RkFJTFVSRSBBTkQgbWVzc2FnZTpcIkZpbmlzaGVkOiBGQUlMVVJFXCIgQU5EIGJ1aWxkX25hbWU6XCJjaGVjay10ZW1wZXN0LWRzdm0tbmV1dHJvbi1mdWxsXCIgQU5EIGJ1aWxkX2JyYW5jaDpcIm1hc3RlclwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiIxNzI4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsiZnJvbSI6IjIwMTQtMDctMTBUMDA6MjQ6NTcrMDA6MDAiLCJ0byI6IjIwMTQtMDctMTBUMDg6MjQ6NTMrMDA6MDAiLCJ1c2VyX2ludGVydmFsIjoiMCJ9LCJzdGFtcCI6MTQwNDk4MjU2MjM2OCwibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==
>> > [3] http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiSFRUUEJhZFJlcXVlc3Q6IFVucmVjb2duaXplZCBhdHRyaWJ1dGUocykgJ21lbWJlciwgdmlwLCBwb29sLCBoZWFsdGhfbW9uaXRvcidcIiBBTkQgdGFnczpcInNjcmVlbi1xLXN2Yy50eHRcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiY3VzdG9tIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7ImZyb20iOiIyMDE0LTA3LTAxVDA4OjU5OjAxKzAwOjAwIiwidG8iOiIyMDE0LTA3LTEwVDA4OjU5OjAxKzAwOjAwIiwidXNlcl9pbnRlcnZhbCI6IjAifSwic3RhbXAiOjE0MDQ5ODI3OTc3ODAsIm1vZGUiOiIiLCJhbmFseXplX2ZpZWxkIjoiIn0=
>> > [4] https://bugs.launchpad.net/nova/+bug/1333654
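>> >
>> > (As a side note, the text after the '#' in the logstash links above
>> > appears to be base64-encoded JSON describing the search; if you want
>> > to read the query without opening the page, something along these
>> > lines works:)
>> >
>> >     import base64, json
>> >
>> >     # Paste the full text that follows the '#' in the link here.
>> >     fragment = "eyJzZWFyY2giOiJ..."
>> >     print(json.dumps(json.loads(base64.b64decode(fragment)), indent=2))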
>> >
>> >
>> > On 2 July 2014 17:57, Salvatore Orlando <sorlando at nicira.com>
>> > wrote:
>> >
>> >> Hi again,
>> >>
>> >> From my analysis, most of the failures affecting the neutron full
>> >> job are because of bugs [1] and [2], for which patches [3] and [4]
>> >> have been proposed. Both patches address the nova side of the
>> >> neutron/nova notification system for vif plugging. It is worth
>> >> noting that these bugs manifested only in the neutron full job not
>> >> because of its "full" nature, but because of its "parallel"
>> >> nature.
>> >>
>> >> Openstackers with a good memory will probably remember we fixed
>> >> the parallel job back in January, before the massive "kernel bug"
>> >> gate outage [5]. However, since parallel testing was unfortunately
>> >> never enabled on the smoke job we run on the gate, we allowed new
>> >> bugs to slip in. For this reason I would recommend the following:
>> >> - once patches [3] and [4] have been reviewed and merged,
>> >>   re-assess the neutron full job failure rate over a period of 48
>> >>   hours (72 if the period includes at least 24 hours within a
>> >>   weekend - GMT time)
>> >> - make the neutron full job voting if the previous step reveals a
>> >>   failure rate below 10%, otherwise go back to the drawing board
>> >>
>> >> In my opinion, whether the full job should be enabled in an
>> >> asymmetric fashion or not should be a decision for the QA and
>> >> Infra teams. Once the full job is made voting there will
>> >> inevitably be a higher failure rate. An asymmetric gate will not
>> >> cause backlogs on other projects, and therefore fewer angry
>> >> people, but as Matt said it will still allow other bugs to slip
>> >> in. Personally I'm ok either way.
>> >>
>> >> The reason we're expecting a higher failure rate on the full job
>> >> is that we have already observed that some "known" bugs, such as
>> >> the various lock timeout issues affecting neutron, tend to show up
>> >> with higher frequency on the full job because of its parallel
>> >> nature.
>> >>
>> >> Salvatore
>> >>
>> >> [1] https://launchpad.net/bugs/1329546
>> >> [2] https://launchpad.net/bugs/1333654
>> >> [3] https://review.openstack.org/#/c/99182/
>> >> [4] https://review.openstack.org/#/c/103865/
>> >> [5] https://bugs.launchpad.net/neutron/+bug/1273386
>> >>
>> >>
>> >>
>> >>
>> >> On 25 June 2014 23:38, Matthew Treinish <mtreinish at kortar.org>
>> >> wrote:
>> >>
>> >>> On Tue, Jun 24, 2014 at 02:14:16PM +0200, Salvatore Orlando
>> >>> wrote:
>> >>>> There is a long-standing patch [1] for enabling the neutron full
>> >>>> job. A little before the Icehouse release date, when we first
>> >>>> pushed this, the neutron full job had a failure rate of less
>> >>>> than 10%. However, time has gone by since then, and perceived
>> >>>> failure rates were higher, so we ran this analysis again.
>> >>>
>> >>> So I'm not exactly a fan of having the gates be asymmetrical.
>> >>> It's very easy for breaks that block the neutron gate to slip in
>> >>> if the job isn't voting everywhere, especially because I think
>> >>> most people have been trained to ignore the full job since it's
>> >>> been non-voting for so long. Is there a particular reason we
>> >>> don't just switch everything all at once? I think having a
>> >>> little bit of friction everywhere during the migration is fine,
>> >>> especially if we do it well before a milestone (as opposed to
>> >>> the original parallel switch, which was right before H-3).
>> >>>
>> >>>>
>> >>>> Here are the findings in a nutshell.
>> >>>> 1) If we were to enable the job today we might expect about a
>> >>>> 3-fold increase in neutron job failures compared with the smoke
>> >>>> test. This is unfortunately not acceptable, and we therefore
>> >>>> need to identify and fix the issues causing the additional
>> >>>> failures.
>> >>>> 2) However, this also puts us in a position where, if we wait
>> >>>> until the failure rate drops under a given threshold, we might
>> >>>> end up chasing a moving target, as new issues might be
>> >>>> introduced at any time while the job is not voting.
>> >>>> 3) When evaluating failure rates for a non-voting job, the raw
>> >>>> numbers do not mean much, as they include patches 'in progress'
>> >>>> which fail the tests because of problems in the patches
>> >>>> themselves.
>> >>>>
>> >>>> Well, that was pretty much a lot for a "nutshell"; however if
>> >>>> you're not yet bored to death please go on reading.
>> >>>>
>> >>>> The data in this post are a bit skewed because of a rise in
>> >>>> neutron job failures in the past 36 hours. However, this rise
>> >>>> affects both the full and the smoke job, so it does not
>> >>>> invalidate what we say here. The results shown below are
>> >>>> representative of the gate status 12 hours ago.
>> >>>>
>> >>>> - Neutron smoke job failure rates (all queues):
>> >>>>   24 hours: 22.4%; 48 hours: 19.3%; 7 days: 8.96%
>> >>>> - Neutron smoke job failure rates (gate queue only):
>> >>>>   24 hours: 10.41%; 48 hours: 10.20%; 7 days: 3.53%
>> >>>> - Neutron full job failure rate (check queue only, as it's
>> >>>>   non-voting):
>> >>>>   24 hours: 31.54%; 48 hours: 28.87%; 7 days: 25.73%
>> >>>>
>> >>>> Check/gate ratio between neutron smoke failures:
>> >>>>   24 hours: 2.15; 48 hours: 1.89; 7 days: 2.53
>> >>>>
>> >>>> Estimated job failure rate for the neutron full job if it were
>> >>>> to run in the gate:
>> >>>>   24 hours: 14.67%; 48 hours: 15.27%; 7 days: 10.16%
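>> >>>>
>> >>>> (For clarity, the estimate above is simply the full job's
>> >>>> check-queue failure rate scaled down by the smoke job's
>> >>>> check/gate ratio; a rough sketch of the arithmetic, using the
>> >>>> figures from the tables above:)
>> >>>>
>> >>>>     smoke_check = {'24h': 22.4, '48h': 19.3, '7d': 8.96}
>> >>>>     smoke_gate  = {'24h': 10.41, '48h': 10.20, '7d': 3.53}
>> >>>>     full_check  = {'24h': 31.54, '48h': 28.87, '7d': 25.73}
>> >>>>
>> >>>>     for window in ('24h', '48h', '7d'):
>> >>>>         ratio = smoke_check[window] / smoke_gate[window]
>> >>>>         print(window, round(full_check[window] / ratio, 2))
>> >>>>     # prints roughly 14.66, 15.26 and 10.14 - matching the
>> >>>>     # estimates above up to rounding of the intermediate ratios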
>> >>>>
>> >>>> The numbers are therefore not terrible, but definitely not good
>> >>>> enough; looking at the last 7 days, the full job would have a
>> >>>> failure rate about 3 times higher than the smoke job.
>> >>>>
>> >>>> We then took, as is usual for us when we do this kind of
>> >>>> evaluation, a window with a reasonable number of failures (41 in
>> >>>> our case), and analysed them in detail.
>> >>>>
>> >>>> Of these 41 failures, 17 were excluded because of infra
>> >>>> problems, patches 'in progress', or other transient failures;
>> >>>> considering that over the same period of time 160 full job runs
>> >>>> succeeded, this leaves us with 24 failures out of 184 runs, and
>> >>>> therefore a failure rate of 13.04%, which is not far from the
>> >>>> estimate.
>> >>>>
>> >>>> Let's now consider these 24 'real' failures:
>> >>>> A) 2 were for the SSH timeout (8.33% of failures, 1.08% of total
>> >>>> full job runs). This specific failure is being analyzed to see
>> >>>> if a specific fingerprint can be found.
>> >>>> B) 2 (8.33% of failures, 1.08% of total full job runs) were for
>> >>>> a failure in the load balancer basic test, which is actually a
>> >>>> test design issue and is already being addressed [2].
>> >>>> C) 7 (29.16% of failures, 3.81% of total full job runs) were for
>> >>>> an issue while resizing a server, which has already been spotted
>> >>>> and has a bug in progress [3].
>> >>>> D) 5 (20.83% of failures, 2.72% of total full job runs)
>> >>>> manifested as a failure in test_server_address; however, the
>> >>>> actual root cause was being masked by [4]. A bug has been filed
>> >>>> [5]; this is the most worrying one in my opinion, as there are
>> >>>> many cases where the fault happens but does not trigger a
>> >>>> failure because of the way tempest tests are designed.
>> >>>> E) 6 were because of our friend the lock wait timeout. This was
>> >>>> initially filed as [6], but we have since closed it to file more
>> >>>> detailed bug reports, as the lock wait timeout can manifest in
>> >>>> various places; Eugene is leading the effort on this problem
>> >>>> with Kevin B.
>> >>>>
>> >>>>
>> >>>> Summarizing, the only failure modes specific to the full job
>> >>>> seem to be C & D. If we were able to fix those, we should
>> >>>> reasonably expect a failure rate of about 6.5%. That's still
>> >>>> almost twice that of the smoke job, but I deem it acceptable for
>> >>>> two reasons:
>> >>>> 1- by voting, we will prevent new bugs affecting the full job
>> >>>> from being introduced. It is worth reminding people that any bug
>> >>>> affecting the full job is likely to affect production
>> >>>> environments
>> >>>
>> >>> +1, this is a very good point.
>> >>>
>> >>>> 2- patches failing in the gate will spur neutron developers to
>> >>>> quickly find a fix. Patches failing a non-voting job will cause
>> >>>> some neutron core team members to write long and boring posts to
>> >>>> the mailing list.
>> >>>>
>> >>>
>> >>> Well, you can always hope. :) But, in my experience the error
>> >>> is often fixed quickly but the lesson isn't learned, so it will
>> >>> just happen again. That's why I think we should just grit our
>> >>> teeth and turn it on everywhere.
>> >>>
>> >>>> Salvatore
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> [1] https://review.openstack.org/#/c/88289/
>> >>>> [2] https://review.openstack.org/#/c/98065/
>> >>>> [3] https://bugs.launchpad.net/nova/+bug/1329546
>> >>>> [4] https://bugs.launchpad.net/tempest/+bug/1332414
>> >>>> [5] https://bugs.launchpad.net/nova/+bug/1333654
>> >>>> [6] https://bugs.launchpad.net/nova/+bug/1283522
>> >>>
>> >>> Very cool, thanks for the update Salvatore. I'm very excited to
>> >>> get this voting.
>> >>>
>> >>>
>> >>> -Matt Treinish
>> >>>