[openstack-dev] [Neutron][QA] Enabling full neutron Job

Salvatore Orlando sorlando at nicira.com
Tue Jun 24 12:38:13 UTC 2014


Oops... I forgot to mention that, in agreement with sdague, we won't enable
this job before Thursday June 26th anyway, in order to give the trusty update
a few days to settle down.

Salvatore


On 24 June 2014 14:14, Salvatore Orlando <sorlando at nicira.com> wrote:

> There is a long-standing patch [1] for enabling the neutron full job.
> Shortly before the Icehouse release date, when we first pushed this, the
> neutron full job had a failure rate of less than 10%. However, since time
> has gone by, and perceived failure rates were higher, we ran this analysis
> again.
>
> Here are the findings in a nutshell.
> 1) If we were to enable the job today we might expect about a 3-fold
> increase in neutron job failures when compared with the smoke test. This is
> unfortunately not acceptable and we therefore need to identify and fix the
> issues causing the additional failure rate.
> 2) However this also puts us in a position where if we wait until the
> failure rate drops under a given threshold we might end up chasing a moving
> target as new issues might be introduced at any time since the job is not
> voting.
> 3) When it comes to evaluating failure rates for a non-voting job, taking
> the raw numbers does not mean anything, as that will take into account
> patches 'in progress' which end up failing the tests because of problems in
> the patches themselves.
>
> Well, that was pretty much a lot for a "nutshell"; however if you're not
> yet bored to death please go on reading.
>
> The data in this post are a bit skewed because of a rise in neutron job
> failures in the past 36 hours. However, this rise affects both the full and
> the smoke job so it does not invalidate what we say here. The results shown
> below are representative of the gate status 12 hours ago.
>
> - Neutron smoke job failure rates (all queues)
>   24 hours: 22.4% 48 hours: 19.3% 7 days: 8.96%
> - Neutron smoke job failure rates (gate queue only):
>   24 hours: 10.41% 48 hours: 10.20% 7 days: 3.53%
> - Neutron full job failure rate (check queue only as it's non voting):
>   24 hours: 31.54% 48 hours: 28.87% 7 days: 25.73%
>
> Check/Gate Ratio between neutron smoke failures
> 24 hours: 2.15 48 hours: 1.89 7 days: 2.53
>
> Estimated job failure rate for neutron full job if it were to run in the
> gate:
> 24 hours: 14.67% 48 hours: 15.27% 7 days: 10.16%
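The estimate above can be reproduced with a short sketch. The post does not spell out the formula, so this assumes the full job's check-queue rate is divided by the smoke job's all-queues/gate ratio; the dictionary and function names are illustrative only. Results can differ from the quoted figures in the second decimal place, presumably because the ratio was rounded before dividing.

```python
# Failure rates (percent) quoted in the post, keyed by time window.
smoke_all = {"24h": 22.4, "48h": 19.3, "7d": 8.96}       # smoke job, all queues
smoke_gate = {"24h": 10.41, "48h": 10.20, "7d": 3.53}    # smoke job, gate queue only
full_check = {"24h": 31.54, "48h": 28.87, "7d": 25.73}   # full job, check queue only


def estimate_gate_rate(window):
    """Return (ratio, estimated gate failure rate) for one time window.

    Assumption: the full job would see the same all-queues/gate ratio
    as the smoke job does.
    """
    ratio = smoke_all[window] / smoke_gate[window]
    return ratio, full_check[window] / ratio


for window in ("24h", "48h", "7d"):
    ratio, estimate = estimate_gate_rate(window)
    print(f"{window}: ratio {ratio:.2f}, estimated gate failure rate {estimate:.2f}%")
```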
>
> The numbers are therefore not terrible, but definitely not good enough;
> looking at the last 7 days the full job will have a failure rate about 3
> times higher than the smoke job.
>
> We then took, as is usual for us when we do this kind of evaluation, a
> window with a reasonable number of failures (41 in our case), and analysed
> them in detail.
>
> Of these 41 failures, 17 were excluded because of infra problems, patches
> 'in progress', or other transient failures; considering that over the same
> period of time 160 full job runs succeeded, this would leave us with 24
> failures in 184 runs, and therefore a failure rate of 13.04%, which is not
> far from the estimate.
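The arithmetic behind the 13.04% figure, as a minimal sketch. The variable names are illustrative; the counts are the ones given in the paragraph above.

```python
total_failures = 41   # failures in the analysed window
excluded = 17         # infra problems, 'in progress' patches, transient failures
successes = 160       # full job runs that succeeded over the same period

# Excluded failures are dropped from both the numerator and the run count.
real_failures = total_failures - excluded
counted_runs = real_failures + successes
failure_rate = 100 * real_failures / counted_runs

print(f"{real_failures} failures in {counted_runs} runs: {failure_rate:.2f}%")
```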
>
> Let's consider now these 24 'real' failures:
> A) 2 were for the SSH timeout (8.33% of failures, 1.08% of total full job
> runs). This specific failure is being analyzed to see if a specific
> fingerprint can be found
> B) 2 (8.33% of failures, 1.08% of total full job runs) were for a failure
> in test load balancer basic, which is actually a test design issue and is
> already being addressed [2]
> C) 7 (29.16% of failures, 3.81% of total full job runs) were for an issue
> while resizing a server, which has already been spotted and has a bug in
> progress [3]
> D) 5 (20.83% of failures, 2.72% of total full job runs) manifested as a
> failure in test_server_address; however the actual root cause was being
> masked by [4]. A bug has been filed [5]; this is the most worrying one in
> my opinion as there are many cases where the fault happens but does not
> trigger a failure because of the way tempest tests are designed.
> E) 6 are because of our friend lock wait timeout. This was initially filed
> as [6] but since then we've closed it to file more detailed bug reports as
> the lock wait timeout can manifest in various places; Eugene is leading the
> effort on this problem with Kevin B.
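For reference, the per-category shares can be recomputed as follows. This is only a sketch: the category labels are shorthand for the items above, and the results may differ from the quoted percentages by a hundredth because of rounding. The five listed counts add up to 22, so two of the 24 failures are not itemised in the post.

```python
TOTAL_RUNS = 184     # runs counted in the analysed window
REAL_FAILURES = 24   # failures left after the exclusions

# Failure counts per category as listed in the post; labels are shorthand.
categories = {
    "A: SSH timeout": 2,
    "B: load balancer basic test": 2,
    "C: server resize": 7,
    "D: test_server_address": 5,
    "E: lock wait timeout": 6,
}


def shares(count):
    """Return (percent of real failures, percent of total runs)."""
    return 100 * count / REAL_FAILURES, 100 * count / TOTAL_RUNS


for label, count in categories.items():
    of_failures, of_runs = shares(count)
    print(f"{label}: {of_failures:.2f}% of failures, {of_runs:.2f}% of runs")
```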
>
>
> Summarizing, the only failure modes specific to the full job seem to be C &
> D. If we were able to fix those we should reasonably expect a failure rate
> of about 6.5%. That's still almost twice that of the smoke job, but I deem
> it acceptable for two reasons:
> 1- by voting, we will avoid new bugs affecting the full job from being
> introduced. It is worth reminding people that any bug affecting the full
> job is likely to affect production environments
> 2- patches failing in the gate will spur neutron developers to quickly
> find a fix. Patches failing a non voting job will cause some neutron core
> team members to write long and boring posts to the mailing list.
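The 6.5% projection follows from removing the C and D failures from the 24 'real' ones. A minimal sketch follows; comparing against the smoke job's 7-day gate rate of 3.53% is an assumption, since the post does not say which smoke figure "almost twice" refers to.

```python
TOTAL_RUNS = 184
REAL_FAILURES = 24
FULL_JOB_ONLY = {"C": 7, "D": 5}   # failure modes specific to the full job
SMOKE_GATE_7D = 3.53               # percent; assumed baseline for the comparison

# Failures that would remain if C and D were fixed.
remaining = REAL_FAILURES - sum(FULL_JOB_ONLY.values())
projected = 100 * remaining / TOTAL_RUNS

print(f"projected failure rate: {projected:.2f}%")
print(f"relative to smoke job: {projected / SMOKE_GATE_7D:.2f}x")
```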
>
> Salvatore
>
>
>
>
> [1] https://review.openstack.org/#/c/88289/
> [2] https://review.openstack.org/#/c/98065/
> [3] https://bugs.launchpad.net/nova/+bug/1329546
> [4] https://bugs.launchpad.net/tempest/+bug/1332414
> [5] https://bugs.launchpad.net/nova/+bug/1333654
> [6] https://bugs.launchpad.net/nova/+bug/1283522
>