Open Stack

Tue Jun 24 12:14:16 UTC 2014

There is a long-standing patch [1] for enabling the neutron full job.
Shortly before the Icehouse release date, when we first pushed this, the
neutron full job had a failure rate of less than 10%. However, since time
has gone by and perceived failure rates were higher, we ran this analysis
again.

Here are the findings in a nutshell.
1) If we were to enable the job today we might expect about a 3-fold
increase in neutron job failures when compared with the smoke test. This is
unfortunately not acceptable and we therefore need to identify and fix the
issues causing the additional failure rate.
2) However this also puts us in a position where if we wait until the
failure rate drops under a given threshold we might end up chasing a moving
target as new issues might be introduced at any time since the job is not
voting.
3) When it comes to evaluating failure rates for a non-voting job, taking
the raw numbers does not mean anything, as that will take into account
patches 'in progress' which end up failing the tests because of problems in
the patches themselves.

Well, that was pretty much a lot for a "nutshell"; however if you're not
yet bored to death please go on reading.

The data in this post are a bit skewed because of a rise in neutron job
failures in the past 36 hours. However, this rise affects both the full and
the smoke job so it does not invalidate what we say here. The results shown
below are representative of the gate status 12 hours ago.

- Neutron smoke job failure rates (all queues)
  24 hours: 22.4% 48 hours: 19.3% 7 days: 8.96%
- Neutron smoke job failure rates (gate queue only):
  24 hours: 10.41% 48 hours: 10.20% 7 days: 3.53%
- Neutron full job failure rate (check queue only as it's non voting):
  24 hours: 31.54% 48 hours: 28.87% 7 days: 25.73%

Check/Gate Ratio between neutron smoke failures
24 hours: 2.15 48 hours: 1.89 7 days: 2.53

Estimated job failure rate for neutron full job if it were to run in the
gate:
24 hours: 14.67% 48 hours: 15.27% 7 days: 10.16%
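
For reference, the estimate above can be reproduced with a few lines of
Python. This is only a sketch of the arithmetic, assuming the full job's gate
failure rate scales down from its check rate by the same ratio observed for
the smoke job (all queues vs gate only). The function and variable names are
made up for illustration.

```python
# Failure rates (percent) copied from the figures above.
smoke_all = {"24h": 22.4, "48h": 19.3, "7d": 8.96}      # smoke job, all queues
smoke_gate = {"24h": 10.41, "48h": 10.20, "7d": 3.53}   # smoke job, gate only
full_check = {"24h": 31.54, "48h": 28.87, "7d": 25.73}  # full job, check only


def estimate_gate_rate(full_check_rate, smoke_all_rate, smoke_gate_rate):
    """Divide the full job's check rate by the smoke job's check/gate ratio.

    Assumption: the full job would see the same check/gate ratio as the
    smoke job if it were voting.
    """
    ratio = smoke_all_rate / smoke_gate_rate
    return full_check_rate / ratio


estimates = {
    window: estimate_gate_rate(full_check[window], smoke_all[window], smoke_gate[window])
    for window in full_check
}

for window, rate in estimates.items():
    print(f"{window}: estimated gate failure rate {rate:.2f}%")
```

The results can differ from the figures quoted above in the second decimal,
because those were computed from the ratio already rounded to two decimals.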

The numbers are therefore not terrible, but definitely not good enough;
looking at the last 7 days the full job would have a failure rate about 3
times higher than the smoke job.

We then took, as is usual for us when we do this kind of evaluation, a
window with a reasonable number of failures (41 in our case), and analysed
them in detail.

Of these 41 failures 17 were excluded because of infra problems, patches
'in progress', or other transient failures; considering that over the same
period of time 160 full job runs succeeded this would leave us with 24
failures out of 184 runs, and therefore a failure rate of 13.04%, which is
not far from the estimate.
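
The arithmetic behind that figure, as a short Python sketch. It assumes the 17
excluded runs are dropped from the total as well, so the denominator is the
24 remaining failures plus the 160 successes; the variable names are only
illustrative.

```python
# Counts taken from the analysis window described above.
total_failures = 41
excluded = 17      # infra problems, patches 'in progress', transient failures
successes = 160

# Assumption: excluded runs are removed from the denominator too.
real_failures = total_failures - excluded
runs = real_failures + successes
failure_rate = 100.0 * real_failures / runs

print(f"{real_failures} failures out of {runs} runs: {failure_rate:.2f}%")
# prints "24 failures out of 184 runs: 13.04%"
```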

Let's now consider these 24 'real' failures:
A) 2 were for the SSH timeout (8.33% of failures, 1.08% of total full job
runs). This specific failure is being analyzed to see if a specific
fingerprint can be found.
B) 2 (8.33% of failures, 1.08% of total full job runs) were for a failure
in test load balancer basic, which is actually a test design issue and is
already being addressed [2].
C) 7 (29.16% of failures, 3.80% of total full job runs) were for an issue
while resizing a server, which has already been spotted and has a bug in
progress [3].
D) 5 (20.83% of failures, 2.72% of total full job runs) manifested as a
failure in test_server_address; however the actual root cause was being
masked by [4]. A bug has been filed [5]; this is the most worrying one in
my opinion as there are many cases where the fault happens but does not
trigger a failure because of the way tempest tests are designed.
E) 6 are because of our friend lock wait timeout. This was initially filed
as [6] but since then we've closed it to file more detailed bug reports as
the lock wait timeout can manifest in various places; Eugene is leading the
effort on this problem with Kevin B.
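
The percentages in the breakdown above can be recomputed as follows. This is
a minimal sketch using the counts from the list; the labels A to E mirror the
list entries and the helper name `shares` is made up here.

```python
TOTAL_RUNS = 184     # 24 'real' failures + 160 successful runs
REAL_FAILURES = 24

# Failure counts per category, as given in the list above.
counts = {"A": 2, "B": 2, "C": 7, "D": 5, "E": 6}


def shares(count):
    """Return (percent of real failures, percent of total full job runs)."""
    return 100.0 * count / REAL_FAILURES, 100.0 * count / TOTAL_RUNS


for label, count in counts.items():
    of_failures, of_runs = shares(count)
    print(f"{label}) {count}: {of_failures:.2f}% of failures, "
          f"{of_runs:.2f}% of total full job runs")

# The five categories listed account for this many of the 24 failures.
categorized = sum(counts.values())
print(f"categorized: {categorized} of {REAL_FAILURES}")
```

Values are rounded here, so the last digit may differ slightly from the
figures in the list.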

Summarizing, the only failure modes specific to the full job seem to be C &
D. If we were able to fix those we should reasonably expect a failure rate
of about 6.5%. That's still almost twice that of the smoke job, but I deem
it acceptable for two reasons:
1- by voting, we will prevent new bugs affecting the full job from being
introduced. It is worth reminding people that any bug affecting the full
job is likely to affect production environments.
2- patches failing in the gate will spur neutron developers to quickly find
a fix. Patches failing a non voting job will cause some neutron core team
members to write long and boring posts to the mailing list.
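
For completeness, the 6.5% projection mentioned above can be sketched in
Python as below. It assumes that the runs which failed because of C or D would
otherwise have succeeded, so the total number of runs stays at 184, and it
compares the result with the 7-day gate failure rate of the smoke job. The
variable names are illustrative only.

```python
TOTAL_RUNS = 184
REAL_FAILURES = 24
FULL_JOB_SPECIFIC = {"C": 7, "D": 5}   # failure modes specific to the full job
SMOKE_GATE_7D = 3.53                   # smoke job, gate queue, 7 days (percent)

# Assumption: once C and D are fixed, those runs count as successes.
remaining_failures = REAL_FAILURES - sum(FULL_JOB_SPECIFIC.values())
projected_rate = 100.0 * remaining_failures / TOTAL_RUNS
ratio_to_smoke = projected_rate / SMOKE_GATE_7D

print(f"projected full job failure rate: {projected_rate:.2f}%")
print(f"ratio to smoke job (gate, 7 days): {ratio_to_smoke:.2f}")
```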

Salvatore

[1] https://review.openstack.org/#/c/88289/
[2] https://review.openstack.org/#/c/98065/
[3] https://bugs.launchpad.net/nova/+bug/1329546
[4] https://bugs.launchpad.net/tempest/+bug/1332414
[5] https://bugs.launchpad.net/nova/+bug/1333654
[6] https://bugs.launchpad.net/nova/+bug/1283522

[openstack-dev] [Neutron][QA] Enabling full neutron Job
