[openstack-dev] [Neutron][QA] Enabling full neutron Job

Matthew Treinish mtreinish at kortar.org
Wed Jun 25 21:38:54 UTC 2014


On Tue, Jun 24, 2014 at 02:14:16PM +0200, Salvatore Orlando wrote:
> There is a long-standing patch [1] for enabling the neutron full job.
> Shortly before the Icehouse release date, when we first pushed this, the
> neutron full job had a failure rate of less than 10%. However, time has
> gone by since then, and perceived failure rates were higher, so we ran
> this analysis again.

So I'm not exactly a fan of having the gates be asymmetrical. It's very easy
for breaks that block the neutron gate to slip in if the job isn't voting
everywhere, especially since I think most people have been trained to ignore
the full job because it's been non-voting for so long. Is there a particular
reason we don't just switch everything all at once? I think having a little bit
of friction everywhere during the migration is fine, especially if we do it
well before a milestone (as opposed to the original parallel switch, which was
right before H-3).

> 
> Here are the findings in a nutshell.
> 1) If we were to enable the job today we might expect about a 3-fold
> increase in neutron job failures when compared with the smoke test. This is
> unfortunately not acceptable and we therefore need to identify and fix the
> issues causing the additional failure rate.
> 2) However, this also puts us in a position where, if we wait until the
> failure rate drops below a given threshold, we might end up chasing a moving
> target, as new issues might be introduced at any time since the job is not
> voting.
> 3) When it comes to evaluating failure rates for a non-voting job, taking
> the raw numbers does not mean anything, as that will take into account
> patches 'in progress' which end up failing the tests because of problems
> in the patches themselves.
> 
> Well, that was pretty much a lot for a "nutshell"; however, if you're not
> yet bored to death, please read on.
> 
> The data in this post are a bit skewed because of a rise in neutron job
> failures in the past 36 hours. However, this rise affects both the full and
> the smoke job so it does not invalidate what we say here. The results shown
> below are representative of the gate status 12 hours ago.
> 
> - Neutron smoke job failure rates (all queues):
>   24 hours: 22.4% 48 hours: 19.3% 7 days: 8.96%
> - Neutron smoke job failure rates (gate queue only):
>   24 hours: 10.41% 48 hours: 10.20% 7 days: 3.53%
> - Neutron full job failure rate (check queue only, as it's non-voting):
>   24 hours: 31.54% 48 hours: 28.87% 7 days: 25.73%
> 
> Check/gate ratio between neutron smoke failures:
> 24 hours: 2.15 48 hours: 1.89 7 days: 2.53
> 
> Estimated failure rate for the neutron full job if it were to run in the
> gate:
> 24 hours: 14.67% 48 hours: 15.27% 7 days: 10.16%
> 
> The numbers are therefore not terrible, but definitely not good enough;
> looking at the last 7 days, the full job would have a failure rate about 3
> times higher than the smoke job.
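
For anyone else following the arithmetic, my reading is that the estimate is
just the full job's check-queue failure rate scaled down by the smoke job's
check/gate ratio. A minimal sketch of that calculation (my own, with a made-up
helper name, not whatever script was actually used to pull the numbers):

    def estimated_gate_rate(full_check_rate, smoke_check_rate, smoke_gate_rate):
        # Assume the full job's check->gate ratio would match the smoke job's.
        check_gate_ratio = smoke_check_rate / smoke_gate_rate
        return full_check_rate / check_gate_ratio

    # 7 day window: full check 25.73%, smoke check 8.96% vs smoke gate 3.53%
    print(estimated_gate_rate(25.73, 8.96, 3.53))  # ~10.1%, close to the 10.16% above
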
> 
> We then took, as is usual for us when we do this kind of evaluation, a
> window with a reasonable number of failures (41 in our case), and analysed
> them in detail.
> 
> Of these 41 failures, 17 were excluded because of infra problems, patches
> 'in progress', or other transient failures; considering that over the same
> period of time 160 full job runs succeeded, this leaves us with 24
> failures out of 184 runs, and therefore a failure rate of 13.04%, which is
> not far from the estimate.
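
Just to write that accounting out (only my own back-of-the-envelope check,
nothing authoritative):

    excluded = 17                               # infra / in-progress / transient
    real_failures = 41 - excluded               # 24
    total_runs = 160 + real_failures            # 184 runs in the window
    print(100.0 * real_failures / total_runs)   # ~13.04%, as quoted above
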
> 
> Let's now consider these 24 'real' failures:
> A) 2 were for the SSH timeout (8.33% of failures, 1.08% of total full job
> runs). This specific failure is being analyzed to see if a specific
> fingerprint can be found.
> B) 2  (8.33% of failures, 1.08% of total full job runs) were for a failure
> in test load balancer basic, which is actually a test design issue and is
> already being addressed [2]
> C) 7 (29.16% of failures, 3.81% of total full job runs) were for an issue
> while resizing a server, which has already been spotted and has a bug in
> progress [3]
> D) 5 (20.83% of failures, 2.72% of total full job runs) manifested as a
> failure in test_server_address; however, the actual root cause was being
> masked by [4]. A bug has been filed [5]; this is the most worrying one in
> my opinion, as there are many cases where the fault happens but does not
> trigger a failure because of the way the tempest tests are designed.
> E) 6 were because of our friend the lock wait timeout. This was initially
> filed as [6], but since then we've closed it in order to file more detailed
> bug reports, as the lock wait timeout can manifest in various places; Eugene
> is leading the effort on this problem with Kevin B.
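
Doing the math on that breakdown (my own arithmetic, assuming the 184-run
window from above): the two failure modes specific to the full job, C and D,
account for 7 + 5 = 12 of the 24 real failures.

    full_job_specific = 7 + 5        # C (resize) + D (test_server_address)
    remaining = 24 - full_job_specific
    print(100.0 * remaining / 184)   # ~6.5%, which matches the figure below
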
> 
> 
> Summarizing, the only failure modes specific to the full job seem to be C &
> D. If we were able to fix those, we should reasonably expect a failure rate
> of about 6.5%. That's still almost twice that of the smoke job, but I deem it
> acceptable for two reasons:
> 1- by voting, we will prevent new bugs affecting the full job from being
> introduced. It is worth reminding people that any bug affecting the full
> job is likely to affect production environments

+1, this is a very good point. 

> 2- patches failing in the gate will spur neutron developers to quickly find
> a fix. Patches failing a non voting job will cause some neutron core team
> members to write long and boring posts to the mailing list.
> 

Well, you can always hope. :) But in my experience, the error is often fixed
quickly but the lesson isn't learned, so it will just happen again. That's why
I think we should just grit our teeth and turn it on everywhere.

> Salvatore
> 
> 
> 
> 
> [1] https://review.openstack.org/#/c/88289/
> [2] https://review.openstack.org/#/c/98065/
> [3] https://bugs.launchpad.net/nova/+bug/1329546
> [4] https://bugs.launchpad.net/tempest/+bug/1332414
> [5] https://bugs.launchpad.net/nova/+bug/1333654
> [6] https://bugs.launchpad.net/nova/+bug/1283522

Very cool, thanks for the update Salvatore. I'm very excited to get this voting.


-Matt Treinish