[openstack-dev] [qa] [neutron] Neutron Full Parallel job very close to voting - call to arms by neutron team

Salvatore Orlando sorlando at nicira.com
Sun Feb 23 02:14:01 UTC 2014


I have tried to collect more information on neutron full job failures.

So far there have been 219 failures and 891 successes, for an overall
failure rate of 19.7%, which is in line with Sean's evaluation.
The count was performed exclusively on jobs executed against the master
branch.
The failure rate for stable/havana is higher; indeed the job there still
triggers bug 1273386 as it performs nbd mounting, and several fixes for the
l2/l3 agents were not backported (or not backportable).

It is worth noting that some of the failures were actually caused by infra
issues. Unfortunately, it is not obvious to me how to define a logstash
query for that. Nevertheless, it is better to err on the side of
safety and estimate the failure rate to be about 20%.
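
As a quick sanity check, the overall rate can be recomputed from the raw
counts quoted above. This is only a minimal sketch: the function name is
hypothetical and is not part of any gate or elastic-recheck tooling.

```python
def failure_rate(failures, successes):
    """Return failures as a fraction of all runs (failures + successes)."""
    total = failures + successes
    if total == 0:
        raise ValueError("no runs to evaluate")
    return failures / total


# Counts quoted above for neutron full job runs against master.
rate = failure_rate(219, 891)
print(f"overall failure rate: {rate:.1%}")
```

Rounding this up to "about 20%" is the conservative choice, since the raw
count still includes runs that failed for infra reasons.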

I then classified 63 failures, finding the following:
- 25 failures were due to infra issues and 1 failure was due to a flaw in a
patch, leaving 37 "real" failures to analyse
   * In the same timeframe 203 jobs succeeded, giving a potential failure
rate after excluding infra issues of 15.4% (37 out of 240 runs)
- 2 bugs were responsible for 25 of these 37 failures
   * they are the "SSH protocol banner issue" and the well-known DB lock
timeouts
- bug 1253896 (the infamous SSH timeout bug) was hit only twice. The
elastic recheck count is much higher because failures for the SSH protocol
banner error (1265495) are being classified as bug 1253896.
   * actually in the past 48 hours only 2 voting neutron jobs hit this
failure. This is probably a great improvement compared with a few weeks ago.
- Some failures are due to bugs already known and tracked; other failures
are due to bugs either unforeseen so far or not tracked. In the latter case
a bug report has been filed.

It seems therefore that there are two high priority bugs to address:
1) https://bugs.launchpad.net/neutron/+bug/1283522 (16 occurrences, 43.2%
of failures, 6.67% globally)
    * Check whether we can resume the discussion on splitting the API server
and the RPC server
2) https://bugs.launchpad.net/neutron/+bug/1265495 (9/37 = 24.3% of
failures, 3.75% globally)
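
For anyone who wants to re-derive the percentages, here is a rough sketch of
the arithmetic. It assumes that "globally" means relative to the 37 real
failures plus the 203 successes from the same timeframe (240 runs); the
variable and function names are purely illustrative.

```python
# Figures taken from the classification above.
classified = 63   # failures inspected by hand
infra = 25        # failures caused by infra issues
bad_patch = 1     # failure caused by a flawed patch
successes = 203   # jobs that succeeded in the same timeframe

real = classified - infra - bad_patch   # "real" failures left to analyse
runs = real + successes                 # assumed denominator for "globally"

# Occurrences of the two high priority bugs among the real failures.
bug_counts = {"1283522": 16, "1265495": 9}


def percent(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


print(f"real failures: {real} of {runs} runs "
      f"({percent(real, runs):.1f}%)")
for bug, count in bug_counts.items():
    print(f"bug {bug}: {percent(count, real):.1f}% of failures, "
          f"{percent(count, runs):.2f}% globally")
```

The same dictionary could be extended with the minor bugs listed below once
their exact occurrence counts are known.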

And several minor bugs (affecting tempest and/or neutron)
Each one of the following bugs was found no more than twice in our analysis:
3) https://bugs.launchpad.net/neutron/+bug/1254890 (possibly a nova bug,
but it hit the neutron full job once)
4) https://bugs.launchpad.net/neutron/+bug/1283599
5) https://bugs.launchpad.net/neutron/+bug/1277439
6) https://bugs.launchpad.net/neutron/+bug/1253896
7) https://bugs.launchpad.net/neutron/+bug/1253533
8) https://bugs.launchpad.net/tempest/+bug/1283535 (possibly not a neutron
bug)
9) https://bugs.launchpad.net/tempest/+bug/1253993 (need to devise new
solutions for improving agent loop times)
   * there is already a patch under review for bulking device details
requests
10) https://bugs.launchpad.net/neutron/+bug/1283518

In my humble opinion, it is therefore important to immediately have a plan
for ensuring bugs #1 and #2 are solved, or at least consistently mitigated,
by Icehouse. It would also be good to identify assignees for bugs #3 to
#10.

Regards,
Salvatore


On 21 February 2014 14:44, Sean Dague <sean at dague.net> wrote:

> Yesterday during the QA meeting we realized that the neutron full job,
> which includes tenant isolation, and full parallelism, was passing quite
> often in the experimental queue. Which was actually news to most of us,
> as no one had been keeping a close eye on it.
>
> I moved that to a non-voting job on all projects. A spot check overnight
> is that it's failing about twice as often as the regular neutron job.
> Which is too high a failure rate to make it voting, but it's close.
>
> This would be the time for a final hard push by the neutron team to get
> to the bottom of these failures to bring the pass rate to the level of
> the existing neutron job, then we could make neutron full voting.
>
> This is a *huge* move forward from where things were at the Havana
> summit. I want to thank the Neutron team for getting so aggressive about
> getting this testing working. I was skeptical we could get there within
> the cycle, but a last push could actually get us neutron parity in the
> gate by i3.
>
>         -Sean
>
> --
> Sean Dague
> Samsung Research America
> sean at dague.net / sean.dague at samsung.com
> http://dague.net