[openstack-dev] [qa] [neutron] Neutron Full Parallel job very close to voting - call to arms by neutron team
Rossella Sblendido
rsblendido at suse.com
Mon Feb 24 11:14:48 UTC 2014
Ciao Salvatore,
thanks a lot for analyzing the failures!
This link is not working for me:
7) https://bugs.launchpad.net/neutron/+bug/1253533
I took a minor bug that was not assigned. Most of the bugs are assigned
to you; I was wondering if you could use some help. I guess we can
coordinate better when you are online.
cheers,
Rossella
On 02/23/2014 03:14 AM, Salvatore Orlando wrote:
> I have tried to collect more information on neutron full job failures.
>
> So far there have been 219 failures and 891 successes, for an overall
> failure rate of about 19.7%, which is in line with Sean's evaluation.
> The count was performed exclusively on jobs executed against the master
> branch. The failure rate for stable/havana is higher; indeed the job
> there still triggers bug 1273386, as it performs nbd mounting, and
> several fixes for the l2/l3 agents were not backported (or not
> backportable).
>
> It is worth noting that some of the failures were actually due to
> infra issues. Unfortunately, it is not obvious to me how to define a
> logstash query to filter those out. Nevertheless, it is better to err
> on the side of safety and estimate the failure rate to be about 20%.
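>
> For clarity, here is the arithmetic behind that figure as a tiny
> Python snippet (a back-of-the-envelope check using only the counts
> quoted above; nothing here comes from logstash or elastic-recheck):
>
>     # Failure rate of the neutron full job on master, from the raw counts.
>     failures, successes = 219, 891
>     total = failures + successes                    # 1110 jobs counted
>     print("failure rate: %.1f%%" % (100.0 * failures / total))  # ~19.7%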
>
> I then classified 63 failures, finding the following:
> - 25 failures were due to infra issues and 1 failure was due to a flaw
> in a patch, leaving 37 "real" failures to analyse
>     * In the same timeframe 203 jobs succeeded, giving a potential
> failure rate of 15.7% after excluding infra issues
> - 2 bugs were responsible for 25 of these 37 failures
>     * they are the "SSH protocol banner issue" and the well-known DB
> lock timeouts
> - bug 1253896 (the infamous SSH timeout bug) was hit only twice. The
> elastic recheck count is much higher because failures for the SSH
> protocol banner error (1265495) are being classified as bug 1253896.
> * actually in the past 48 hours only 2 voting neutron jobs hit this
> failure. This is probably a great improvement compared with a few
> weeks ago.
> - Some failures are due to bugs already known and tracked; other
> failures are due to bugs either unforeseen so far or not yet tracked.
> In the latter case a bug report has been filed.
>
> It seems therefore that there are two high priority bugs to address:
> 1) https://bugs.launchpad.net/neutron/+bug/1283522 (16 occurrences,
> 43.2% of failures, 6.67% globally; the arithmetic behind these
> percentages is sketched after the bug list below)
>     * Check whether we can resume the discussion about splitting the
> API server and the RPC server
> 2) https://bugs.launchpad.net/neutron/+bug/1265495 (9/37 = 24.3% of
> failures, 3.75% globally)
>
> And there are several minor bugs (affecting tempest and/or neutron).
> Each one of the following bugs was found no more than twice in our
> analysis:
> 3) https://bugs.launchpad.net/neutron/+bug/1254890 (possibly a nova
> bug, but it hit the neutron full job once)
> 4) https://bugs.launchpad.net/neutron/+bug/1283599
> 5) https://bugs.launchpad.net/neutron/+bug/1277439
> 6) https://bugs.launchpad.net/neutron/+bug/1253896
> 7) https://bugs.launchpad.net/neutron/+bug/1253533
> 8) https://bugs.launchpad.net/tempest/+bug/1283535 (possibly not a
> neutron bug)
> 9) https://bugs.launchpad.net/tempest/+bug/1253993 (need to devise new
> solutions for improving agent loop times)
> * there is already a patch under review for bulking device details
> requests
> 10) https://bugs.launchpad.net/neutron/+bug/1283518
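>
> For reference, here is how the percentages quoted for bugs #1 and #2
> are derived (a small sketch; the 240-job denominator for the "global"
> figures is my reading of the 37 real failures plus 203 successes
> above, not something reported by elastic-recheck):
>
>     # Share of classified failures and share of all analysed jobs
>     # for the two top offenders.
>     real_failures, successes = 37, 203
>     analysed = real_failures + successes        # 240 jobs classified
>     for bug, hits in (("1283522", 16), ("1265495", 9)):
>         print("bug %s: %.1f%% of failures, %.2f%% globally"
>               % (bug, 100.0 * hits / real_failures, 100.0 * hits / analysed))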
>
> In my humble opinion, it is therefore important to immediately put a
> plan in place for ensuring bugs #1 and #2 are solved, or at least
> consistently mitigated, by icehouse. It would also be good to identify
> assignees for bugs #3 to #10.
>
> Regards,
> Salvatore
>
>
> On 21 February 2014 14:44, Sean Dague <sean at dague.net> wrote:
>
>     Yesterday during the QA meeting we realized that the neutron full
>     job, which includes tenant isolation and full parallelism, was
>     passing quite often in the experimental queue. That was actually
>     news to most of us, as no one had been keeping a close eye on it.
>
>     I moved that to a non-voting job on all projects. A spot check
>     overnight shows that it's failing about twice as often as the
>     regular neutron job, which is too high a failure rate to make it
>     voting, but it's close.
>
>     This would be the time for a final hard push by the neutron team
>     to get to the bottom of these failures and bring the pass rate to
>     the level of the existing neutron job; then we could make neutron
>     full voting.
>
> This is a *huge* move forward from where things were at the Havana
> summit. I want to thank the Neutron team for getting so aggressive
> about
> getting this testing working. I was skeptical we could get there
> within
> the cycle, but a last push could actually get us neutron parity in the
> gate by i3.
>
> -Sean
>
> --
> Sean Dague
> Samsung Research America
>     sean at dague.net / sean.dague at samsung.com
> http://dague.net
>
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev