[openstack-dev] [qa] [neutron] Neutron Full Parallel job very close to voting - call to arms by neutron team

Rossella Sblendido rsblendido at suse.com
Mon Feb 24 11:14:48 UTC 2014


Ciao Salvatore,

thanks a lot for analyzing the failures!

This link is not working for me:
7) https://bugs.launchpad.net/neutron/+bug/1253533

I took a minor bug that was not assigned. Most of the bugs are assigned 
to you; I was wondering if you could use some help. I guess we can 
coordinate better when you are online.

cheers,

Rossella

On 02/23/2014 03:14 AM, Salvatore Orlando wrote:
> I have tried to collect more information on neutron full job failures.
>
> So far there have been 219 failures and 891 successes, for an overall 
> failure rate of about 19.8%, which is in line with Sean's evaluation.
> The count was performed exclusively on jobs executed against the master 
> branch. The failure rate for stable/havana is higher; indeed the job 
> there still triggers bug 1273386 as it performs nbd mounting, and 
> several fixes for the l2/l3 agents were not backported (or not 
> backportable).
>
> It is worth noting that some of the failures were actually caused by 
> infra issues. Unfortunately, it is not obvious to me how to define a 
> logstash query that would isolate those. Nevertheless, it is better to 
> err on the side of safety and estimate the failure rate to be about 20%.
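>
> As a quick sanity check, here is a minimal Python sketch of the
> arithmetic behind that estimate, using only the raw counts quoted
> above (nothing below comes from the actual logstash data):
>
>     # Raw counts quoted above for the master-branch neutron full job.
>     failures, successes = 219, 891
>     total = failures + successes
>     # 219 / 1110 ~= 19.7%, consistent with the ~20% failure estimate above.
>     print("overall failure rate: %.1f%%" % (100.0 * failures / total))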
>
> I then classified 63 failures and found the following:
> - 25 failures were due to infra issues and 1 failure was due to a flaw in a 
> patch, leaving 37 "real" failures to analyse
>    * In the same timeframe 203 jobs succeeded, giving a potential 
> failure rate, after excluding infra issues, of about 15.7%
> - 2 bugs were responsible for 25 of these 37 failures
>    * they are the "SSH protocol banner" issue and the well-known DB 
> lock timeouts
> - bug 1253896 (the infamous SSH timeout bug) was hit only twice. The 
> elastic-recheck count is much higher because failures caused by the SSH 
> protocol banner error (1265495) are being classified as bug 1253896.
>    * In the past 48 hours only 2 voting neutron jobs hit this 
> failure, which is probably a great improvement compared with a few 
> weeks ago.
> - Some failures are due to bugs that are already known and tracked; 
> other failures are due to bugs that were so far unforeseen or not 
> tracked. In the latter case a bug report has been filed.
>
> It seems therefore that there are two high priority bugs to address:
> 1) https://bugs.launchpad.net/neutron/+bug/1283522 (16 occurrences, 
> 43.2% of failures, 6.67% globally)
>     * Check whether we can resume the discussion about splitting the 
> API server from the RPC server
> 2) https://bugs.launchpad.net/neutron/+bug/1265495 (9/37 = 24.3% of 
> failures, 3.75% globally)
>
> There are also several minor bugs (affecting tempest and/or neutron). 
> Each of the following bugs was hit no more than twice in our 
> analysis:
> 3) https://bugs.launchpad.net/neutron/+bug/1254890 (possibly a nova 
> bug, but it hit the neutron full job once)
> 4) https://bugs.launchpad.net/neutron/+bug/1283599
> 5) https://bugs.launchpad.net/neutron/+bug/1277439
> 6) https://bugs.launchpad.net/neutron/+bug/1253896
> 7) https://bugs.launchpad.net/neutron/+bug/1253533
> 8) https://bugs.launchpad.net/tempest/+bug/1283535 (possibly not a 
> neutron bug)
> 9) https://bugs.launchpad.net/tempest/+bug/1253993 (need to devise new 
> solutions for improving agent loop times)
>    * there is already a patch under review for bulking device details 
> requests
> 10) https://bugs.launchpad.net/neutron/+bug/1283518
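>
> For reference, a small Python sketch of how the percentages above are
> derived; note that the 240-job denominator behind the "globally"
> figures is my inference (37 real failures + 203 successes in the same
> timeframe), not something stated explicitly in the analysis:
>
>     real_failures = 37     # non-infra failures analysed
>     successes = 203        # successful jobs in the same timeframe
>     jobs = real_failures + successes   # 240 jobs considered
>
>     for bug, hits in [("1283522", 16), ("1265495", 9)]:
>         print("bug %s: %.1f%% of failures, %.2f%% globally"
>               % (bug, 100.0 * hits / real_failures, 100.0 * hits / jobs))
>     # -> bug 1283522: 43.2% of failures, 6.67% globally
>     # -> bug 1265495: 24.3% of failures, 3.75% globally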
>
> In my humble opinion, it is therefore important to put a plan in place 
> immediately to ensure that bugs #1 and #2 are solved, or at least 
> consistently mitigated, by Icehouse. It would also be good to identify 
> assignees for bugs #3 to #10.
>
> Regards,
> Salvatore
>
>
> On 21 February 2014 14:44, Sean Dague <sean at dague.net> wrote:
>
>     Yesterday during the QA meeting we realized that the neutron full job,
>     which includes tenant isolation and full parallelism, was passing quite
>     often in the experimental queue. That was actually news to most of us,
>     as no one had been keeping a close eye on it.
>
>     I moved that to a non-voting job on all projects. An overnight spot
>     check shows that it's failing about twice as often as the regular
>     neutron job. That is too high a failure rate to make it voting, but
>     it's close.
>
>     This would be the time for a final hard push by the neutron team to
>     get to the bottom of these failures and bring the pass rate to the
>     level of the existing neutron job; then we could make neutron full
>     voting.
>
>     This is a *huge* move forward from where things were at the Havana
>     summit. I want to thank the Neutron team for getting so aggressive
>     about
>     getting this testing working. I was skeptical we could get there
>     within
>     the cycle, but a last push could actually get us neutron parity in the
>     gate by i3.
>
>             -Sean
>
>     --
>     Sean Dague
>     Samsung Research America
>     sean at dague.net / sean.dague at samsung.com
>     http://dague.net
>
>
>     _______________________________________________
>     OpenStack-dev mailing list
>     OpenStack-dev at lists.openstack.org
>     http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
>
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


