[openstack-dev] [Neutron][qa] Intermittent failure of tempest test test_network_basic_ops

Sukhdev Kapur sukhdevkapur at gmail.com
Thu Jan 9 19:55:09 UTC 2014


Thanks Salvatore and Jay for sharing your experiences on this issue.

I will also look through the references you have provided to understand this
further. If I latch onto something, I will report back.

BTW, before posting the question here, I did suspect some race conditions
and tried to play around with the timings of some of the events - nothing
really helped :-(
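
In case anyone wants to reproduce the pass-rate numbers from my original
question quoted below, a crude loop like the following is enough. The test
command is only a placeholder - substitute whatever you normally use to run
a single tempest test in your environment (testr, tox, etc.):

    # Rerun the single scenario test N times and tally exit codes.
    # CMD is a placeholder; replace it with your usual way of invoking
    # one tempest test.
    import subprocess

    CMD = ["testr", "run", "tempest.scenario.test_network_basic_ops"]

    passed = failed = 0
    for _ in range(50):
        if subprocess.call(CMD) == 0:
            passed += 1
        else:
            failed += 1

    total = passed + failed
    print("pass rate: %d/%d (%.0f%%)" % (passed, total,
                                         100.0 * passed / total))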


regards..
-Sukhdev



On Thu, Jan 9, 2014 at 10:38 AM, Salvatore Orlando <sorlando at nicira.com>wrote:

> Hi Jay,
>
> replies inline.
> I have probably found one more cause for this issue in the logs, and
> I have added a comment to the bug report.
>
> Salvatore
>
>
> On 9 January 2014 19:10, Jay Pipes <jaypipes at gmail.com> wrote:
>
>> On Thu, 2014-01-09 at 09:09 +0100, Salvatore Orlando wrote:
>> > I am afraid I need to correct you Jay!
>>
>> I always welcome corrections to things I've gotten wrong, so no worries
>> at all!
>>
>> > This actually appears to be bug 1253896 [1]
>>
>> Ah, the infamous "SSH bug" :) Yeah, so last night I spent a few hours
>> digging through log files and running a variety of elastic-recheck (e-r)
>> queries, trying to find some patterns for the bugs that Joe G had sent an
>> ML post about.
>>
>> I went round in circles, unfortunately :( When I thought I'd found a
>> pattern, invariably I would doubt my initial findings and wander into
>> new areas in a wild goose chase.
>>
>
> that's pretty much what I do all the time.
>
>>
>> At various times, I thought something was up with the DHCP agent, as
>> there were lots of "No DHCP Agent found" errors in the q-dhcp screen
>> logs. But I could not correlate those errors with the failures in
>> the 4 bugs.
>>
>
> I've seen those warnings as well. They are pretty common, and I think they
> are actually benign: since the DHCP for the network is configured
> asynchronously, it is probably normal to see that message.
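> To give an idea of what I mean by asynchronous: rather than reacting to that
> warning, a test (or script) can simply poll until a DHCP agent is actually
> bound to the network. A rough sketch with python-neutronclient - the
> credentials and keystone endpoint below are placeholders:
>
>     # Poll until a DHCP agent has been scheduled to the network; the
>     # q-dhcp warning is transient while this has not happened yet.
>     import time
>     from neutronclient.v2_0 import client as neutron_client
>
>     def wait_for_dhcp_agent(net_id, timeout=60, interval=2):
>         neutron = neutron_client.Client(
>             username='admin', password='secret', tenant_name='admin',
>             auth_url='http://127.0.0.1:5000/v2.0')
>         deadline = time.time() + timeout
>         while time.time() < deadline:
>             agents = neutron.list_dhcp_agent_hosting_networks(
>                 net_id)['agents']
>             if agents:
>                 return agents
>             time.sleep(interval)
>         raise RuntimeError("no DHCP agent bound to network %s within %ss"
>                            % (net_id, timeout))
>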
>
>>
>> Then I started thinking that there was a timing/race condition where a
>> security group was being added to the Nova-side servers cache before it
>> had actually been fully constructed on the Neutron side. But I was not
>> able to fully track down the many, many debug messages that are involved
>> in the full sequence of VM launch :( At around 4am, I gave up and went
>> to bed...
>>
>
> I have not investigated how this could impact connectivity. However, one
> thing that is not ok in my opinion is that we have no way of knowing whether
> a security group is enforced or not; I think it needs an 'operational
> status'.
> Note: we're working on a patch for the nicira plugin to add this concept;
> it's currently being developed as a plugin-specific extension, but if there
> is interest in supporting the concept in the ml2 plugin as well, I think we
> can just make it part of the 'core' security group API.
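> Just to illustrate the idea (this is not the actual patch; the names and the
> attribute map below are only a sketch following the usual extension pattern):
>
>     # Hypothetical extension adding a read-only 'operational_status'
>     # attribute to security groups, so callers can tell whether the
>     # group has actually been enforced on the backend.
>     from neutron.api import extensions
>
>     OPERATIONAL_STATUS = 'operational_status'
>
>     EXTENDED_ATTRIBUTES_2_0 = {
>         'security_groups': {
>             OPERATIONAL_STATUS: {'allow_post': False,
>                                  'allow_put': False,
>                                  'is_visible': True},
>         }
>     }
>
>     class Secgroupstatus(extensions.ExtensionDescriptor):
>         """Operational status for security groups."""
>
>         @classmethod
>         def get_name(cls):
>             return "Security Group Operational Status"
>
>         @classmethod
>         def get_alias(cls):
>             return "secgroup-status"
>
>         @classmethod
>         def get_description(cls):
>             return ("Exposes whether a security group has actually been "
>                     "enforced on the backend.")
>
>         @classmethod
>         def get_namespace(cls):
>             return "http://docs.openstack.org/ext/secgroup-status/api/v1.0"
>
>         @classmethod
>         def get_updated(cls):
>             return "2014-01-09T00:00:00-00:00"
>
>         def get_extended_resources(self, version):
>             if version == "2.0":
>                 return EXTENDED_ATTRIBUTES_2_0
>             return {}
>
> The plugin returning the security groups would then be responsible for
> filling in the attribute (e.g. ACTIVE vs PENDING_CREATE).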
>
>
>> > Technically, what we call a 'bug' here is actually a failure
>> > manifestation.
>> > So far, we have removed several bugs causing this failure. The last
>> > patch was pushed to devstack around Christmas.
>> > Nevertheless, if you look at recent comments and Joe's email, we still
>> > have a non-negligible failure rate on the gate.
>>
>> Understood. I actually suspect that some of the performance
>> improvements from Phil Day and others around optimizing certain server
>> and secgroup list calls have made the underlying race conditions show up
>> more often -- since the list calls now complete much faster, which
>> ironically gives Neutron less time to complete its setup operations!
>>
>
> That might be one explanation. The other might be that we added another
> scenario test for neutron which creates more VMs with floating IPs and
> other resources, thus increasing the chances of hitting the timeout failure.
>
>>
>> So, a performance patch on the Nova side ends up putting more pressure
>> on the Neutron side, which causes the rate of occurrence for these
>> sticky bugs (with potentially many root causes) to spike.
>>
>> Such is life I guess :)
>>
>> > It is also worth mentioning that if you are running your tests with
>> > parallelism enabled (i.e. you're running tempest with tox -esmoke
>> > rather than tox -esmokeserial), you will see a higher occurrence of
>> > this failure, because more bugs can cause it. These bugs are due to
>> > some weaknesses in the OVS agent that we are addressing with patches
>> > for blueprint neutron-tempest-parallel [2].
>>
>> Interesting. If you wouldn't mind, what makes you think this is a
>> weakness in the OVS agent? I would certainly appreciate your expertise
>> in this area, since it would help me in my own bug-searching endeavors.
>>
>>
> Basically those are all the patches addressing the linked blueprint; I
> have added more info in the commit messages for the patches.
> Some of those patches also target this bug:
> https://bugs.launchpad.net/neutron/+bug/1253993
>
>
>> All the best,
>> -jay
>>
>> > Regards,
>> > Salvatore
>> >
>> >
>> >
>> >
>> > [1] https://bugs.launchpad.net/neutron/+bug/1253896
>> > [2]
>> https://blueprints.launchpad.net/neutron/+spec/neutron-tempest-parallel
>> >
>> >
>> > On 9 January 2014 05:38, Jay Pipes <jaypipes at gmail.com> wrote:
>> >         On Wed, 2014-01-08 at 18:46 -0800, Sukhdev Kapur wrote:
>> >         > Dear fellow developers,
>> >
>> >         > I am running a few Neutron tempest tests and noticing an
>> >         > intermittent failure of
>> >         > tempest.scenario.test_network_basic_ops.
>> >
>> >         > I ran this test 50+ times and am getting intermittent
>> >         > failures. The pass rate is approx. 70%; the other 30% of the
>> >         > time it fails, mostly in _check_public_network_connectivity.
>> >
>> >         > Has anybody seen this?
>> >         > If there is a fix or workaround for this, please share your
>> >         > wisdom.
>> >
>> >
>> >         Unfortunately, I believe you are running into this bug:
>> >
>> >         https://bugs.launchpad.net/nova/+bug/1254890
>> >
>> >         The bug is Triaged in Nova (meaning, there is a suggested fix
>> >         in the bug
>> >         report). It's currently affecting the gate negatively and is
>> >         certainly
>> >         on the radar of the various PTLs affected.
>> >
>> >         Best,
>> >         -jay
>> >
>> >
>> >