[openstack-dev] [Neutron][qa] Intermittent failure of tempest test test_network_basic_ops

Salvatore Orlando sorlando at nicira.com
Thu Jan 9 18:38:27 UTC 2014


Hi Jay,

replies inline.
I have probably found one more cause for this issue in the logs, and I
have added a comment to the bug report.

Salvatore


On 9 January 2014 19:10, Jay Pipes <jaypipes at gmail.com> wrote:

> On Thu, 2014-01-09 at 09:09 +0100, Salvatore Orlando wrote:
> > I am afraid I need to correct you Jay!
>
> I always welcome corrections to things I've gotten wrong, so no worries
> at all!
>
> > This actually appears to be bug 1253896 [1]
>
> Ah, the infamous "SSH bug" :) Yeah, so last night I spent a few hours
> digging through log files and running a variety of e-r queries trying to
> find some patterns for the bugs that Joe G had sent an ML post about.
>
> I went round in circles, unfortunately :( When I thought I'd found a
> pattern, invariably I would doubt my initial findings and wander into
> new areas in a wild goose chase.
>

That's pretty much what I do all the time.

>
> At various times, I thought something was up with the DHCP agent, as
> there were lots of "No DHCP Agent found" errors in the q-dhcp screen
> logs. But I could not correlate any relationship with the failures in
> the 4 bugs.
>

I've seen those warnings as well. They are pretty common, and I think they
are actually benign: since DHCP for the network is configured
asynchronously, it is probably normal to see that message.
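For what it's worth, a quick way to sanity-check this is to count those
warnings in a q-dhcp screen log and see whether each one is followed by a
successful sync; a minimal sketch (the log lines below are made up for
illustration, not real gate output):

```python
# Hypothetical excerpt from a q-dhcp screen log, for illustration only.
log_lines = [
    "2014-01-09 18:00:01 WARNING No DHCP agent found for network net-1",
    "2014-01-09 18:00:02 INFO Synchronizing DHCP state",
    "2014-01-09 18:00:03 WARNING No DHCP agent found for network net-2",
    "2014-01-09 18:00:04 INFO Agent dhcp-agent-1 now hosting net-1",
]

# Count the benign warnings; a network is wired up to an agent
# asynchronously, so a warning followed by a successful sync is
# expected noise rather than a failure.
warnings = [line for line in log_lines if "No DHCP agent found" in line]
print(len(warnings))  # 2
```

If the count of these warnings did not correlate with the failing runs,
that would support the idea that they are unrelated to the 4 bugs.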

>
> Then I started thinking that there was a timing/race condition where a
> security group was being added to the Nova-side servers cache before it
> had actually been constructed fully on the Neutron-side. But I was not
> able to fully track down the many, many debug messages that are involved
> in the full sequence of VM launch :( At around 4am, I gave up and went
> to bed...
>

I have not investigated how this could impact connectivity. However, one
thing that is not OK in my opinion is that we have no way to know whether
a security group is enforced or not; I think it needs an 'operational
status'.
Note: we're working on a patch for the nicira plugin to add this concept;
it's currently being developed as a plugin-specific extension, but if there
is interest in supporting the concept in the ml2 plugin as well, I think we
can just make it part of the 'core' security group API.
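To make the idea concrete, here is roughly the shape such an extension
could take on the resource; the field name and values below are purely
hypothetical sketches, not what our patch actually does:

```python
# Purely illustrative: a security group resource extended with a
# hypothetical operational status field, mirroring the status field
# that networks and ports already expose (e.g. BUILD/ACTIVE).
security_group = {
    "id": "e4f50856-e9f8-4fae-bcad-f0f3ed8fc29d",
    "name": "default",
    "tenant_id": "a1b2c3",
    "security_group_rules": [],
    # Hypothetical: BUILD until the backend has actually wired up
    # the rules on the hypervisor, then ACTIVE.
    "operational_status": "BUILD",
}

def is_enforced(sg):
    """A test (or Nova) could wait on this instead of racing the backend."""
    return sg.get("operational_status") == "ACTIVE"

print(is_enforced(security_group))  # False
```

The point is that a caller would have something to poll, rather than
assuming the group is enforced as soon as the create call returns.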


> > Technically, what we call 'bug' here is actually a failure
> > manifestation.
> > So far, we have removed several bugs causing this failure. The last
> > patch was pushed to devstack around Christmas.
> > Nevertheless, if you look at recent comments and Joe's email, we still
> > have a non-negligible failure rate on the gate.
>
> Understood. I suspect actually that some of the various performance
> improvements from Phil Day and others around optimizing certain server
> and secgroup list calls have made the underlying race conditions show up
> more often -- since the list calls are completing much faster, which
> ironically gives Neutron less time to complete setup operations!
>

That might be one explanation. Another might be that we added another
scenario test for Neutron which creates more VMs with floating IPs, thus
increasing the chances of hitting the timeout failure.

>
> So, a performance patch on the Nova side ends up putting more pressure
> on the Neutron side, which causes the rate of occurrence for these
> sticky bugs (with potentially many root causes) to spike.
>
> Such is life I guess :)
>
> > It is also worth mentioning that if you are running your tests with
> > parallelism enabled (ie: you're running tempest with tox -esmoke
> > rather than tox -esmokeserial) you will end up with a higher
> > occurrence of this failure due to more bugs causing it. These bugs are
> > due to some weakness in the OVS agent that we are addressing with
> > patches for blueprint neutron-tempest-parallel [2].
>
> Interesting. If you wouldn't mind, what makes you think this is a
> weakness in the OVS agent? I would certainly appreciate your expertise
> in this area, since it would help me in my own bug-searching endeavors.
>
>
Basically those are all the patches addressing the linked blueprint; I have
added more info in the commit messages for the patches.
Some of those patches also target this bug:
https://bugs.launchpad.net/neutron/+bug/1253993


> All the best,
> -jay
>
> > Regards,
> > Salvatore
> >
> >
> >
> >
> > [1] https://bugs.launchpad.net/neutron/+bug/1253896
> > [2]
> https://blueprints.launchpad.net/neutron/+spec/neutron-tempest-parallel
> >
> >
> > On 9 January 2014 05:38, Jay Pipes <jaypipes at gmail.com> wrote:
> >         On Wed, 2014-01-08 at 18:46 -0800, Sukhdev Kapur wrote:
> >         > Dear fellow developers,
> >
> >         > I am running few Neutron tempest tests and noticing an
> >         intermittent
> >         > failure of tempest.scenario.test_network_basic_ops.
> >
> >         > I ran this test 50+ times and am getting intermittent
> >         failure. The
> >         > pass rate is approx. 70%. 30% of the time it fails mostly
> >         in
> >         > _check_public_network_connectivity.
> >
> >         > Has anybody seen this?
> >         > If there is a fix or work around for this, please share your
> >         wisdom.
> >
> >
> >         Unfortunately, I believe you are running into this bug:
> >
> >         https://bugs.launchpad.net/nova/+bug/1254890
> >
> >         The bug is Triaged in Nova (meaning, there is a suggested fix
> >         in the bug
> >         report). It's currently affecting the gate negatively and is
> >         certainly
> >         on the radar of the various PTLs affected.
> >
> >         Best,
> >         -jay
> >
> >
> >
> >
> >
>
>
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>

