<div dir="ltr">Thanks Salvatore and Jay for sharing your experiences on this issue. <div style><br></div><div style>I will look through the references you have provided to understand further as well. </div><div style>If I latch onto something, I will share back. <br>
</div><div style><br></div><div style>BTW, before posting the question here, I did suspect some race conditions and tried adjusting the timing of some of the events, but nothing really helped :-(</div><div style><br></div>
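<div style>To judge whether a timing change actually moves the failure rate, it may help to tally results over many runs rather than rely on a handful. The following is only a minimal sketch: <code>run_once</code> is a hypothetical callable that wraps one test run, not an existing tempest helper.</div>

```python
def pass_rate(run_once, attempts=50):
    """Call run_once() `attempts` times and return (passes, failures, rate).

    run_once is a hypothetical callable (an assumption for illustration):
    it should return True on success and False, or raise, on failure.
    """
    passes = 0
    for _ in range(attempts):
        try:
            if run_once():
                passes += 1
        except Exception:
            pass  # any exception counts as a failed run
    failures = attempts - passes
    return passes, failures, passes / attempts


# Simulated outcomes standing in for real test runs: 7 passes, 3 failures.
outcomes = iter([True] * 7 + [False] * 3)
summary = pass_rate(lambda: next(outcomes), attempts=10)
```

<div style>Comparing the rate before and after a change, over the same number of attempts, gives a rough idea of whether the change made any difference.</div>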
<div style><br></div><div style>regards..</div><div style>-Sukhdev</div><div style><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Jan 9, 2014 at 10:38 AM, Salvatore Orlando <span dir="ltr"><<a href="mailto:sorlando@nicira.com" target="_blank">sorlando@nicira.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi Jay,<div><br></div><div>replies inline.</div><div>I have probably found one more cause for this issue in the logs, and I have added a comment to the bug report.</div>
<div><br></div><div>Salvatore<br>
<div class="gmail_extra"><br><br><div class="gmail_quote"><div class="im">On 9 January 2014 19:10, Jay Pipes <span dir="ltr"><<a href="mailto:jaypipes@gmail.com" target="_blank">jaypipes@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div>On Thu, 2014-01-09 at 09:09 +0100, Salvatore Orlando wrote:<br>
> I am afraid I need to correct you Jay!<br>
<br>
</div>I always welcome corrections to things I've gotten wrong, so no worries<br>
at all!<br>
<div><br>
> This actually appears to be bug 1253896 [1]<br>
<br>
</div>Ah, the infamous "SSH bug" :) Yeah, so last night I spent a few hours<br>
digging through log files and running a variety of e-r queries trying to<br>
find some patterns for the bugs that Joe G had sent an ML post about.<br>
<br>
I went round in circles, unfortunately :( When I thought I'd found a<br>
pattern, invariably I would doubt my initial findings and wander into<br>
new areas in a wild goose chase.<br></blockquote><div><br></div></div><div>that's pretty much what I do all the time. </div><div class="im"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<br>
At various times, I thought something was up with the DHCP agent, as<br>
there were lots of "No DHCP Agent found" errors in the q-dhcp screen<br>
logs. But I could not correlate any relationship with the failures in<br>
the 4 bugs.<br></blockquote><div><br></div></div><div>I've seen those warnings as well. They are pretty common, and I think they are actually benign: since DHCP for the network is configured asynchronously, it is probably normal to see that message. </div>
<div class="im"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<br>
Then I started thinking that there was a timing/race condition where a<br>
security group was being added to the Nova-side servers cache before it<br>
had actually been constructed fully on the Neutron-side. But I was not<br>
able to fully track down the many, many debug messages that are involved<br>
in the full sequence of VM launch :( At around 4am, I gave up and went<br>
to bed...<br></blockquote><div><br></div></div><div>I have not investigated how this could impact connectivity. However, one thing that is not OK in my opinion is that we have no way to know whether a security group is enforced or not; I think it needs an 'operational status'.</div>
<div>Note: we're working on a patch for the nicira plugin to add this concept; it's currently being developed as a plugin-specific extension, but if there is interest in supporting the concept in the ml2 plugin as well, I think we can just make it part of the 'core' security group API.</div>
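<div>To illustrate why an 'operational status' would help: a client could poll it and only proceed (for instance, to a connectivity check) once the group is reported as enforced, instead of guessing at timings. This is only a sketch under assumptions: <code>get_status</code> is a hypothetical caller-supplied callable and the "ACTIVE" value is illustrative; neither is an existing Neutron or Nova API.</div>

```python
import time


def wait_for_status(get_status, wanted="ACTIVE", timeout=60.0, interval=1.0):
    """Poll get_status() until it returns `wanted` or `timeout` seconds pass.

    get_status is a hypothetical zero-argument callable (an assumption for
    illustration); it stands in for whatever would report the status.
    Returns True if the wanted status was seen, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_status() == wanted:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Simulated backend that reports BUILD twice before becoming ACTIVE.
states = iter(["BUILD", "BUILD", "ACTIVE"])
enforced = wait_for_status(lambda: next(states), timeout=5.0, interval=0.01)
```

<div>With an explicit deadline, a slow backend shows up as a clear timeout rather than as a connectivity failure whose cause has to be reconstructed from the logs.</div>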
<div class="im">
<div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div><br>
> Technically, what we call 'bug' here is actually a failure<br>
> manifestation.<br>
> So far, we have removed several bugs causing this failure. The last<br>
> patch was pushed to devstack around Christmas.<br>
> Nevertheless, if you look at recent comments and Joe's email, we still<br>
> have a non-negligible failure rate on the gate.<br>
<br>
</div>Understood. I suspect actually that some of the various performance<br>
improvements from Phil Day and others around optimizing certain server<br>
and secgroup list calls have made the underlying race conditions show up<br>
more often -- since the list calls are completing much faster, which<br>
ironically gives Neutron less time to complete setup operations!<br></blockquote><div><br></div></div><div>That might be one explanation. Another might be that we added another scenario test for Neutron which creates more VMs with floating IPs, thus increasing the chances of hitting the timeout failure. </div>
<div class="im">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<br>
So, a performance patch on the Nova side ends up putting more pressure<br>
on the Neutron side, which causes the rate of occurrence for these<br>
sticky bugs (with potentially many root causes) to spike.<br>
<br>
Such is life I guess :)<br>
<div><br>
> It is also worth mentioning that if you are running your tests with<br>
> parallelism enabled (ie: you're running tempest with tox -esmoke<br>
> rather than tox -esmokeserial) you will end up with a higher<br>
> occurrence of this failure due to more bugs causing it. These bugs are<br>
> due to some weakness in the OVS agent that we are addressing with<br>
> patches for blueprint neutron-tempest-parallel [2].<br>
<br>
</div>Interesting. If you wouldn't mind, what makes you think this is a<br>
weakness in the OVS agent? I would certainly appreciate your expertise<br>
in this area, since it would help me in my own bug-searching endeavors.<br>
<br></blockquote><div><br></div></div><div>Basically those are all the patches addressing the linked blueprint; I have added more info in the commit messages for the patches.</div><div>Also some of those patches target this bug as well: <a href="https://bugs.launchpad.net/neutron/+bug/1253993" target="_blank">https://bugs.launchpad.net/neutron/+bug/1253993</a></div>
<div><div class="h5">
<div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
All the best,<br>
-jay<br>
<div><div><br>
> Regards,<br>
> Salvatore<br>
><br>
><br>
><br>
><br>
> [1] <a href="https://bugs.launchpad.net/neutron/+bug/1253896" target="_blank">https://bugs.launchpad.net/neutron/+bug/1253896</a><br>
> [2] <a href="https://blueprints.launchpad.net/neutron/+spec/neutron-tempest-parallel" target="_blank">https://blueprints.launchpad.net/neutron/+spec/neutron-tempest-parallel</a><br>
><br>
><br>
> On 9 January 2014 05:38, Jay Pipes <<a href="mailto:jaypipes@gmail.com" target="_blank">jaypipes@gmail.com</a>> wrote:<br>
> On Wed, 2014-01-08 at 18:46 -0800, Sukhdev Kapur wrote:<br>
> > Dear fellow developers,<br>
><br>
> > I am running few Neutron tempest tests and noticing an<br>
> intermittent<br>
> > failure of tempest.scenario.test_network_basic_ops.<br>
><br>
> > I ran this test 50+ times and am getting intermittent<br>
> failure. The<br>
> > pass rate is apps. 70%. The 30% of the time it fails mostly<br>
> in<br>
> > _check_public_network_connectivity.<br>
><br>
> > Has anybody seen this?<br>
> > If there is a fix or work around for this, please share your<br>
> wisdom.<br>
><br>
><br>
> Unfortunately, I believe you are running into this bug:<br>
><br>
> <a href="https://bugs.launchpad.net/nova/+bug/1254890" target="_blank">https://bugs.launchpad.net/nova/+bug/1254890</a><br>
><br>
> The bug is Triaged in Nova (meaning, there is a suggested fix<br>
> in the bug<br>
> report). It's currently affecting the gate negatively and is<br>
> certainly<br>
> on the radar of the various PTLs affected.<br>
><br>
> Best,<br>
> -jay<br>
><br>
><br>
><br>
> _______________________________________________<br>
> OpenStack-dev mailing list<br>
> <a href="mailto:OpenStack-dev@lists.openstack.org" target="_blank">OpenStack-dev@lists.openstack.org</a><br>
> <a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>
><br>
><br>
<br>
<br>
<br>
</div></div></blockquote></div></div></div><br></div></div></div>
<br></blockquote></div><br></div>