<div dir="ltr"><div><span style="font-size:12.8000001907349px">Hi,</span><div style="font-size:12.8000001907349px"><br></div><div style="font-size:12.8000001907349px">Another brief update on the matter:</div><div style="font-size:12.8000001907349px"><br></div><div style="font-size:12.8000001907349px">Failure rate trends [1] are showing that unstable (w/ multiple API workers + pymysql driver) and stable configurations (w/o) are virtually aligned and I am proposing that it is time to drop the unstable infra configuration [2,3] that allowed the team to triage/experiment and get to a solution. I<span style="font-size:12.8000001907349px">'ll watch [1] a little longer before I think it's safe to claim that we're out of the woods.</span></div></div><div style="font-size:12.8000001907349px"><br></div><div style="font-size:12.8000001907349px"><span style="font-size:12.8000001907349px">Cheers,</span></div><div style="font-size:12.8000001907349px"><span style="font-size:12.8000001907349px">Armando </span><br></div><div style="font-size:12.8000001907349px"><br></div><div>[1] <a href="http://goo.gl/YM7gUC" target="_blank">http://goo.gl/YM7gUC</a><br></div><div>[2] <a href="https://review.openstack.org/#/c/199668/">https://review.openstack.org/#/c/199668/</a></div><div>[3] <a href="https://review.openstack.org/#/c/199672/">https://review.openstack.org/#/c/199672/</a></div></div><div class="gmail_extra"><br><div class="gmail_quote">On 22 June 2015 at 14:10, Armando M. <span dir="ltr"><<a href="mailto:armamig@gmail.com" target="_blank">armamig@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi,<div><br></div><div>A brief update on the issue that sparked this thread:</div><div><br></div><div>A little over a week ago, bug [1] was filed. 
The gist was that the switch to pymysql unveiled a number of latent race conditions that made Neutron unstable.</div><div><br></div><div>To try and nip these in the bud, the Neutron team filed a number of patches [2] to create an unstable configuration that would allow them to troubleshoot and experiment with a solution while still keeping stability in check (a preliminary proposal for a fix is available in [4]).</div><div><br></div><div>The latest failure rate trend is shown in [3]; as you can see, we're still gathering data, but it seems that the instability gap between the two jobs (stable vs unstable) has widened, which should give us plenty of data points to devise a resolution strategy.</div><div><br></div><div>I have documented the most frequently recurring traces in the bug report [1].</div><div><br></div><div>Will update once we manage to get the two curves to kiss each other again and settle at a more acceptable failure rate.</div><div><br></div><div>Cheers,</div><div>Armando</div><div><br></div><div>[1] <a href="https://bugs.launchpad.net/neutron/+bug/1464612" target="_blank">https://bugs.launchpad.net/neutron/+bug/1464612</a></div><div><div><div>[2] <a href="https://review.openstack.org/#/q/topic:neutron-unstable,n,z" target="_blank">https://review.openstack.org/#/q/topic:neutron-unstable,n,z</a></div><div>[3] <a href="http://goo.gl/YM7gUC" target="_blank">http://goo.gl/YM7gUC</a></div></div></div><div>[4] <a href="https://review.openstack.org/#/c/191540/" target="_blank">https://review.openstack.org/#/c/191540/</a></div><div><br></div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On 12 June 2015 at 11:13, Boris Pavlovic <span dir="ltr"><<a href="mailto:bpavlovic@mirantis.com" target="_blank">bpavlovic@mirantis.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Sean, <div><br></div><div>Thanks for the quick fix/revert 
<a href="https://review.openstack.org/#/c/191010/" target="_blank">https://review.openstack.org/#/c/191010/</a> </div><div>This unblocked Rally gates...</div><div><br></div><div>Best regards,</div><div>Boris Pavlovic </div></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jun 12, 2015 at 8:56 PM, Clint Byrum <span dir="ltr"><<a href="mailto:clint@fewbar.com" target="_blank">clint@fewbar.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Excerpts from Mike Bayer's message of 2015-06-12 09:42:42 -0700:<br>

<div><div>><br>

> On 6/12/15 11:37 AM, Mike Bayer wrote:<br>

> ><br>

> ><br>

> > On 6/11/15 9:32 PM, Eugene Nikanorov wrote:<br>

> >> Hi neutrons,<br>

> >><br>

> >> I'd like to draw your attention to an issue discovered by the rally gate job:<br>

> >> <a href="http://logs.openstack.org/96/190796/4/check/gate-rally-dsvm-neutron-rally/7a18e43/logs/screen-q-svc.txt.gz?level=TRACE" rel="noreferrer" target="_blank">http://logs.openstack.org/96/190796/4/check/gate-rally-dsvm-neutron-rally/7a18e43/logs/screen-q-svc.txt.gz?level=TRACE</a><br>

> >><br>

> >> I don't have bandwidth to take a deep look at it, but first<br>

> >> impression is that it is some issue with nested transaction support<br>

> >> either on sqlalchemy or pymysql side.<br>

> >> Also, besides errors with nested transactions, there are a lot of<br>

> >> Lock wait timeouts.<br>

> >><br>

> >> I think it makes sense to start with reverting the patch that moves<br>

> >> to pymysql.<br>

> > My immediate reaction is that this is perhaps a concurrency-related<br>

> > issue; because PyMySQL is pure python and allows for full blown<br>

> > eventlet monkeypatching, I wonder if somehow the same PyMySQL<br>

> > connection is being used in multiple contexts. E.g. one greenlet<br>

> > starts up a savepoint, using identifier "_3" which is based on a<br>

> > counter that is local to the SQLAlchemy Connection, but then another<br>

> > greenlet shares that PyMySQL connection somehow with another<br>

> > SQLAlchemy Connection that uses the same identifier.<br>
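A minimal sketch of the collision hypothesized above, using stand-in classes only: SharedDriverConnection and Wrapper are hypothetical and are not the PyMySQL or SQLAlchemy APIs. It assumes MySQL-like savepoint behaviour, where re-using a name replaces the earlier savepoint and releasing an unknown name is an error.<br>

```python
import itertools


class SharedDriverConnection:
    """Hypothetical stand-in for a single driver-level connection."""

    def __init__(self):
        self.active_savepoints = set()

    def savepoint(self, name):
        # Re-using a name silently replaces the earlier savepoint.
        self.active_savepoints.add(name)

    def release(self, name):
        # Releasing a name that is no longer active is an error.
        if name not in self.active_savepoints:
            raise RuntimeError("SAVEPOINT %s does not exist" % name)
        self.active_savepoints.remove(name)


class Wrapper:
    """Hypothetical higher-level connection with its own local counter."""

    def __init__(self, driver_conn):
        self.driver_conn = driver_conn
        self._counter = itertools.count(1)

    def begin_nested(self):
        # The identifier depends only on this wrapper's counter.
        name = "_%d" % next(self._counter)
        self.driver_conn.savepoint(name)
        return name

    def release(self, name):
        self.driver_conn.release(name)


def demonstrate_collision():
    shared = SharedDriverConnection()
    first = Wrapper(shared)
    second = Wrapper(shared)  # wrongly shares the same driver connection
    name_first = first.begin_nested()
    name_second = second.begin_nested()  # same name as name_first
    second.release(name_second)
    try:
        first.release(name_first)
    except RuntimeError as exc:
        return name_first, name_second, str(exc)
    return name_first, name_second, None
```

Calling demonstrate_collision() shows both wrappers choosing the same identifier, so the second release fails; whether this is what actually happens in the gate job is not established by the log.<br>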

><br>

> reading more of the log, it seems the main issue is just that there's a<br>

> deadlock on inserting into the securitygroups table.  The deadlock on<br>

> insert can be because of an index being locked.<br>

><br>

><br>

> I'd be curious to know how many greenlets are running concurrently here,<br>

> and what the overall transaction looks like within the operation that is<br>

> failing here (e.g. does each transaction insert multiple rows into<br>

> securitygroups?  that would make a deadlock seem more likely).<br>

<br>

</div></div>This raises two questions:<br>

<br>

1) Are we handling deadlocks with retries? It's important that we do<br>

that to be defensive.<br>

<br>

2) Are we being careful to sort the table order in any multi-table<br>

transactions so that we minimize the chance of cross-table deadlocks?<br>
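Both points can be sketched roughly as follows. This is an illustration under assumptions, not Neutron's actual code: DeadlockDetected, retry_on_deadlock, ordered_tables and make_flaky_insert are hypothetical names, and the exception stands in for whatever deadlock error the database layer really raises.<br>

```python
import functools
import time


class DeadlockDetected(Exception):
    """Hypothetical stand-in for the database layer's deadlock error."""


def retry_on_deadlock(max_attempts=3, delay=0.0):
    """Re-run the whole transaction function when a deadlock is reported."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except DeadlockDetected:
                    if attempt == max_attempts:
                        raise  # give up after the last attempt
                    time.sleep(delay * attempt)  # simple linear backoff
        return wrapper
    return decorator


def ordered_tables(tables):
    """Return table names in one canonical order for every transaction.

    If all transactions touch tables in the same order, two of them
    cannot each hold a lock the other is waiting for across tables.
    """
    return sorted(set(tables))


def make_flaky_insert(failures):
    """Build a fake insert that deadlocks `failures` times, then succeeds."""
    state = {"calls": 0, "remaining": failures}

    @retry_on_deadlock(max_attempts=3)
    def insert():
        state["calls"] += 1
        if state["remaining"]:
            state["remaining"] -= 1
            raise DeadlockDetected()
        return state["calls"]  # number of attempts it took

    return insert
```

For example, make_flaky_insert(2)() succeeds on the third attempt, while a function that deadlocks on every attempt still surfaces the error to the caller. The retry has to wrap the entire transaction, since the database rolls the transaction back when it picks a deadlock victim.<br>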

<div><div><br>

__________________________________________________________________________<br>

OpenStack Development Mailing List (not for usage questions)<br>

Unsubscribe: <a href="http://OpenStack-dev-request@lists.openstack.org?subject:unsubscribe" rel="noreferrer" target="_blank">OpenStack-dev-request@lists.openstack.org?subject:unsubscribe</a><br>

<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" rel="noreferrer" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>

</div></div></blockquote></div><br></div>

</div></div><br>

<br></blockquote></div><br></div>

</div></div></blockquote></div><br></div>