[openstack-dev] [Neutron] Issue with pymysql
Armando M.
armamig at gmail.com
Wed Jul 8 18:21:10 UTC 2015
Hi,
Another brief update on the matter:
Failure rate trends [1] show that the unstable configuration (with multiple
API workers + the pymysql driver) and the stable configuration (without) are
now virtually aligned, so I am proposing that it is time to drop the unstable
infra configuration [2,3] that allowed the team to triage, experiment, and
get to a solution. I'll watch [1] a little longer before claiming that we're
out of the woods.
Cheers,
Armando
[1] http://goo.gl/YM7gUC
[2] https://review.openstack.org/#/c/199668/
[3] https://review.openstack.org/#/c/199672/
On 22 June 2015 at 14:10, Armando M. <armamig at gmail.com> wrote:
> Hi,
>
> A brief update on the issue that sparked this thread:
>
> A little over a week ago, bug [1] was filed. The gist of it is that the
> switch to pymysql unveiled a number of latent race conditions that made
> Neutron unstable.
>
> To try and nip these in the bud, the Neutron team filed a number of
> patches [2] to create an unstable configuration that would allow them to
> troubleshoot and experiment with a solution, while still keeping stability
> in check (a preliminary proposal for a fix is available in [4]).
>
> The latest failure rate trend is shown in [3]; as you can see, we're still
> gathering data, but it seems that the instability gap between the two jobs
> (stable vs unstable) has widened, which should give us plenty of data
> points to devise a resolution strategy.
>
> I have documented the most recurrent traces in the bug report [1].
>
> I will update once we manage to get the two curves to kiss each other
> again, closer to a more acceptable failure rate.
>
> Cheers,
> Armando
>
> [1] https://bugs.launchpad.net/neutron/+bug/1464612
> [2] https://review.openstack.org/#/q/topic:neutron-unstable,n,z
> [3] http://goo.gl/YM7gUC
> [4] https://review.openstack.org/#/c/191540/
>
>
> On 12 June 2015 at 11:13, Boris Pavlovic <bpavlovic at mirantis.com> wrote:
>
>> Sean,
>>
>> Thanks for the quick fix/revert: https://review.openstack.org/#/c/191010/
>> This unblocked the Rally gates.
>>
>> Best regards,
>> Boris Pavlovic
>>
>> On Fri, Jun 12, 2015 at 8:56 PM, Clint Byrum <clint at fewbar.com> wrote:
>>
>>> Excerpts from Mike Bayer's message of 2015-06-12 09:42:42 -0700:
>>> >
>>> > On 6/12/15 11:37 AM, Mike Bayer wrote:
>>> > >
>>> > >
>>> > > On 6/11/15 9:32 PM, Eugene Nikanorov wrote:
>>> > >> Hi neutrons,
>>> > >>
>>> > >> I'd like to draw your attention to an issue discovered by the rally
>>> > >> gate job:
>>> > >>
>>> > >> http://logs.openstack.org/96/190796/4/check/gate-rally-dsvm-neutron-rally/7a18e43/logs/screen-q-svc.txt.gz?level=TRACE
>>> > >>
>>> > >> I don't have the bandwidth to take a deep look at it, but my first
>>> > >> impression is that it is some issue with nested transaction support
>>> > >> on either the sqlalchemy or the pymysql side.
>>> > >> Also, besides the errors with nested transactions, there are a lot
>>> > >> of lock wait timeouts.
>>> > >>
>>> > >> I think it makes sense to start by reverting the patch that moves
>>> > >> to pymysql.
>>> > > My immediate reaction is that this is perhaps a concurrency-related
>>> > > issue; because PyMySQL is pure Python and allows for full-blown
>>> > > eventlet monkeypatching, I wonder if somehow the same PyMySQL
>>> > > connection is being used in multiple contexts. E.g. one greenlet
>>> > > starts a savepoint using identifier "_3", which is based on a
>>> > > counter that is local to the SQLAlchemy Connection, but then another
>>> > > greenlet somehow shares that PyMySQL connection with another
>>> > > SQLAlchemy Connection that uses the same identifier.
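>>> > >
>>> > > A minimal sketch of that hypothesized failure mode (hypothetical,
>>> > > not Neutron's code; assumes a local MySQL server with a 'test'
>>> > > database and user):
>>> > >
>>> > >     import eventlet
>>> > >     eventlet.monkey_patch()  # PyMySQL is pure Python, so its socket
>>> > >                              # I/O becomes cooperative under eventlet
>>> > >
>>> > >     import pymysql
>>> > >
>>> > >     # One connection incorrectly shared by two greenlets
>>> > >     conn = pymysql.connect(host="localhost", user="test",
>>> > >                            password="test", db="test")
>>> > >
>>> > >     def worker():
>>> > >         with conn.cursor() as cur:
>>> > >             # Both greenlets derive the same savepoint name, as a
>>> > >             # per-Connection counter would; the second SAVEPOINT
>>> > >             # silently replaces the first
>>> > >             cur.execute("SAVEPOINT sa_savepoint_1")
>>> > >             eventlet.sleep(0)  # yield; the other greenlet interleaves
>>> > >             # The first RELEASE consumes the shared name; the second
>>> > >             # fails with MySQL error 1305, "SAVEPOINT sa_savepoint_1
>>> > >             # does not exist"
>>> > >             cur.execute("RELEASE SAVEPOINT sa_savepoint_1")
>>> > >
>>> > >     pool = eventlet.GreenPool()
>>> > >     pool.spawn(worker)
>>> > >     pool.spawn(worker)
>>> > >     pool.waitall()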
>>> >
>>> > Reading more of the log, it seems the main issue is just that there's
>>> > a deadlock on inserting into the securitygroups table. The deadlock on
>>> > insert can be caused by an index being locked.
>>> >
>>> > I'd be curious to know how many greenlets are running concurrently
>>> > here, and what the overall transaction looks like within the operation
>>> > that is failing (e.g. does each transaction insert multiple rows into
>>> > securitygroups? That would make a deadlock seem more likely).
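>>> >
>>> > A hedged repro of that insert-deadlock pattern (not Neutron's code;
>>> > assumes a local MySQL server with a table sg (id INT PRIMARY KEY)):
>>> >
>>> >     import threading
>>> >     import pymysql
>>> >
>>> >     barrier = threading.Barrier(2)
>>> >
>>> >     def txn(insert_id):
>>> >         conn = pymysql.connect(host="localhost", user="test",
>>> >                                password="test", db="test")
>>> >         with conn.cursor() as cur:
>>> >             # A ranged SELECT ... FOR UPDATE takes a gap lock on the
>>> >             # index; gap locks held by different transactions are
>>> >             # compatible with each other
>>> >             cur.execute("SELECT id FROM sg WHERE id > 100 FOR UPDATE")
>>> >             barrier.wait()  # ensure both transactions hold gap locks
>>> >             # Each INSERT now needs an insert-intention lock that
>>> >             # conflicts with the other transaction's gap lock; InnoDB
>>> >             # detects the cycle and aborts one side with error 1213
>>> >             # (ER_LOCK_DEADLOCK)
>>> >             cur.execute("INSERT INTO sg (id) VALUES (%s)", (insert_id,))
>>> >         conn.commit()
>>> >
>>> >     threads = [threading.Thread(target=txn, args=(i,))
>>> >                for i in (101, 102)]
>>> >     for t in threads:
>>> >         t.start()
>>> >     for t in threads:
>>> >         t.join()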
>>>
>>> This raises two questions:
>>>
>>> 1) Are we handling deadlocks with retries? It's important that we do
>>> that to be defensive.
>>>
>>> 2) Are we being careful to sort the table order in any multi-table
>>> transactions, so that we minimize the chance of cross-table deadlocks?
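>>>
>>> For (1), a minimal sketch of such a retry wrapper (hypothetical, not
>>> what Neutron ships; oslo.db's wrap_db_retry provides similar machinery
>>> at the session layer):
>>>
>>>     import time
>>>     import pymysql
>>>
>>>     # MySQL error codes: ER_LOCK_DEADLOCK, ER_LOCK_WAIT_TIMEOUT
>>>     RETRYABLE = (1213, 1205)
>>>
>>>     def run_with_retry(conn, txn_fn, max_retries=5):
>>>         """Run txn_fn(conn) in a transaction, retrying on deadlock."""
>>>         for attempt in range(1, max_retries + 1):
>>>             try:
>>>                 txn_fn(conn)
>>>                 conn.commit()
>>>                 return
>>>             except pymysql.err.OperationalError as exc:
>>>                 conn.rollback()
>>>                 if exc.args[0] not in RETRYABLE or attempt == max_retries:
>>>                     raise
>>>                 time.sleep(0.1 * attempt)  # back off before retrying
>>>
>>>     # For (2), whatever txn_fn does, it should touch tables in a fixed
>>>     # (e.g. sorted) order so concurrent transactions acquire their
>>>     # locks consistently, e.g.:
>>>     #     run_with_retry(conn, create_security_group)  # hypothetical fn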