[openstack-dev] [nova][neutron][mysql] IMPORTANT: MySQL Galera does *not* support SELECT ... FOR UPDATE

Mandeep Dhami dhami at noironetworks.com
Mon May 19 22:45:48 UTC 2014


> On a side note, I'm rather ignorant on python frameworks for distributed
coordination... concoord?
> Is zookeper something that should be ruled out because of language
restrictions?

I have not used zookeeper, so there might be reasons to use it where we
have to write java code,
but as long as zookeeper is added as a "package" dependency (apt-get/yum
install zookeeper),
with a python front-end* the language should not matter.

*  (like kazoo? I will need to investigate)

Regards,
Mandeep


On Mon, May 19, 2014 at 3:26 PM, Salvatore Orlando <sorlando at nicira.com>wrote:

> Some comments inline.
>
> Salvatore
>
>
> On 19 May 2014 20:32, sridhar basam <sridhar.basam at gmail.com> wrote:
>
>>
>>
>>
>> On Mon, May 19, 2014 at 1:30 PM, Jay Pipes <jaypipes at gmail.com> wrote:
>>
>>> Stackers,
>>>
>>> On Friday in Atlanta, I had the pleasure of moderating the database
>>> session at the Ops Meetup track. We had lots of good discussions and heard
>>> important feedback from operators on DB topics.
>>>
>>> For the record, I would not bring this point up so publicly unless I
>>> believed it was a serious problem affecting a large segment of users. When
>>> doing an informal survey of the users/operators in the room at the start of
>>> the session, out of approximately 200 people in the room, only a single
>>> person was using PostgreSQL, about a dozen were using standard MySQL
>>> master/slave replication, and the rest were using MySQL Galera clustering.
>>> So, this is a real issue for a large segment of the operators -- or at
>>> least the ones at the session. :)
>>>
>>>
>> ​We are one of those operators that use Galera for replicating our mysql
>> databases. We used to  see issues with deadlocks when having multiple mysql
>> writers in our mysql cluster. As a workaround we have our haproxy
>> configuration in an active-standby configuration for our mysql VIP.
>>
>> I seem to recall we had a lot of the deadlocks happen through Neutron.
>> When we go through our Icehouse testing, we will redo our multimaster mysql
>> setup and provide feedback on the issues we see.
>>
>
> The SELECT... FOR UPDATE issue is going to be a non trivial one for
> neutron as well. Some components, like IPAM, heavily rely on it.
> However, Neutron is a lot more susceptible to deadlock problems than nova
> because it does not implement at the moment a retry mechanism.
> This is something which should be added during the Juno release cycle
> regardless of all the other enhancement currently being planned, such as
> task oriented operations.
>
>>
>> thanks,
>>          Sridhar
>>
>>
>>
>>> Peter Boros, from Percona, was able to provide some insight on MySQL
>>> Galera topics, and one issue came up that is likely the cause of a lot of
>>> heartache for operators who use MySQL Galera (or Percona XtraDB Cluster).
>>>
>>> We were discussing whether people had seen deadlock issues [1] when
>>> using MySQL Galera in their deployment, and were brainstorming on why
>>> deadlocks might be seen. I had suggested that perhaps Nova's use of
>>> autoincrementing primary keys may have been the cause. Peter pretty quickly
>>> dispatched that notion, saying that Galera automatically handles
>>> autoincrementing keys using managed innodb_autoincrement_increment and
>>> innodb_autoincrement_offset config options.
>>>
>>> I think at that point I mentioned that there were a number of places
>>> that were using the SELECT ... FOR UPDATE construct in Nova (in SQLAlchemy,
>>> it's the with_lockmode('update') modification of the query object). Peter
>>> promptly said that was a problem. MySQL Galera does not support SELECT ...
>>> FOR UPDATE, since it has no concept of cross-node locking of records and
>>> results are non-deterministic.
>>>
>>> So... what to do?
>>>
>>> For starters, some information on the use of with_lockmode() in Nova and
>>> Neutron...
>>>
>>> Within Nova, there are actually only a few places where
>>> with_lockmode('update') is used. Unfortunately, the use of
>>> with_lockmode('update') is in the quota code, which tends to wrap largish
>>> blocks of code within the Nova compute execution code.
>>>
>>> Within Neutron, however, the use of with_lockmode('update') is all over
>>> the place. There are 44 separate uses of it in 11 different files.
>>>
>>>
> I will report on a separate thread on this, so that we can have an
> assessment of where locking statements are used and why.
>
>
>>  We have a number of options:
>>>
>>
> I thin option 0 should be to rework/redesign the code, where possible, to
> avoid DB-level locking at all.
>
>
>>
>>> 1) Stop using MySQL Galera for databases of projects that contain
>>> with_lockmode('update')
>>>
>>
> This looks hideous, but I am afraid this is what all people wishing to
> deploy Icehouse should consider doing.
>
>
>>
>>> 2) Put a big old warning in the docs somewhere about the problem of
>>> potential deadlocks or odd behaviour with Galera in these projects
>>>
>>> 3) For Nova and Neutron, remove the use of with_lockmode('update') and
>>> instead use a coarse-grained file lock or a distributed lock manager for
>>> those areas where we need deterministic reads or quiescence.
>>>
>>
> We had an attempt at implementing a sort of distributed lock for neutron:
> https://review.openstack.org/#/c/34695/
> Beyond the implementation reservations on this patch, one thing that
> should be noticed, probably needless to say, is that distributed
> coordination is something that should never be taken in a light-hearted way.
> Once all the non-locking solution have been ruled out, distributed
> coordination among processes could be considered. In that case I think it
> might be better to use some OTS software rather than working out some home
> grown solution (I surely do not see space for a new project here)
> On a side note, I'm rather ignorant on python frameworks for distributed
> coordination... concoord? Is zookeper something that should be ruled out
> because of language restrictions?
>
>
>
>>
>>> 4) For the Nova db quota driver, refactor the driver to either use a
>>> non-locking method for reservation and quota queries or move the driver out
>>> into its own projects (or use something like Climate and make sure that
>>> Climate uses a non-blocking algorithm for those queries...)
>>>
>>> Thoughts?
>>>
>>> -jay
>>>
>>> [1] http://lists.openstack.org/pipermail/openstack/2014-May/007202.html
>>>
>>> _______________________________________________
>>> OpenStack-dev mailing list
>>> OpenStack-dev at lists.openstack.org
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>
>>
>>
>> _______________________________________________
>> OpenStack-dev mailing list
>> OpenStack-dev at lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>>
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openstack.org/pipermail/openstack-dev/attachments/20140519/6f0f26cc/attachment.html>


More information about the OpenStack-dev mailing list