[openstack-dev] [Neutron] db-level locks, non-blocking algorithms, active/active DB clusters and IPAM
sorlando at nicira.com
Mon Feb 23 12:07:38 UTC 2015
I am doing some work on Neutron IPAM code for IP Allocation, and I need to
found whether it's better to use db locking queries (SELECT ... FOR UPDATE)
or some sort of non-blocking algorithm.
Some measures suggest that for this specific problem db-level locking is
more efficient even when using multi-master DB clusters, which kind of
counters recent findings by other contributors ... but also backs those
from others .
The long and boring thing:
In the past few months there's been a fair amount of discussion concerning
the use of locking queries (ie: SELECT ... FOR UPDATE) in various OpenStack
projects, especially in connection with the fact that this kind of queries
is likely to trigger write set certification failures in synchronous db
replication engines such as MySQL Galera - as pointed out in  and
thoroughly analysed in .
Neutron, in particular, uses this construct often - and the correctness of
the whole IPAM logic depends on it. On this regard Neutron is also guilty
of not implementing any sort of retry mechanism. Eugene, Rossella, and
others have been working towards fixing this issue. Some examples of their
work can be found at  and .
However, techniques such as "compare and swap" described in  can be used
for implementing non-blocking algorithms which avoid at all the write-set
A neutron example of such technique can be found in . However, a bug in
this process found by Attila , and the subsequent gerrit discussion ,
raised further questions on locking vs non-blocking solutions.
At this stage, we know that using database-level locks triggers write-set
certification failures in synchronous multi master modes. This failures are
reported as deadlock errors. Such errors can be dealt with retrying the
transaction. However, this is claimed as not efficient in , whereas
Attila in  argues the opposite.
As I'm rewriting Neutron's IPAM logic as a part of , which heavily
relies on this construct, I decided to spend some time to put everything to
To this aim I devised 3 algorithms.
A-1) The classic one which locks the availability ranges for a subnet until
allocation is complete, augmented with a simple and brutal retry mechanism
A-2) A wait-free 2 step process, in which the first one creates the IP
allocation entry, and the second completes it by adjusting availability
ranges . Primary key violations will trigger retries.
A-3) A wait-free, lock-free 3 step process , which adopts
compare-and-swap  and the same election process as the bully algorithm
Unfortunately neutron does not create IP records for every address in a
subnet's CIDR - this would be fairly bad for IPv6 networks. Therefore we
cannot simply apply the simple CAS technique introduced in .
I then tested them under arbitrary concurrent load from multiple workers.
The tests were performed first on a single node backend, and then a 3-node
mysql Galera cluster, installed and configured using the official
The single-node tests showed, as it might have been expected, that the
algorithm leveraging db-level locks is way more efficient 
While it's pointless discussing the full results, one important aspect here
is that with db-level locks are a lot more "gentle" with the database, and
this result in consistently faster execution times.
With 20 concurrent threads, every thread completed in about 0.06 seconds
with db-level locks (A-1). With A-2 it took about 0.45 seconds, and with
A-3 0.32. With A-2 we saw over 500 queries, a little more than 200 with
A-3, and just a bit more than 100 with A-1.
It's worth noting that what makes A-3 faster than A-2 is a random
allocation strategy which drastically reduces collisions. A-3's performance
with sequential allocations are much worse (the avg. thread runtime with 20
concurrent threads is about 0.75 seconds)
With the test on the Galera cluster I was expecting a terrible slowdown in
A-1 because of deadlocks caused by certification failures. I was extremely
disappointed that the slowdown I measured however does not make any of the
other algorithms a viable alternative.
On the Galera cluster I did not run extensive collections for A-2. Indeed
primary key violations seem to triggers db deadlock because of failed write
set certification too (but I have not yet tested this).
I run tests with 10 threads on each node, for a total of 30 workers. Some
results are available at . There was indeed a slow down in A-1 (about
20%), whereas A-3 performance stayed pretty much constant. Regardless, A-1
was still at least 3 times faster than A-3.
As A-3's queries are mostly select (about 75% of them) use of caches might
make it a lot faster; also the algorithm is probably inefficient and can be
optimised in several areas. Still, I suspect it can be made faster than
A-1. At this stage I am leaning towards adoption db-level-locks with
retries for Neutron's IPAM. However, since I never trust myself, I wonder
if there is something important that I'm neglecting and will hit me down
In the medium term, there are a few things we might consider for Neutron's
1) Move the allocation logic out of the driver, thus making IPAM an
independent service. The API workers will then communicate with the IPAM
service through a message bus, where IP allocation requests will be
2) Use 3-party software as dogpile, zookeeper but even memcached to
implement distributed coordination. I have nothing against it, and I reckon
Neutron can only benefit for it (in case you're considering of arguing that
"it does not scale", please also provide solid arguments to support your
claim!). Nevertheless, I do believe API request processing should proceed
undisturbed as much as possible. If processing an API requests requires
distributed coordination among several components then it probably means
that an asynchronous paradigm is more suitable for that API request.
Thanks for reading through this email.
(you can read it with the parse_results.py utility available in the same
(check results_node_1.json and results_node_2.json as well)
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the OpenStack-dev