[openstack-dev] [Neutron] db-level locks, non-blocking algorithms, active/active DB clusters and IPAM

Salvatore Orlando sorlando at nicira.com
Wed Feb 25 16:33:04 UTC 2015


On 25 February 2015 at 16:52, Carl Baldwin <carl at ecbaldwin.net> wrote:

> On Mon, Feb 23, 2015 at 5:07 AM, Salvatore Orlando <sorlando at nicira.com>
> wrote:
> > Lazy-Stacker summary:
> > I am doing some work on Neutron IPAM code for IP allocation, and I need
> > to find out whether it's better to use db locking queries (SELECT ...
> > FOR UPDATE) or some sort of non-blocking algorithm.
> > Some measures suggest that for this specific problem db-level locking is
> > more efficient even when using multi-master DB clusters, which somewhat
> > counters recent findings by other contributors [2]... but also backs
> > those from others [7].
> >
> > The long and boring thing:
> >
> > In the past few months there's been a fair amount of discussion
> > concerning the use of locking queries (i.e.: SELECT ... FOR UPDATE) in
> > various OpenStack projects, especially in connection with the fact that
> > this kind of query is likely to trigger write-set certification failures
> > in synchronous db replication engines such as MySQL Galera - as pointed
> > out in [1] and thoroughly analysed in [2].
> > Neutron, in particular, uses this construct often - and the correctness
> > of the whole IPAM logic depends on it. In this regard Neutron is also
> > guilty of not implementing any sort of retry mechanism. Eugene,
> > Rossella, and others have been working towards fixing this issue. Some
> > examples of their work can be found at [3] and [4].
> >
> > However, techniques such as "compare and swap" described in [2] can be
> > used for implementing non-blocking algorithms which avoid the write-set
> > certification issue entirely.
> > A neutron example of such a technique can be found in [5]. However, a
> > bug in this process found by Attila [6], and the subsequent gerrit
> > discussion [7], raised further questions on locking vs non-blocking
> > solutions.
> > At this stage, we know that using database-level locks triggers write-set
> > certification failures in synchronous multi-master modes. These failures
> > are reported as deadlock errors. Such errors can be dealt with by
> > retrying the transaction. However, this is claimed to be inefficient in
> > [2], whereas Attila in [7] argues the opposite.
> >
> > As I'm rewriting Neutron's IPAM logic as a part of [8], which heavily
> > relies on this construct, I decided to spend some time to put everything
> > to the test.
> > To this aim I devised 3 algorithms.
> >
> > A-1) The classic one, which locks the availability ranges for a subnet
> > until allocation is complete, augmented with a simple and brutal retry
> > mechanism [9].
>
> So, similar to today's implementation.
>
> > A-2) A wait-free 2 step process, in which the first one creates the IP
> > allocation entry, and the second completes it by adjusting availability
> > ranges [10]. Primary key violations will trigger retries.
>
> My summary:  queries RESERVED ip requests and subtracts them from
> availability.  Attempts to generate an IP from the resulting set.
> Adjust availability in an idempotent way as a post-step.  Does that
> sound about right?
>

Correct.
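
To make this concrete, here is a rough sketch of the two-step flow. The
model names, attributes and the range-recalculation helper below are
simplified placeholders, not the actual code under review at [10]:

    import netaddr
    from sqlalchemy import exc as sa_exc

    def allocate_ip_a2(session, subnet_id, max_retries=10):
        for attempt in range(max_retries):
            # Step 1: read availability ranges and existing allocation
            # entries without any SELECT ... FOR UPDATE, and build the
            # candidate set. (Enumerating the whole range is only for
            # illustration; real code would work on range boundaries.)
            ranges = session.query(IPAvailabilityRange).filter_by(
                subnet_id=subnet_id).all()
            reserved = set(a.ip_address for a in
                           session.query(IPAllocation).filter_by(
                               subnet_id=subnet_id))
            candidates = [str(ip) for r in ranges
                          for ip in netaddr.IPRange(r.first_ip, r.last_ip)
                          if str(ip) not in reserved]
            if not candidates:
                raise Exception("No more IP addresses available on subnet")
            ip = candidates[0]  # sequential flavor; pick at random otherwise
            # Step 2: create the allocation entry. The primary key on
            # (subnet_id, ip_address) is the synchronization point: the
            # loser of a race gets an integrity error and simply retries.
            try:
                session.add(IPAllocation(subnet_id=subnet_id, ip_address=ip))
                session.flush()
            except sa_exc.IntegrityError:
                session.rollback()
                continue
            # Post-step: adjust availability ranges idempotently
            # (placeholder helper).
            recalculate_availability_ranges(session, subnet_id)
            return ip
        raise Exception("Could not allocate an IP after %d attempts"
                        % max_retries)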

>
> Seems there are four flavors of this.  Did you compare them?  Which
> one did you use to compare with the other two?
>

There are indeed 4 flavors: random and sequential, and, for each of them,
with or without a preliminary range check which verifies whether the IP
address has already been removed by a concurrent thread and, if so, skips
the attempt.

Random allocation works terribly with availability ranges. This is mainly
because adjusting them with random allocation in the great majority of
cases means doing 1 update + 1 insert query at least, whereas sequential
allocation usually results in a single update query.
The range check before retrying brings improvements in the majority of
cases tested in this experimental campaign. However, its effectiveness
decreases when the concurrency level is low, so it might not be the best
solution.
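
To illustrate why sequential allocation is gentler on the availability
ranges, here is a sketch of the range adjustment in the two cases (model
and attribute names are illustrative only): taking the first or last
address of a range merely shrinks it, while taking an address in the
middle splits it in two.

    import netaddr

    def adjust_range_for(session, avail_range, allocated_ip):
        first = netaddr.IPAddress(avail_range.first_ip)
        last = netaddr.IPAddress(avail_range.last_ip)
        ip = netaddr.IPAddress(allocated_ip)
        if first == last:
            # the range contained a single address: drop it (1 delete)
            session.delete(avail_range)
        elif ip == first:
            # sequential flavor: shrink the range from the left (1 update)
            avail_range.first_ip = str(first + 1)
        elif ip == last:
            # shrink the range from the right (1 update)
            avail_range.last_ip = str(last - 1)
        else:
            # random flavor: split the range in two (1 update + 1 insert)
            session.add(IPAvailabilityRange(subnet_id=avail_range.subnet_id,
                                            first_ip=str(ip + 1),
                                            last_ip=str(last)))
            avail_range.last_ip = str(ip - 1)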


> > A-3) A wait-free, lock-free 3 step process [11], which adopts
> > compare-and-swap [2] and the same election process as the bully algorithm
> > [12].
>
> Also, 2 flavors of this one:  random and sequential.  I'll admit I did
> not fully grok this implementation.  It looks to me like it would be
> inefficient.
>

It sounds inefficient because we are making very few assumptions about the
underlying database. I think its effectiveness can be improved;
nevertheless, it is its complexity that worries me.
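
For reference, the core compare-and-swap step is roughly the following
(this is in the spirit of [2], not the actual code at [11], and it omits
the bully-style election entirely): rewrite the range only if it still
looks the way we read it, and treat an affected row count of zero as
"lost the race".

    def cas_shrink_range(session, range_id, old_first_ip, new_first_ip):
        # The update succeeds only if nobody changed the range between our
        # read and our write; rowcount == 0 means we lost the race and must
        # re-read (or pick another address).
        rows = session.query(IPAvailabilityRange).filter_by(
            id=range_id, first_ip=old_first_ip).update(
                {'first_ip': new_first_ip}, synchronize_session=False)
        return bool(rows)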

>
> > Unfortunately neutron does not create IP records for every address in a
> > subnet's CIDR - this would be fairly bad for IPv6 networks. Therefore we
> > cannot simply apply the simple CAS technique introduced in [2].
>
> What if it created records for only N addresses and did the same sort
> of thing with those?  That would mitigate the issue with IPv6 and
> large IPv4 subnets.  If those N are exhausted then we'd have to run a
> routine to create more records.  Multiple workers could run that
> routine at the same time but I imagine that they would all attempt to
> insert the same set of new records and only one would win.  The others
> would hit key violations and try to allocate again.
>

I thought of the same strategy but did not implement it. Nevertheless it
might be a good candidate, and we should put it to the test.
In this case we should test how it scales with subnets where a large
amount of IPs is allocated. We need to ensure select queries are optimized,
as they might analyse tens of thousands of records or even more.
The aspect of extending the pool of IPs when the chunk is exhausted is
another interesting one. As you note, multiple workers might clash there.
In my opinion, since this is an infrequent operation we might find a way to
designate a single worker which does this kind of thing. This is something
neutron is currently unable to do, but it should not be difficult.

My intention was to have a chunk of 100 addresses and a minimum spare chunk
size of 50 - which would allow up to 50 concurrent allocations without
issues. Whenever the amount of free IPs went below 50, I would trigger a
thread that added more addresses until the number of available ones went
back up to 100.

One more thing we should watch out for is specific IP allocations. It's not
hard to reproduce a situation where one allocates a specific IP while, at
the same time, that IP is being allocated as part of a chunk.

If you want to add this algorithm I'd be happy to test it.
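
Just to fix the idea, a sketch of what the allocation path could look like
with pre-created records; the IPRequest model, the AVAILABLE/ALLOCATED
statuses and the refill helper are all made up for the example:

    CHUNK_SIZE = 100
    MIN_SPARE = 50

    def allocate_from_chunk(session, subnet_id, port_id):
        free = session.query(IPRequest).filter_by(
            subnet_id=subnet_id, status='AVAILABLE').all()
        if len(free) < MIN_SPARE:
            # Asynchronous top-up; concurrent refills lose on primary key
            # violations, or are avoided by a designated worker.
            trigger_chunk_refill(subnet_id, CHUNK_SIZE)
        for candidate in free:
            # Grabbing an address is a single update; the rowcount tells us
            # whether we won the race for this record.
            won = session.query(IPRequest).filter_by(
                subnet_id=subnet_id,
                ip_address=candidate.ip_address,
                status='AVAILABLE').update(
                    {'status': 'ALLOCATED', 'port_id': port_id},
                    synchronize_session=False)
            if won:
                return candidate.ip_address
        raise Exception("Pre-created chunk exhausted, retry after refill")

The specific-IP case would probably need the same update-with-rowcount
treatment against the pre-created record for that address.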


>
> > I then tested them under arbitrary concurrent load from multiple workers.
> > The tests were performed first on a single-node backend, and then on a
> > 3-node mysql Galera cluster, installed and configured using the official
> > documentation [13].
> > The single-node tests showed, as might have been expected, that the
> > algorithm leveraging db-level locks is way more efficient [14].
> > While it's pointless discussing the full results, one important aspect
> > here is that db-level locks are a lot more "gentle" with the database,
> > and this results in consistently faster execution times.
> > With 20 concurrent threads, every thread completed in about 0.06 seconds
> > with db-level locks (A-1). With A-2 it took about 0.45 seconds, and with
> > A-3 0.32. With A-2 we saw over 500 queries, a little more than 200 with
> > A-3, and just a bit more than 100 with A-1.
> > It's worth noting that what makes A-3 faster than A-2 is a random
> > allocation strategy which drastically reduces collisions. A-3's
> > performance with sequential allocation is much worse (the avg. thread
> > runtime with 20 concurrent threads is about 0.75 seconds).
> >
> > With the test on the Galera cluster I was expecting a terrible slowdown
> > in A-1 because of deadlocks caused by certification failures. I was
> > extremely disappointed to find that the slowdown I measured, however,
> > does not make any of the other algorithms a viable alternative.
> > On the Galera cluster I did not run extensive collections for A-2.
> > Indeed, primary key violations seem to trigger db deadlocks because of
> > failed write-set certification too (but I have not yet tested this).
> > I ran tests with 10 threads on each node, for a total of 30 workers. Some
> > results are available at [15]. There was indeed a slowdown in A-1 (about
> > 20%), whereas A-3's performance stayed pretty much constant. Regardless,
> > A-1 was still at least 3 times faster than A-3.
> > As A-3's queries are mostly selects (about 75% of them), use of caches
> > might make it a lot faster; also, the algorithm is probably inefficient
> > and can be optimised in several areas. Still, I suspect it can be made
> > faster than A-1.
> > At this stage I am leaning towards adopting db-level locks with retries
> > for Neutron's IPAM. However, since I never trust myself, I wonder if
> > there is something important that I'm neglecting that will hit me down
> > the road.
>
> Wondering if Kilo should just focus on creating the interface which
> will allow us to create multiple implementations and swap them out
> during the Liberty development cycle.  Hopefully, this could include
> even something like your option 2 below.
>
> I think I would go with the locking solution first because it is most
> like what we have today.
>

Indeed. While implementing those algorithms I realised that the approach
you suggest here is the best one for Kilo.
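
To make the "interface first" point concrete, something as small as the
following would already let us swap implementations during Liberty (this
is purely illustrative, not the interface proposed in [8]):

    import abc

    import six


    @six.add_metaclass(abc.ABCMeta)
    class SubnetIPAllocator(object):
        """Minimal contract every allocation backend would implement."""

        @abc.abstractmethod
        def allocate(self, context, subnet_id, ip_address=None):
            """Return a free address, or ip_address if specified and free."""

        @abc.abstractmethod
        def deallocate(self, context, subnet_id, ip_address):
            """Return ip_address to the pool of available addresses."""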


>
> > In the medium term, there are a few things we might consider for
> > Neutron's "built-in IPAM".
> > 1) Move the allocation logic out of the driver, thus making IPAM an
> > independent service. The API workers will then communicate with the IPAM
> > service through a message bus, where IP allocation requests will be
> > "naturally serialized".
> > 2) Use 3rd-party software such as dogpile, zookeeper or even memcached to
> > implement distributed coordination. I have nothing against it, and I
> > reckon Neutron can only benefit from it (in case you're considering
> > arguing that "it does not scale", please also provide solid arguments to
> > support your claim!). Nevertheless, I do believe API request processing
> > should proceed undisturbed as much as possible. If processing an API
> > request requires distributed coordination among several components, then
> > it probably means that an asynchronous paradigm is more suitable for that
> > API request.
>
> I think we need to look in to these options for Liberty.  I have had
> both options on my mind for a while.  I think I lean toward option 2
> first and option 1 as a last resort.  What do you think?
>

Ideally I'd lean towards option 2 as well. The current structure of the
code makes me think we need to settle for option 1, which is decent anyway,
especially if the plan around using multi-master clusters as master/slave
wrt writes goes ahead.
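
For what it's worth, the critical section under option 2 would look roughly
like the following with ZooKeeper via kazoo (hosts and lock path are made
up here; dogpile or memcached would need an equivalent lock primitive):

    from kazoo.client import KazooClient

    zk = KazooClient(hosts='127.0.0.1:2181')
    zk.start()

    def allocate_with_external_lock(subnet_id, do_allocate):
        # One lock per subnet: serializes IP allocation across all API
        # workers, regardless of which node they run on.
        lock = zk.Lock('/neutron/ipam/subnet-%s' % subnet_id)
        with lock:
            return do_allocate(subnet_id)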

>
> > Thanks for reading through this email.
>
> You're welcome.  :)
>
> Carl
>