[openstack-dev] [Cinder] A possible solution for HA Active-Active

Gorka Eguileor geguileo at redhat.com
Wed Aug 5 09:03:37 UTC 2015


On Tue, Aug 04, 2015 at 08:40:13AM -0700, Joshua Harlow wrote:
> Clint Byrum wrote:
> >Excerpts from Devananda van der Veen's message of 2015-08-03 08:53:21 -0700:
> >>On Mon, Aug 3, 2015 at 8:41 AM Joshua Harlow <harlowja at outlook.com> wrote:
> >>
> >>>Clint Byrum wrote:
> >>>>Excerpts from Gorka Eguileor's message of 2015-08-02 15:49:46 -0700:
> >>>>>On Fri, Jul 31, 2015 at 01:47:22AM -0700, Mike Perez wrote:
> >>>>>>>On Mon, Jul 27, 2015 at 12:35 PM, Gorka Eguileor <geguileo at redhat.com>
> >>>>>>>wrote:
> >>>>>>>I know we've all been looking at the HA Active-Active problem in
> >>>>>>>Cinder
> >>>>>>>and trying our best to figure out possible solutions to the different
> >>>>>>>issues, and since current plan is going to take a while (because it
> >>>>>>>requires that we finish first fixing Cinder-Nova interactions), I've
> >>>>>>>been
> >>>>>>>looking at alternatives that allow Active-Active configurations
> >>>>>>>without
> >>>>>>>needing to wait for those changes to take effect.
> >>>>>>>
> >>>>>>>And I think I have found a possible solution, but since the HA A-A
> >>>>>>>problem has a lot of moving parts I ended up upgrading my initial
> >>>>>>>Etherpad notes to a post [1].
> >>>>>>>
> >>>>>>>Even if we decide that this is not the way to go, which we'll probably
> >>>>>>>do, I still think that the post brings a little clarity on all the
> >>>>>>>moving parts of the problem, even some that are not reflected on our
> >>>>>>>Etherpad [2], and it can help us not miss anything when deciding on a
> >>>>>>>different solution.
> >>>>>>Based on IRC conversations in the Cinder room and hearing people's
> >>>>>>opinions in the spec reviews, I'm not convinced the complexity that a
> >>>>>>distributed lock manager adds to Cinder for both developers and the
> >>>>>>operators who ultimately are going to have to learn to maintain things
> >>>>>>like ZooKeeper as a result is worth it.
> >>>>>>
> >>>>>>**Key point**: We're not scaling Cinder itself, it's about scaling to
> >>>>>>avoid build up of operations from the storage backend solutions
> >>>>>>themselves.
> >>>>>>
> >>>>>>Whatever "scaling level" people think ZooKeeper is going to accomplish
> >>>>>>is not even the question. We don't need it, because Cinder isn't as
> >>>>>>complex as people are making it out to be.
> >>>>>>
> >>>>>>I'd like to think the Cinder team is great at recognizing potential
> >>>>>>cross project initiatives. Look at what Thang Pham has done with
> >>>>>>Nova's version object solution. He made a generic solution into an
> >>>>>>Oslo solution for all, and Cinder is using it. That was awesome, and
> >>>>>>people really appreciated that there was a focus for other projects to
> >>>>>>get better, not just Cinder.
> >>>>>>
> >>>>>>Have people considered Ironic's hash ring solution? The project Akanda
> >>>>>>is now adopting it [1], and I think it might have potential. I'd
> >>>>>>appreciate it if interested parties could have this evaluated before
> >>>>>>the Cinder midcycle sprint next week, to be ready for discussion.
> >>>>>>
> >>>>>>[1] - https://review.openstack.org/#/c/195366/
> >>>>>>
> >>>>>>-- Mike Perez
> >>>>>Hi all,
> >>>>>
> >>>>>Since my original proposal was more complex than it needed to be, I have a
> >>>>>new proposal for a simpler solution, and I describe how we can do it with
> >>>>>or without a DLM since we don't seem to reach an agreement on that.
> >>>>>
> >>>>>The solution description was more rushed than the previous one, so I may have
> >>>>>missed some things.
> >>>>>
> >>>>>http://gorka.eguileor.com/simpler-road-to-cinder-active-active/
> >>>>>
> >>>>I like the idea of keeping it simpler Gorka. :)
> >>>>
> >>>>Note that this is punting back to "use the database for coordination",
> >>>>which is what most projects have done thus far, and has a number of
> >>>>advantages and disadvantages.
> >>>>
> >>>>Note that the stale-lock problem was solved in an interesting way in
> >>>>Heat:
> >>>>each engine process gets an "instance-of-engine" uuid that it adds to the
> >>>>topic queues it listens to. If it holds a lock, it records this UUID in
> >>>>the owner field. When somebody wants to steal the lock (due to timeout)
> >>>>they send to this queue, and if there's no response, the lock is stolen.
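
(As an aside, a rough sketch of the stealing flow described above; the
db.try_lock(), rpc_ping() and steal_lock() helpers are made-up placeholders
rather than Heat's or oslo.messaging's actual API, they only show the shape
of the flow:)

    import uuid

    # Each engine process gets its own id; it also listens on a topic that
    # includes this id, so other engines can "ping" it directly.
    ENGINE_ID = str(uuid.uuid4())

    def acquire_or_steal(resource, db, rpc_ping, steal_lock):
        owner = db.try_lock(resource, ENGINE_ID)   # None means we got the lock
        if owner is None:
            return True
        # Lock is held: ask the owner's private topic whether it is still alive.
        if rpc_ping(owner):
            return False                           # owner answered, back off
        # No answer within the timeout: assume the owner died and take over.
        return steal_lock(resource, old_owner=owner, new_owner=ENGINE_ID)
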
> >>>>
> >>>>Anyway, I think what might make more sense than copying that directly,
> >>>>is implementing "Use the database and oslo.messaging to build a DLM"
> >>>>as a tooz backend. This way as the negative aspects of that approach
> >>>>impact an operator, they can pick a tooz driver that satisfies their
> >>>>needs, or even write one to their specific backend needs.
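
(For reference, this is roughly what the consuming code looks like through
tooz today; the point being that the backend is just a URL in configuration,
whether it is ZooKeeper, etcd, or a hypothetical driver built on the database
and oslo.messaging as suggested above:)

    from tooz import coordination

    # Backend selection is configuration, not code.
    coordinator = coordination.get_coordinator('zookeeper://127.0.0.1:2181',
                                               b'cinder-volume-host-1')
    coordinator.start()

    lock = coordinator.get_lock(b'volume-<uuid>')
    with lock:
        # critical section: only one holder at a time across all hosts
        pass

    coordinator.stop()
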
> >>>Oh jeez, using 'the database and oslo.messaging to build a DLM' scares
> >>>me :-/
> >>>
> >>>There are already N + 1 DLM-like systems out there (and more every day
> >>>if you consider the list at
> >>>https://raftconsensus.github.io/#implementations) so I'd really rather
> >>>use one that is proven to work by academia vs make a frankenstein one.
> >>>
> >>>
> >>Joshua,
> >>
> >>As has been said on this thread, some projects (e.g., Ironic) are already
> >>using a consistent hash ring backed by a database to meet the requirements
> >>they have. Could those requirements also be met with some other tech? Yes.
> >>Would that provide additional functionality or some other benefits? Maybe.
> >>But that's not what this thread was about.
> >>
> >>Distributed hash rings are a well understood technique, as are databases.
> >>There's no need to be insulting by calling
> >>not-your-favorite-technology-of-the-day a frankenstein one.
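
(For anyone who has not looked at the technique, a minimal self-contained
sketch of a consistent hash ring; Ironic's real implementation adds
per-host partitions and failure handling on top of this idea:)

    import bisect
    import hashlib

    class HashRing(object):
        """Map a key (e.g. a volume id) to a host so that adding or removing
        a host only remaps a small fraction of the keys."""

        def __init__(self, hosts, replicas=100):
            self._ring = {}
            self._points = []
            for host in hosts:
                for i in range(replicas):
                    point = self._hash('%s-%d' % (host, i))
                    self._ring[point] = host
                    bisect.insort(self._points, point)

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode('utf-8')).hexdigest(), 16)

        def get_host(self, key):
            # First point at or after the key's hash, wrapping around the ring.
            idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
            return self._ring[self._points[idx]]

    ring = HashRing(['cinder-vol-1', 'cinder-vol-2', 'cinder-vol-3'])
    print(ring.get_host('volume-0c6f...'))  # always the same host for this key
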
> >>
> >>The topic here, which I've been eagerly following, is whether or not Cinder
> >>needs to use a DLM *at all*. Until that is addressed, discussing specific
> >>DLM or distributed KVS is not necessary.
> >>
> >
> >The hash ring has its own set of problems and it is not a magic pill. As
> >I said before and you know fully, Ironic uses a _distributed lock_ in the
> >database to make sure its very efficient hash ring doesn't accidentally
> >'splode the datacenter. And when those locks go stale, there are times
> >where manual intervention is required. The ways around that involve
> >communicating about hosts via the database or messaging layer. This is
> >also the way Heat does it. I'm sure others have done the same. All in
> >the name of avoiding efficient DLMs that already exist, are available and
> >robust, and have built-in mechanisms for this.
> >
> >So, I do think adding app-specific oslo.messaging calls to automate stale
> >database lock handling is fine. We all have jobs to do and we can't spend
> >all of our time asking all the other projects if they agree. However,
> >what's at question here is whether we should perhaps start looking for
> >better solutions that will benefit everyone. Now that we have a few
> >examples of OpenStack services reimplementing chunks of RAFT, do we want
> >to maybe consider just using well known common implementations of it?
> >
> >Also on a side note, I think Cinder's need for this is really subtle,
> >and one could just accept that sometimes it's going to break when it does
> >two things to one resource from two hosts. The error rate there might
> >even be lower than the false-error rate that would be caused by a twitchy
> >DLM with timeouts a little low. So there's a core cinder discussion that
> >keeps losing to the shiny DLM discussion, and I'd like to see it played
> >out fully: Could Cinder just not do anything, and let the few drivers
> >that react _really_ badly implement their own concurrency controls?
> 
> +1
> 
> Great question, I wonder what the Cinder folks think... Although I do start
> to wonder if the driver folks would be very angry, since it's really hard to
> take a cookie (aka the ability to have locks) back once it has already been
> given to them.
> 

I don't think Cinder should leave drivers on their own to look for
concurrency controls, as this has the potential to produce 10 different
solutions to the same problem and would make life harder not only for
driver developers, but for OpenStack administrators and reviewers as well.

If we remove DLM-based locks, Cinder would still need to provide some
kind of distributed locking mechanism to those drivers (even if, behind
the scenes, Cinder implements it using the DB).
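
To make that concrete, the DB variant can be as simple as a single atomic
conditional UPDATE behind whatever locking helper we hand to the drivers.
A minimal sketch, with a made-up table rather than Cinder's actual schema:

    import time
    import sqlalchemy as sa

    metadata = sa.MetaData()
    locks = sa.Table(
        'resource_locks', metadata,
        sa.Column('name', sa.String(255), primary_key=True),
        sa.Column('owner', sa.String(64), nullable=True),
        sa.Column('taken_at', sa.Float, nullable=True),
    )

    def acquire(engine, name, owner, stale_after=60.0):
        """Claim the lock row (assumed to exist already): the UPDATE only
        succeeds if the lock is free or its current holder looks stale."""
        now = time.time()
        with engine.begin() as conn:
            result = conn.execute(
                locks.update()
                .where(locks.c.name == name)
                .where(sa.or_(locks.c.owner.is_(None),
                              locks.c.taken_at < now - stale_after))
                .values(owner=owner, taken_at=now))
            return result.rowcount == 1

    def release(engine, name, owner):
        with engine.begin() as conn:
            conn.execute(locks.update()
                         .where(locks.c.name == name)
                         .where(locks.c.owner == owner)
                         .values(owner=None, taken_at=None))

Drivers would only ever see acquire/release (or a context manager around
them), so switching the backing implementation to a real DLM later would
not change the driver-facing API.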

Cheers,
Gorka.


