[openstack-dev] [Cinder] A possible solution for HA Active-Active

Gorka Eguileor geguileo at redhat.com
Fri Jul 31 11:35:31 UTC 2015


On Fri, Jul 31, 2015 at 01:47:22AM -0700, Mike Perez wrote:
> On Mon, Jul 27, 2015 at 12:35 PM, Gorka Eguileor <geguileo at redhat.com> wrote:
> > I know we've all been looking at the HA Active-Active problem in Cinder
> > and trying our best to figure out possible solutions to the different
> > issues, and since current plan is going to take a while (because it
> > requires that we finish first fixing Cinder-Nova interactions), I've been
> > looking at alternatives that allow Active-Active configurations without
> > needing to wait for those changes to take effect.
> >
> > And I think I have found a possible solution, but since the HA A-A
> > problem has a lot of moving parts I ended up upgrading my initial
> > Etherpad notes to a post [1].
> >
> > Even if we decide that this is not the way to go, which we'll probably
> > do, I still think that the post brings a little clarity on all the
> > moving parts of the problem, even some that are not reflected on our
> > Etherpad [2], and it can help us not miss anything when deciding on a
> > different solution.
> 
> Based on IRC conversations in the Cinder room and hearing people's
> opinions in the spec reviews, I'm not convinced the complexity that a
> distributed lock manager adds to Cinder, for both developers and the
> operators who will ultimately have to learn to maintain things like
> ZooKeeper, is worth it.

Hi Mike,

I think you are right to bring up the cost that adding a DLM to the
solution imposes on operators, as it is something important to take
into consideration.  I would like to say that Ceilometer is already
using Tooz, so operators are already familiar with these DLMs, but
unfortunately that would be stretching the truth: Cinder is present in
73% of OpenStack production workloads while Ceilometer is only in 33%
of them, so we would certainly be disturbing some operators.

But we must not forget that the only operators who would need to worry
about deploying and maintaining the DLM are those wanting to deploy
Active-Active configurations (for Active-Passive configurations Tooz
will keep working with local file locks, as we do now), and some of
those may think like Duncan does: "I already have to administer rabbit,
mysql, backends, horizon, load balancers, rate limiters...  adding
redis isn't going to make it that much harder".

That's why I don't think this is such a big deal for the vast majority
of operators.

On the developer side I have to disagree: for non Active-Active
deployments there is no difference between using Tooz and using the
current oslo synchronization mechanism.
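
To illustrate what I mean, here is a rough sketch of how today's
external file lock and a Tooz lock using the file driver line up for a
single-node deployment (lock names, paths and the host name are made up
for the example):

    from oslo_concurrency import lockutils
    from tooz import coordination


    # What we effectively do today: an external file lock scoped to one host.
    @lockutils.synchronized('volume-1234-delete', external=True,
                            lock_path='/var/lib/cinder/locks')
    def delete_volume_oslo():
        pass  # critical section


    # The same semantics through Tooz when configured with the file driver;
    # only the backend URL would change to get a DLM for Active-Active.
    coordinator = coordination.get_coordinator(
        'file:///var/lib/cinder/locks', b'cinder-volume.host1')
    coordinator.start()


    def delete_volume_tooz():
        with coordinator.get_lock(b'volume-1234-delete'):
            pass  # critical section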

> 
> **Key point**: We're not scaling Cinder itself, it's about scaling to
> avoid build up of operations from the storage backend solutions
> themselves.

You must also consider that an Active-Active solution will help
deployments where downtime is not an option or where SLAs impose uptime
or operational requirements; it's not only about increasing the volume
of operations and reducing response times.

> 
> Whatever people think ZooKeeper "scaling level" is going to accomplish
> is not even a question. We don't need it, because Cinder isn't as
> complex as people are making it.
> 
> I'd like to think the Cinder team is great at recognizing potential
> cross-project initiatives. Look at what Thang Pham has done with
> Nova's version object solution. He made a generic solution into an
> Oslo solution for all, and Cinder is using it. That was awesome, and
> people really appreciated that there was a focus for other projects to
> get better, not just Cinder.

To be fair, Tooz is exactly one of those cross-project initiatives you
are describing: it's a generic solution that can be used by all
projects, not just Ceilometer.

> 
> Have people considered Ironic's hash ring solution? The project Akanda
> is now adopting it [1], and I think it might have potential. I'd
> appreciate it if interested parties could have this evaluated before
> the Cinder midcycle sprint next week, to be ready for discussion.
> 

I will have a look at the hash ring solution you mention and see if it
makes sense to use it.
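
For anyone unfamiliar with the idea, the gist of a hash ring is that
each cinder-volume host owns many points on a ring and a volume is
handled by the host owning the next point after the hash of its ID, so
ownership is deterministic without any runtime coordination.  A toy
sketch of the concept (this is not Ironic's actual implementation, and
the host names are invented):

    import bisect
    import hashlib


    class HashRing(object):
        """Toy consistent hash ring mapping volume IDs to hosts."""

        def __init__(self, hosts, partitions_per_host=32):
            self._ring = {}
            self._keys = []
            for host in hosts:
                for i in range(partitions_per_host):
                    key = self._hash('%s-%d' % (host, i))
                    self._ring[key] = host
                    bisect.insort(self._keys, key)

        @staticmethod
        def _hash(value):
            return int(hashlib.md5(value.encode('utf-8')).hexdigest(), 16)

        def get_host(self, volume_id):
            # First ring point at or after the volume's hash, wrapping around.
            idx = bisect.bisect(self._keys, self._hash(volume_id))
            return self._ring[self._keys[idx % len(self._keys)]]


    ring = HashRing(['cinder-volume.host1', 'cinder-volume.host2'])
    print(ring.get_host('7bdf5e4a-1fc9-4b6f-a5a5-3c6cd1b9e8d2'))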

And I would really love to see the HA A-A discussion enabled for remote
people, as some of us are interested in the discussion but won't be able
to attend.  In my case the problem is living in the Old World  :-(

In a way I have to agree with you that sometimes we make Cinder look
more complex than it really is, and in my case the solution I proposed
in the post was way too complex, as has been pointed out.  I just tried
to solve the A-A problem and fix some other issues, like recovering
lost jobs (those waiting for locks), at the same time.

There is an alternative solution I am considering that will be much
simpler and will align with Walter's efforts to remove locks from the
Volume Manager.  I just need to give it a hard think to make sure the
solution has all bases covered.

The main reason why I am suggesting using Tooz and a DLM is because I
think it will allow us to reach Active-Active faster and with less
effort, not because I think it will fix all our problems or that we'll
have to keep using it forever.  It's basically replacing our current
local locks.
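
To make that concrete, replacing our local locks could be as small as
wrapping the manager methods with a Tooz-backed decorator.  This is
only a hypothetical sketch (the decorator, the backend URL and the host
name below are invented, not existing Cinder code), but it shows why I
think the extra effort is small:

    import functools

    from tooz import coordination

    # For Active-Passive this URL could be file://..., for Active-Active it
    # would point to a DLM such as ZooKeeper or Redis.
    _COORDINATOR = coordination.get_coordinator(
        'zookeeper://zookeeper.example.com:2181', b'cinder-volume.host1')
    _COORDINATOR.start()


    def dlm_synchronized(lock_name_tpl):
        """Serialize calls on the same resource across all c-vol nodes."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(self, context, volume_id, *args, **kwargs):
                name = (lock_name_tpl % volume_id).encode('utf-8')
                with _COORDINATOR.get_lock(name):
                    return func(self, context, volume_id, *args, **kwargs)
            return wrapper
        return decorator


    class VolumeManager(object):
        @dlm_synchronized('%s-delete_volume')
        def delete_volume(self, context, volume_id):
            pass  # existing manager logic, now serialized across nodes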

As I see it, the road to HA A-A for Cinder would look like this:

Step 1: Get A-A with Tooz locks and a DLM.  There are other pieces of
the puzzle needed to solve this, but those pieces will carry over to
the final solution.

Step 2: Remove locks from the manager; at this point we'll still be
keeping locks in the drivers.

Step 3: See which drivers can work without locks in Active-Passive
configurations (for example, LVM will still need local file locks to
work, as seen in bug #1460692); in Active-Active configurations there
may be some file-based solutions that require additional locks.

Looking for an alternative to a DLM will require more work and bring
more bugs into the code, and for what?  In the end we are going to get
rid of any additional mechanism in the manager and just use the DB to
return resource-is-busy errors.
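
What I mean by using the DB is an atomic conditional update: we try to
claim the resource by flipping its status in a single statement, and if
no row changes we return a "resource is busy" error.  Roughly something
like this (the column names only loosely follow Cinder's volumes table,
so treat it as illustrative):

    import sqlalchemy as sa

    engine = sa.create_engine('mysql+pymysql://cinder:password@dbhost/cinder')
    metadata = sa.MetaData()
    volumes = sa.Table('volumes', metadata,
                       sa.Column('id', sa.String(36), primary_key=True),
                       sa.Column('status', sa.String(255)))


    def claim_volume(volume_id, expected='available', new_status='deleting'):
        """Atomically move the volume to new_status; False means it's busy."""
        with engine.begin() as conn:
            result = conn.execute(
                volumes.update()
                .where(volumes.c.id == volume_id)
                .where(volumes.c.status == expected)
                .values(status=new_status))
            return result.rowcount == 1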

We know that our current locking mechanism works; let's use that to our
advantage for a little while.

If people still think we should not go with a DLM, I'll write a proposal
that doesn't need one, but it will be more work until we can see an
Active-Active configuration working, and we'll probably still need a DLM
for some drivers.

Cheers,
Gorka.

PS: I have given the solution you proposed the other day a good deal of
thought and I can discuss it now.

> [1] - https://review.openstack.org/#/c/195366/
> 
> --
> Mike Perez
> 


