[openstack-dev] [Cinder] A possible solution for HA Active-Active

Clint Byrum clint at fewbar.com
Wed Aug 5 15:00:24 UTC 2015


Excerpts from Philipp Marek's message of 2015-08-05 00:10:30 -0700:
> 
> > >Well, is it already decided that Pacemaker would be chosen to provide HA in
> > >OpenStack? There's been a talk "Pacemaker: the PID 1 of OpenStack", IIRC.
> > >
> > >I know that Pacemaker's been pushed aside in an earlier ML post, but IMO
> > >so much has already been done for HA in Pacemaker that OpenStack should
> > >just use it.
> > >
> > >All HA nodes need to participate in a Pacemaker cluster - and if one node
> > >loses connection, all its services will get stopped automatically (by
> > >Pacemaker) - or the node gets fenced.
> > >
> > >
> > >No need to invent some sloppy scripts to do exactly the tasks (badly!) that
> > >the Linux HA Stack has been providing for quite a few years.
> > So just a piece of information, but Yahoo (the company I work for, with VMs
> > in the tens of thousands, bare metal in much more than that...) hasn't
> > used Pacemaker, and in all honesty this is the first project (OpenStack)
> > I have heard of that needs such a solution. I feel that we really should
> > be building our services better so that they can be A-A, rather than
> > depending on another piece of software to get around our 'sloppiness'
> > (for lack of a better word).
> > 
> > Nothing against Pacemaker personally... IMHO it just doesn't feel like we
> > are doing this right if we need such a product in the first place.
> Well, Pacemaker is *the* Linux HA Stack.
> 

I'm not sure it's wise to claim the definite article for anything in
Open Source. :)

That said, it's certainly the most mature and most widely accepted.

> So, before trying to achieve similar goals by self-written scripts (and 
> having to re-discover all the gotchas involved), it would be much better to 
> learn from previous experiences - even if they are not one's own.
> 
> Pacemaker has e.g. the concept of clones[1] - these define services that run 
> multiple instances within a cluster. And behold: the instances get a 
> Pacemaker-internal unique id[2], which can be used to do sharding.
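As a concrete illustration of the sharding idea: for cloned resources,
Pacemaker exports the clone instance number and the clone count into the
resource agent's environment as OCF_RESKEY_CRM_meta_clone and
OCF_RESKEY_CRM_meta_clone_max, and a service can hash its work items onto
those. A minimal Python sketch - the owns() helper and the volume-id
keyspace are invented here purely for illustration:

    import hashlib
    import os

    # Set by Pacemaker in the environment of cloned resource agents:
    CLONE_ID = int(os.environ["OCF_RESKEY_CRM_meta_clone"])       # this instance, 0-based
    CLONE_MAX = int(os.environ["OCF_RESKEY_CRM_meta_clone_max"])  # total instances

    def owns(volume_id: str) -> bool:
        """True if this clone instance is responsible for volume_id."""
        digest = hashlib.md5(volume_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % CLONE_MAX == CLONE_ID

With globally-unique clones, each numbered instance is a distinct entity
that Pacemaker will restart on another node after a failure, so the shard
mapping survives the failover.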
> 
> 
> Yes, that still means that upon a service or node crash the failed instance 
> has to be started on some other node; but as the target node will typically 
> already be up and running, the startup time should be in the range of seconds.
> 
> 
> We'd instantly get
>  * a supervisor to start/stop/restart/fence/monitor the service(s) - see
>    the sketch below this list
>  * node/service failure detection
>  * only small changes needed in the services
>  * and all that in tested software that's available in all distributions,
>    and that already has its own test suite...
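To make the supervision point concrete: Pacemaker drives a service through
the OCF resource agent contract - it invokes the agent with an action name
and interprets the exit code. A minimal Python sketch of that contract;
the service_* helpers are placeholders, and a real agent would also have
to implement the meta-data and validate-all actions:

    #!/usr/bin/env python3
    import sys

    # Standard OCF exit codes that Pacemaker interprets:
    OCF_SUCCESS, OCF_ERR_GENERIC, OCF_NOT_RUNNING = 0, 1, 7

    def service_start():
        pass  # placeholder: start the managed service

    def service_stop():
        pass  # placeholder: stop the managed service

    def service_is_running() -> bool:
        return True  # placeholder: a real health check goes here

    def main(action: str) -> int:
        if action == "start":
            service_start()
            return OCF_SUCCESS
        if action == "stop":
            service_stop()
            return OCF_SUCCESS
        if action == "monitor":
            # Pacemaker calls this periodically; a non-success exit
            # code is what triggers its restart/failover handling.
            return OCF_SUCCESS if service_is_running() else OCF_NOT_RUNNING
        return OCF_ERR_GENERIC

    if __name__ == "__main__":
        sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else "monitor"))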
> 
> 
> If we decide that this solution won't fulfill all our expectations, fine -
> let's use something else.
> 
> But I don't think it makes *any* sense to try to redo existing 
> High-Availability code in some quickly written scripts, just because it 
> looks easy - there are quite a few traps for the unwary.
> 

I think Pacemaker's dev team agrees with you, and also doesn't want to get
in the way of that with any half-baked solution. They give you all the
CLI tools and filesystem layouts to make this work perfectly. It would
be nice to even ship the Pacemaker resources in a contrib directory and
run tests in the gate on them. But if users have some reason not to use
it, they shouldn't be forced to use it.


