[openstack-dev] blueprint proposal nova-compute fencing for HA ?
Leen Besselink
ubuntu at consolejunkie.net
Tue Apr 23 22:52:20 UTC 2013
On Tue, Apr 23, 2013 at 12:16:36PM -0700, Devananda van der Veen wrote:
> Not to side track the conversation too much, but some of our mid- and
> long-term goals for TripleO align closely with this discussion.
>
> Specifically, we're working towards having Heat deploy HA pairs for all the
> important service bits inside an OpenStack cloud, and then using Heat to
> orchestrate no-downtime upgrades of those HA pairs. I believe folks are
> already working on this, but I don't think it's directly relevant to this
> discussion.
>
> At some point, TripleO will need a mechanism for no-downtime upgrades for
> nova-compute nodes as well. Whether that is an in-place restart, evacuate,
> or (ideally) live-migrate, Heat is going to need to drive it, which means
> that it needs to be manageable via the API. The same mechanism could
> presumably be tied into monitoring, trigger an evacuation if the compute
> host was down for a certain length of time, and presumably also coordinate
> with Heat not to start autoscaling at the same time. This would need to
> avoid split-brain situations and the like...
>
Devananda, that is definitely related in certain ways.
I'm also wondering: when using Heat for TripleO, could Heat also be used to handle failures?
Someone created another related blueprint less than a day ago:
https://blueprints.launchpad.net/nova/+spec/evacuate-instance-automatically
OK, so I think we've established there is some kind of interest in this. :-)
I've just created an EtherPad where I'm going to try to collect some requirements and other information:
https://etherpad.openstack.org/openstack-instance-high-availability
Please do help if this has your interest.
I've also added some ideas I had while writing down the requirements.
Hope this is helpful as a starting point.
I would also love to know if certain things on there are just plain wrong/stupid.
Have a nice day,
Leen.
> Regards,
> Devananda
>
> (*)
> https://github.com/tripleo/incubator/blob/master/README.md#what-is-tripleo
>
>
> On Tuesday, April 23, 2013, Leen Besselink wrote:
>
> > On Tue, Apr 23, 2013 at 10:08:19AM -0400, Russell Bryant wrote:
> > > On 04/23/2013 03:31 AM, Leen Besselink wrote:
> > > >> I was only talking about the fencing off a compute node part, since
> > > >> that's what you started the thread with. :-)
> > > >
> > > > I know I'm going in circles, just trying to get a feel for the best
> > > > way to handle it.
> > > >
> > > >>
> > > >> Presumably you would still use nova APIs that already exist to move the
> > > >> instances elsewhere. An 'evacuate' API went into Grizzly for this.
> > > >>
> > > >> https://blueprints.launchpad.net/nova/+spec/rebuild-for-ha
> > > >>
> > > >
> > > > So when any node fails in a Pacemaker cluster, you fence the node, tell
> > > > OpenStack about the failed node and call evacuate for all the instances.
> > > > The scheduler will just place them anywhere it pleases.
> > > >
> > > > (there is already a blueprint for evacuate to call the scheduler and
> > > > even another for handling a whole node)
> > > >
> > > > So, yeah, maybe that is enough.
> > > >
> > > > I guess I was hoping all machines would be the same. Now I'll need to
> > > > make clusters. To OpenStack they will still all look the same, I guess.
> > > >
> > > > But it will work with existing tested code, that is also important.
> > >
> > > Yeah, it's not really ideal, but it's something that works with existing
> > > tools. I thought of a pretty big hole here, though. We want to
> > > restrict what a compute node can do as much as possible for security
> > > reasons. Clustering them together and allowing them to communicate back
> > > to the nova API to perform administrative functions (evacuating
> > > instances) is extremely contrary to that goal.
> > >
> > > In any case, I think the usage of fence-agents is good, but it should be
> > > something outside of the existing OpenStack components that uses it.
> > > Compute nodes need to be monitored, but they need to be restricted from
> > > having any administrative capabilities for security reasons.
> > >
> > > What should perform the monitoring, fencing, evacuating, and what not is
> > > a bit of a question mark. I think as a community we are seriously
> > > lacking good open source cloud infrastructure management tools.
> > > Companies are developing their own for private use, or as proprietary
> > > value adds, but we need some open solutions here.
> > >
> >
> > As Alex Glikson suggested, the ZooKeeper code in Nova could be used, which
> > centralizes this.
> >
> > So security-wise that could be a good start: you don't need anything on the
> > node itself. It is also simpler, as you don't need to create clusters.
> >
> > If OpenStack had a central service which handles fencing then that might
> > be enough.
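(To make that a bit more concrete, here is a rough sketch of how such a central service could notice dead compute nodes through ZooKeeper ephemeral znodes. It uses the kazoo client library; the znode path and the reaction are made up for illustration, this is not Nova's actual ServiceGroup driver.)

  # Rough sketch: watch for compute nodes whose ephemeral znode disappears.
  # Assumes each live nova-compute holds an ephemeral znode under
  # /nova/compute (made-up path); uses the kazoo ZooKeeper client.
  from kazoo.client import KazooClient

  zk = KazooClient(hosts="127.0.0.1:2181")
  zk.start()

  SERVICES_PATH = "/nova/compute"
  known = set(zk.get_children(SERVICES_PATH))

  @zk.ChildrenWatch(SERVICES_PATH)
  def on_change(children):
      global known
      gone = known - set(children)  # ephemeral znodes vanish when a node dies
      known = set(children)
      for host in gone:
          # in practice: trigger the fencing flow sketched further down
          print("compute node %s looks dead, fencing it" % host)

The point being: the watcher runs centrally, so nothing extra is needed on the compute node itself.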
> >
> > A simple fencing implementation would be for the service to just send an
> > IPMI poweroff request to the node.
> >
> > It then sends another IPMI request to check that the current power state
> > is off, marks the node as down and then calls the scheduler to start the
> > instances somewhere else (evacuate).
> >
> > Or am I missing something ?
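For what it's worth, here is a rough sketch of that fence-and-evacuate sequence, shelling out to ipmitool and using python-novaclient. All host names, BMC addresses and credentials below are placeholders, and note that the Grizzly evacuate API still wants an explicit target host, which is what the scheduler blueprint mentioned above would fix.

  # Rough sketch of the fence-and-evacuate sequence described above.
  # All host names, BMC addresses and credentials are placeholders.
  import subprocess
  import time

  from novaclient.v1_1 import client as nova_client

  def ipmi(bmc_addr, *command):
      # Shell out to ipmitool for the node's BMC (credentials hardcoded
      # here only to keep the sketch short).
      return subprocess.check_output(
          ["ipmitool", "-I", "lanplus", "-H", bmc_addr,
           "-U", "admin", "-P", "secret"] + list(command))

  def fence_and_evacuate(host, bmc_addr, nova, target_host):
      # 1. Fence: power the failed compute node off via IPMI.
      ipmi(bmc_addr, "power", "off")

      # 2. Verify the power state really is off before touching instances.
      for _ in range(10):
          if "off" in ipmi(bmc_addr, "power", "status").lower():
              break
          time.sleep(3)
      else:
          raise RuntimeError("could not confirm %s is powered off" % host)

      # 3. Disable the service so the scheduler stops using the host.
      nova.services.disable(host, "nova-compute")

      # 4. Evacuate every instance that was on the fenced host (needs
      #    admin credentials). The Grizzly API wants an explicit target
      #    host; a scheduler-chosen target is what the blueprint above
      #    would add.
      for server in nova.servers.list(
              search_opts={"host": host, "all_tenants": 1}):
          nova.servers.evacuate(server, target_host, on_shared_storage=True)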
> >
> > > --
> > > Russell Bryant
> > >
> > > _______________________________________________
> > > OpenStack-dev mailing list
> > > OpenStack-dev at lists.openstack.org
> > > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> >
> > _______________________________________________
> > OpenStack-dev mailing list
> > OpenStack-dev at lists.openstack.org
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> >