[openstack-dev] [TripleO] Strategy for recovering crashed nodes in the Overcloud?

Charles Crouch ccrouch at redhat.com
Tue Jul 22 17:14:29 UTC 2014



----- Original Message -----
> Hi,
> 
> I'm running an HA overcloud configuration and, as far as I'm aware, there is
> currently no mechanism in place for restarting failed nodes in the cluster.
> Originally, I had been wondering if we would use a corosync/pacemaker
> cluster across the control plane with STONITH resources configured for each
> node (a STONITH plugin for Ironic could be written). 

I know some people are starting to look at how to use pacemaker for fencing/
recovery with TripleO, but I'm not aware of any proposals yet.
I'm sure that as soon as one is published, it will hit this list.

>This might be fine if a
> corosync/pacemaker stack is already being used for HA of some components,
> but it seems overkill otherwise. 

There is a pending patch that adds support for using pacemaker to manage
active/passive (A/P) services: https://review.openstack.org/#/c/105397/
I'd expect more patches like this in the future.

>The undercloud heat could be in a good
> position to restart the overcloud nodes -- is that the plan or are there
> other options being considered?
> 
> Thanks,
> Tom
