[Openstack-operators] Openstack HA active/passive vs. active/active

Jay Pipes jaypipes at gmail.com
Tue Nov 26 15:51:33 UTC 2013


On 11/26/2013 07:26 AM, Alvise Dorigo wrote:
> Hello,
> I've read the documentation about Openstack HA
> (http://docs.openstack.org/high-availability-guide/content/index.html)
> and I successfully implemented the active/passive model (with
> corosync/pacemaker) for the two services Keystone and Glance (MySQL HA
> is based on Percona-XtraDB multi-master).
>
> I'd like to know from the experts which model is better (and possibly
> why) for HA, active/passive or active/active, based on your usage
> experience (which is surely longer than mine).

There is no reason to run any OpenStack endpoint -- other than the 
Neutron L3 agent -- in an active/passive configuration, because none of 
the OpenStack endpoints maintains any state. The backend storage 
systems used by those endpoints *do* contain state -- but the endpoint 
services themselves do not.

Simply front each OpenStack endpoint with a DNS name that resolves to a 
virtual IP managed by a load balancer, ensure the load balancer handles 
session persistence, and you're good.
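
As an illustration (not a config from my deployment), a minimal HAProxy 
front-end for Keystone might look like the following; the VIP, hostnames 
and addresses are all made up:

```
# Hypothetical HAProxy fragment fronting two Keystone API nodes.
# Substitute your own virtual IP and backend addresses.
listen keystone-api
    bind 10.0.0.100:5000              # the VIP your DNS name resolves to
    balance roundrobin
    option httpchk GET /
    server keystone1 10.0.0.11:5000 check
    server keystone2 10.0.0.12:5000 check
```

The VIP itself can be kept available with keepalived or Pacemaker, so 
the load balancer does not become a single point of failure.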

For the Neutron L3 agent, you will need a separate strategy, because 
unfortunately the L3 agent is stateful. We use a number of Python 
scripts to handle failover of routes when an agent fails. You can see 
these tools here; we simply run the script as a cron job:

https://github.com/stackforge/cookbook-openstack-network/blob/master/files/default/quantum-ha-tool.py
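
For example, a crontab entry to run it periodically could look like 
this (the flag shown is illustrative -- check the script's --help for 
the actual options it takes):

```
# Hypothetical cron job: check for dead L3 agents and migrate their
# routers every minute, logging the output.
* * * * * /usr/local/bin/quantum-ha-tool.py --l3-agent-migrate >> /var/log/quantum-ha-tool.log 2>&1
```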

My advice would be to continue using Percona XtraDB for your database 
backend (we use the same in a variety of ways, from 
intra-deployment-zone clusters to WAN-replicated clusters). That solves 
your database availability issues, and nicely, we've found PXC to be as 
easy to administer and keep in sync as normal MySQL replication, if not 
easier.

For your message queue, you need to determine a) what level of data loss 
you are comfortable with, and b) whether to use certain OpenStack 
projects' ability to retry multiple MQ hosts in the event of a failure 
(currently Nova, Neutron and Cinder support this but Glance does not, IIRC).
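
For the projects that do support it, the relevant options look 
something like the following sketch of a nova.conf fragment (hostnames 
are examples; check your release's config reference for the exact 
option names):

```
[DEFAULT]
# Hypothetical broker list; the client tries each host in turn on failure.
rabbit_hosts = mq01:5672,mq02:5672,mq03:5672
rabbit_retry_interval = 1   # seconds before the first reconnect attempt
rabbit_retry_backoff = 2    # extra seconds added to the interval per retry
rabbit_max_retries = 0      # 0 means retry forever
```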

We use RabbitMQ clustering and, frankly, have had numerous problems 
with it. It's been our pain point from an HA perspective. There are 
other clustering MQ technologies out there, of course, but one could 
write a whole book just about how crappy the MQ clustering "story" 
is...
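
For completeness, the basic clustering setup is simple enough -- it's 
the failure modes that hurt. A sketch, with made-up node names 
(RabbitMQ 3.x syntax):

```
# On the second node: join the cluster formed by rabbit@node1,
# then mirror all queues across the cluster.
rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@node1
rabbitmqctl start_app
rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all"}'
```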

All the best,
-jay



