[Openstack-operators] Openstack HA active/passive vs. active/active
Jay Pipes
jaypipes at gmail.com
Tue Nov 26 15:51:33 UTC 2013
On 11/26/2013 07:26 AM, Alvise Dorigo wrote:
> Hello,
> I've read the documentation about Openstack HA
> (http://docs.openstack.org/high-availability-guide/content/index.html)
> and I successfully implemented the active/passive model (with
> corosync/pacemaker) for the two services Keystone and Glance (MySQL HA
> is based on Percona-XtraDB multi-master).
>
> I'd like to know from the experts which model is best for HA -- active/passive
> or active/active -- and why, based on your usage experience (which is surely
> longer than mine).
There is no reason to run any OpenStack endpoint -- other than the
Neutron L3 agent -- in an active/passive way, because none of the
OpenStack endpoints maintains any state. The backend storage
systems used by those endpoints *do* contain state -- but the endpoint
services themselves do not.
Simply front each OpenStack endpoint with a DNS name that resolves to a
virtual IP managed by a load balancer, ensure that sessions are managed
by the load balancer, and you're good.
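As a sketch of what that looks like in practice -- assuming HAProxy as the
load balancer, with hypothetical IPs and a pair of Keystone API backends --
something like:

```
# Illustrative haproxy.cfg fragment: a VIP fronting two Keystone API nodes.
# All addresses are hypothetical; the VIP itself would be managed by
# keepalived or pacemaker on the load-balancer pair.
listen keystone_api
    bind 192.168.0.10:5000
    balance source          # sticky by source IP so sessions stay on one backend
    server keystone1 192.168.0.11:5000 check inter 2000 rise 2 fall 3
    server keystone2 192.168.0.12:5000 check inter 2000 rise 2 fall 3
```

Repeat the same pattern for each API endpoint (glance-api, nova-api, etc.),
and point the DNS name for each service at the VIP.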
For the Neutron L3 agent, you will need a separate strategy, because
unfortunately, the L3 agent is stateful. We use a number of Python
scripts to handle failover of routes when an agent fails. You can see
these tools here, which we simply add as a cron job:
https://github.com/stackforge/cookbook-openstack-network/blob/master/files/default/quantum-ha-tool.py
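For instance, a crontab entry along these lines (the path and flags here are
purely illustrative -- check the script's actual options before using it):

```
# Illustrative crontab entry: periodically check for dead L3 agents and
# reschedule their routers onto live agents. Path and flag are hypothetical.
* * * * * /usr/local/bin/quantum-ha-tool.py --l3-agent-migrate >> /var/log/quantum-ha-tool.log 2>&1
```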
My advice would be to continue using Percona XtraDB Cluster for your database
backend (we use the same in a variety of ways, from
intra-deployment-zone clusters to WAN-replicated clusters). That solves
your database availability issues, and, nicely, we've found PXC to be as
easy as -- or easier than -- normal MySQL replication to administer and keep in sync.
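For reference, the core of a PXC node's setup is just a handful of wsrep
options in my.cnf; a minimal sketch, with hypothetical node addresses:

```
# Illustrative my.cnf fragment for a three-node PXC cluster.
# Addresses and cluster name are hypothetical.
[mysqld]
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_cluster_address=gcomm://192.168.0.21,192.168.0.22,192.168.0.23
wsrep_cluster_name=openstack_db
wsrep_sst_method=xtrabackup
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
```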
For your message queue, you need to determine a) what level of data loss
you are comfortable with, and b) whether to use certain OpenStack
projects' ability to retry multiple MQ hosts in the event of a failure
(currently Nova, Neutron and Cinder support this but Glance does not, IIRC).
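For the projects that do support it, the multi-host retry behavior is driven
by a few options in each service's config file; a sketch for nova.conf, with
option names as I recall them from the kombu driver of that era -- verify
against your release:

```
# Illustrative nova.conf fragment: list several RabbitMQ brokers so the
# service can fail over between them. Addresses are hypothetical.
[DEFAULT]
rabbit_hosts=192.168.0.31:5672,192.168.0.32:5672
rabbit_retry_interval=1
rabbit_retry_backoff=2
rabbit_ha_queues=true
```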
We use RabbitMQ clustering and have had numerous problems with it,
frankly. It's been our biggest pain point from an HA perspective. There are
other clustering MQ technologies out there, of course. One could write a
whole book just about how crappy the MQ clustering "story" is...
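For what it's worth, the clustering and queue mirroring themselves are set up
with rabbitmqctl, roughly like this (node names are hypothetical; with
RabbitMQ 3.x, mirroring is controlled by a policy rather than per-queue
declarations):

```
# Illustrative: join node2 to node1's cluster, then mirror all queues.
rabbitmqctl -n rabbit@node2 stop_app
rabbitmqctl -n rabbit@node2 join_cluster rabbit@node1
rabbitmqctl -n rabbit@node2 start_app
rabbitmqctl set_policy ha-all "" '{"ha-mode":"all"}'
```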
All the best,
-jay