[Openstack-operators] Openstack HA active/passive vs. active/active

Alvise Dorigo alvise.dorigo at pd.infn.it
Wed Nov 27 18:02:54 UTC 2013


Hi Jay, thanks a lot for your detailed answer. More comments and questions inline...

On 26 Nov 2013, at 16:51, Jay Pipes <jaypipes at gmail.com> wrote:

> On 11/26/2013 07:26 AM, Alvise Dorigo wrote:
>> Hello,
>> I've read the documentation about Openstack HA
>> (http://docs.openstack.org/high-availability-guide/content/index.html)
>> and I successfully implemented the active/passive model (with
>> corosync/pacemaker) for the two services Keystone and Glance (MySQL HA
>> is based on Percona-XtraDB multi-master).
>> 
>> I'd like to know from the experts, which one is the best (and possibly
>> why) model for HA, between active/passive and active/active, basing on
>> their usage experience (that is for sure longer than mine).
> 
> There is no reason to run any OpenStack endpoint -- other than the Neutron L3 agent -- in an active/passive way. The reason is because none of the OpenStack endpoints maintain any state. The backend storage systems used by those endpoints *do* contain state -- but the endpoint services themselves do not.
> 

So, in principle, I could simply install a cloud controller (with Keystone, Glance, Nova API, Cinder), clone it onto a second machine, and then put HAProxy (made redundant with Keepalived) in front of the pair. (The Neutron L3 agent would be a different story, since for it an active/passive mode is preferable, as you pointed out.)
Does this make sense?
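
To make it concrete, what I have in mind for HAProxy is something along these lines (the VIP, backend IPs and ports are just placeholders for my setup, and only two of the endpoints are shown):

    # /etc/haproxy/haproxy.cfg -- sketch, not a complete configuration
    listen keystone-public
        bind 192.168.1.100:5000          # the Keepalived VIP
        mode http
        balance roundrobin
        option httpchk GET /
        server controller1 192.168.1.11:5000 check inter 2000 rise 2 fall 3
        server controller2 192.168.1.12:5000 check inter 2000 rise 2 fall 3

    listen glance-api
        bind 192.168.1.100:9292          # same VIP, different port
        mode http
        balance roundrobin
        option httpchk GET /
        server controller1 192.168.1.11:9292 check inter 2000 rise 2 fall 3
        server controller2 192.168.1.12:9292 check inter 2000 rise 2 fall 3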

> Simply front each OpenStack endpoint with a DNS name that resolves to a virtual IP managed by a load balancer, ensure that sessions are managed by the load balancer, and you're good.
> 
> For the Neutron L3 agent, you will need a separate strategy, because unfortunately, the L3 agent is stateful. We use a number of Python scripts to handle failover of routes when an agent fails. You can see these tools here, which we simply add as a cron job:
> 
> https://github.com/stackforge/cookbook-openstack-network/blob/master/files/default/quantum-ha-tool.py
> 
> My advice would be to continue using Percona XtraDB for your database backend (we use the same in a variety of ways, from intra-deployment-zone clusters to WAN-replicated clusters). That solves your database availability issues, and nicely, we've found PXC to be as easy or easier to administer and keep in sync than normal MySQL replication.
> 

Definitely. It has proven to be as robust as we expected. In addition, the combination of Percona and HAProxy makes it possible to expand (or replace) nodes without any outage, for example when we need to increase the cluster's capacity (more CPU, more RAM, more disk)… not to mention the round-robin balancing, which comes for free.
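
In practice, replacing a node boils down to draining it at the HAProxy level, something like this (assuming the admin stats socket is enabled in haproxy.cfg; the backend/server names below are just placeholders from our setup):

    # requires: stats socket /var/run/haproxy.sock level admin
    echo "disable server galera/percona2" | socat stdio unix-connect:/var/run/haproxy.sock
    # ...reinstall or resize percona2 and let it rejoin the cluster (SST/IST)...
    echo "enable server galera/percona2"  | socat stdio unix-connect:/var/run/haproxy.sock

The remaining nodes keep serving the OpenStack services in the meantime, so nothing is visible from the outside.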

> For your message queue, you need to determine a) what level of data loss you are comfortable with, and b) whether to use certain OpenStack projects' ability to retry multiple MQ hosts in the event of a failure (currently Nova, Neutron and Cinder support this but Glance does not, IIRC).
> 

What about having one instance of Qpid per node? As far as I know, Qpid is stateless too, isn't it? In my current active/passive cluster I have Qpid running on both nodes, and when I migrate Keystone/Glance from one node to the other I don't notice anything odd. Do you see any drawback with this?
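
Concretely, I mean pointing each service at its local broker, e.g. something like this in nova.conf/cinder.conf (option names as I read them in the Havana-era configuration reference; please correct me if I got them wrong):

    rpc_backend = nova.openstack.common.rpc.impl_qpid   # or the cinder equivalent
    qpid_hostname = 127.0.0.1
    qpid_port = 5672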

Thanks,

	Alvise

> We use RabbitMQ clustering and have had numerous problems with it, frankly. It's been our pain point from an HA perspective. There are other clustering MQ technologies out there, of course. Frankly, one could write a whole book just about how crappy the MQ clustering "story" is...
> 
> All the best,
> -jay
> 
> 
> _______________________________________________
> OpenStack-operators mailing list
> OpenStack-operators at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators



