[openstack-dev] [Fuel][HA][AMQP] RabbitMQ failover and downtime fixes

Bogdan Dobrelya bdobrelia at mirantis.com
Mon May 25 11:55:18 UTC 2015


Folks,

JFYI: Several major RabbitMQ HA failover related bugs were fixed
in the Fuel 6.1 release scope. Short story:
1) the AMQP cluster failover time was dramatically shortened from ~350
to ~220 seconds on average.
2) a full cluster downtime is *no longer* expected while the
failover is in progress.
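
To put the first point in perspective, here is a rough back-of-the-envelope
calculation. It only uses the two approximate averages quoted above; the
variable names are illustrative and not taken from any Fuel code.

```python
# Approximate average failover times quoted above, in seconds.
before_s = 350
after_s = 220

# Absolute and relative improvement per failover.
saved_s = before_s - after_s
reduction_pct = 100 * saved_s / before_s

print(f"saved ~{saved_s} s per failover (~{reduction_pct:.0f}% shorter)")
```

Since both inputs are rough averages, the result should be read as an
estimate as well.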

These fixes are about to be backported to the 5.1.x/6.0.x
milestones as well.

Long story:
* RabbitMQ fence daemon startup bug [0]. W/o this daemon running, the
rabbit node failover time was *significantly* higher.
* Fix for the full RabbitMQ cluster downtime issue [1] during failover
of the master of the multistate pacemaker resource. W/o this fix, all
of the rabbit nodes would have been kept down until the failover finished.
* Decreased mnesia_table_loading_timeout to 10 seconds [2]. This makes
the failover a bit faster.
* Incomplete mnesia files removal [3]. W/o this fix, the rabbit app may
sometimes fail to start.
* Some other fixes in the OCF logic for demote/stop/promote actions [4]
(ready for review, testing in progress). W/o these fixes, the failover
time was much longer than it should be and sometimes it could even fail
and require manual steps (restarting the RabbitMQ cluster resource in
pacemaker) to finish.

Also, several fixes related to the bug [5] were merged: [6], [7],
but one issue in the OCF script design still persists. Namely, a node
might sometimes miss its join event, and the OCF monitor action might
not detect this, because the RabbitMQ pacemaker resource agent keeps
the rabbit app stopped unless it is really safe to start it.
Hence, the monitor/start/promote actions must be drastically redesigned
in order to fix this. The issue does not happen very often: for
example, in the long-run failover test I've been running for a while,
it may appear at the 23rd iteration, and it looks completely random.

Note, no additional troubleshooting steps need to be described in
the ops documentation, as the related patch [8] covers this case
as well. However, these changes do require an update to the RabbitMQ
clustering flow charts [9] (in progress).

[0] https://launchpad.net/bugs/1456791
[1] https://bugs.launchpad.net/fuel/+bug/1436812
[2] https://review.openstack.org/184671
[3] https://bugs.launchpad.net/fuel/+bug/1457766
[4] https://review.openstack.org/185044
[5] https://bugs.launchpad.net/fuel/+bug/1455761
[6] https://review.openstack.org/184911
[7] https://review.openstack.org/184671
[8] https://review.openstack.org/184014
[9] http://goo.gl/PPNrw7

-- 
Best regards,
Bogdan Dobrelya,
Skype #bogdando_at_yahoo.com
Irc #bogdando




