[openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

Bogdan Dobrelya bdobrelia at mirantis.com
Tue Apr 28 13:15:46 UTC 2015


> Hello,

Hello, Zhou

>
> I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long after
> a power failure. I have a running HA environment, then I reset the power of
> all the machines at the same time. I observe that after reboot it
> usually takes 10 minutes for the RabbitMQ cluster to appear running in
> master-slave mode in Pacemaker. If I power off all the 3 controllers and
> only start 2 of them, the downtime can sometimes be as long as 20 minutes.

Yes, this is a known issue [0]. Note that many bugfixes, like
[1], [2], [3], have been merged for the MQ OCF script, so you may want to
backport them as well, following the guide [4].

[0] https://bugs.launchpad.net/fuel/+bug/1432603
[1] https://review.openstack.org/#/c/175460/
[2] https://review.openstack.org/#/c/175457/
[3] https://review.openstack.org/#/c/175371/
[4] https://review.openstack.org/#/c/170476/

>
> I did a little investigation and found some possible causes.
>
> 1. MySQL Recovery Takes Too Long [1] and Blocking RabbitMQ Clustering in
> Pacemaker
>
> The Pacemaker resource p_mysql start timeout is set to 475s. Sometimes
> MySQL-wss fails to start after a power failure, and Pacemaker waits
> 475s before retrying to start it. The problem is that Pacemaker divides
> resource state transitions into batches. Since RabbitMQ is a master-slave
> resource, I assume that starting all the slaves and promoting the master are
> put into two different batches. If, unfortunately, starting all the RabbitMQ
> slaves is put in the same batch as starting MySQL, then even if the RabbitMQ
> slaves and all other resources are ready, Pacemaker will not continue
> but just waits for the MySQL timeout.

Could you please elaborate on what the same/different batches are for MQ
and DB? Note that there are MQ clustering logic flow charts available here
[5], and we're planning to release a dedicated technical bulletin for this.

[5] http://goo.gl/PPNrw7

>
> I can reproduce this by hard powering off all the controllers and starting
> them again. It's more likely to trigger a MySQL failure this way. Then
> I observe that if there is one cloned MySQL instance not starting, the
> whole Pacemaker cluster gets stuck and does not emit any logs. On the
> host of the failed instance, I can see a MySQL resource agent process
> calling the sleep command. If I kill that process, Pacemaker comes
> back alive and the RabbitMQ master gets promoted. In fact this long timeout
> blocks every resource's state transitions in Pacemaker.
>
> This may be a known problem of Pacemaker, and there are some discussions
> on the Linux-HA mailing list [2]. It might not be fixed in the near future.
> It seems that, in general, it's bad to have long timeouts in state transition
> actions (start/stop/promote/demote). There may be another way to
> implement the MySQL-wss resource agent: use a short start timeout and
> monitor the wss cluster state in the monitor action.

This is very interesting, thank you! I believe all commands in the MySQL RA
OCF script should also be wrapped with timeout -SIGTERM or -SIGKILL,
as we did for the MQ RA OCF, and there should not be any sleep calls. I
created a bug for this [6].

[6] https://bugs.launchpad.net/fuel/+bug/1449542
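
To illustrate the idea, here is a minimal sketch of such a wrapper (the
helper name and the timeout values are only illustrative, not the exact
code of the MQ RA; it assumes GNU coreutils timeout):

# Run a potentially hanging command with a hard deadline instead of sleeping:
# send SIGTERM after the given number of seconds, then SIGKILL 10s later.
run_with_timeout() {
    local deadline="$1"; shift
    timeout --signal=TERM --kill-after=10 "${deadline}" "$@"
}

# e.g. instead of a bare, potentially blocking call in the RA:
run_with_timeout 30 mysql --defaults-file=/etc/mysql/debian.cnf -e 'SHOW STATUS;'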

>
> I also found a fix to improve the MySQL start timeout [3]. It shortens the
> timeout to 300s. At the time of sending this email, I cannot find it in the
> stable/6.0 branch. Maybe the maintainer needs to cherry-pick it to
> stable/6.0?
>
> [1] https://bugs.launchpad.net/fuel/+bug/1441885
> [2] http://lists.linux-ha.org/pipermail/linux-ha/2014-March/047989.html
> [3] https://review.openstack.org/#/c/171333/
>
>
> 2. RabbitMQ Resource Agent Breaks Existing Cluster
>
> Reading the code of the RabbitMQ resource agent, I find it does the
> following to start the RabbitMQ master-slave cluster.
> On all the controllers:
> (1) Start Erlang beam process
> (2) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
> (3) Stop RabbitMQ App but do not stop the beam process
>
> Then in Pacemaker, all the RabbitMQ instances are in the slave state. After
> Pacemaker determines the master, it does the following.
> On the to-be-master host:
> (4) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
> On the slave hosts:
> (5) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
> (6) Join RabbitMQ cluster of the master host
>

Yes, something like that. As I mentioned, there were several bug fixes
in 6.1 dev, and you can also check the MQ clustering flow charts.
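
For reference, in Pacemaker terms the MQ resource is a multi-state
(master/slave) clone; a simplified crm shell sketch follows (the RA
provider name, port and timeouts here are illustrative and may differ
from what a deployed 6.x environment actually has):

crm configure primitive p_rabbitmq-server ocf:fuel:rabbitmq-server \
    params node_port=5673 \
    op monitor interval=30 timeout=60 \
    op start interval=0 timeout=120 op stop interval=0 timeout=60
crm configure ms master_p_rabbitmq-server p_rabbitmq-server \
    meta notify=true master-max=1 master-node-max=1 ordered=false interleave=false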

> As far as I can understand, this process is to make sure the master
> determined by Pacemaker is the same as the master determined in the RabbitMQ
> cluster. If there is no existing cluster, it's fine. If it is run after

Not exactly. There is no master in a mirrored MQ cluster. We define the
rabbit_hosts configuration option for oslo.messaging, which ensures all
queue masters will be spread across all of the MQ nodes in the long run. And
we use a master abstraction only for the Pacemaker RA clustering layer.
Here, a "master" is the MQ node that the rest of the MQ nodes join.
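
For example, the relevant oslo.messaging bits on the OpenStack services'
side look roughly like this (the addresses and the port are illustrative):

[DEFAULT]
rabbit_hosts = 192.168.0.2:5673,192.168.0.3:5673,192.168.0.4:5673
rabbit_ha_queues = True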

> a power failure and recovery, it introduces a new problem.

We do erase the node's master attribute in the CIB for such cases. This should
not bring problems into the master election logic.
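
Roughly, something like the following is done on cleanup (an illustration
only; the exact attribute name and lifetime are in the RA code):

# drop the per-node "master" bookkeeping attribute from the CIB for node-1
crm_attribute --node node-1 --name rabbit-master --delete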

>
> After power recovery, if some of the RabbitMQ instances reach step (2)
> roughly at the same time (within 30s, which is hard-coded in RabbitMQ) as
> the original RabbitMQ master instance, they form the original cluster
> again and then shut down. The other instances have to wait for 30s
> before they report a failure waiting for tables, and get reset to
> standalone clusters.
>
> In the RabbitMQ documentation [4], it is also mentioned that if we shut down
> the RabbitMQ master, a new master is elected from the rest of the slaves. If we

(Note, the RabbitMQ documentation talks about *queue* masters and slaves,
which are not the same thing as the "master" in the Pacemaker RA clustering
abstraction layer.)

> continue to shut down nodes in step (3), we reach a point where the last
> node is the RabbitMQ master, and Pacemaker is not aware of it. I can see
> there is code bookkeeping a "rabbit-start-time" attribute in
> Pacemaker to record the longest-lived instance and help Pacemaker
> determine the master, but it does not cover the case mentioned above.

We made an assumption that the node with the highest MQ uptime should
know the most about the recent cluster state, so the other nodes must join it.
The RA OCF does not work with queue masters directly.
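
A rough illustration of that idea (heavily simplified; it assumes the
bookkeeping attribute holds the MQ uptime reported by each node):

# pick the node with the largest recorded MQ uptime as the join target
best_node=""; best_uptime=0
for node in node-1 node-2 node-3; do
    uptime=$(crm_attribute --quiet --node "$node" \
             --name rabbit-start-time --query 2>/dev/null || echo 0)
    if [ "${uptime:-0}" -gt "$best_uptime" ]; then
        best_uptime="$uptime"; best_node="$node"
    fi
done
echo "join target with the highest MQ uptime: ${best_node}"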

> A recent patch [5] checks the existing "rabbit-master" attribute, but it
> does not cover the above case either.
>
> So in step (4), Pacemaker determines a different master, which was a
> RabbitMQ slave last time. It waits for its original RabbitMQ master
> for 30s, fails, and then gets reset to a standalone cluster. Here we
> end up with several different clusters, so in steps (5) and (6), it is likely
> to report errors in the log saying "timeout waiting for tables" or fail to
> merge the mnesia database schema, and then those instances get reset. You can
> easily reproduce the case by hard-resetting the power of all the controllers.
>
> As you can see, if you are unlucky, there will be several rounds of "30s
> timeout and reset" before you finally get a healthy RabbitMQ cluster.

The full MQ cluster reassemble logic is far from perfect, indeed.
It might erase all mnesia files, hence any custom entities,
like users or vhosts, would be removed as well. Note that we do not
configure durable queues for OpenStack, so there is nothing to worry about
here - a full cluster downtime assumes there will be no AMQP messages
stored at all.
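
(For reference, durability is driven from the OpenStack side by the
oslo.messaging amqp_durable_queues option, which defaults to False:)

[DEFAULT]
# queues and their messages are not persisted to disk, so after a full MQ
# cluster restart there is nothing to recover
amqp_durable_queues = False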

>
> I find three possible solutions.
> A. Using the rabbitmqctl force_boot option [6]
> It skips the 30s wait and the cluster reset, and just assumes the
> current node is the master and continues to operate. This is feasible
> because the original RabbitMQ master discards its local state and
> syncs with the new master after it joins a new cluster [7]. So we can be
> sure that after steps (4) and (6), the Pacemaker-determined master
> instance is started unconditionally, it will be the same as the RabbitMQ
> master, and all operations run without the 30s timeout. I find this option
> is only available in newer RabbitMQ releases, and updating RabbitMQ might
> introduce other compatibility problems.

Yes, this option is only supported by the newest RabbitMQ versions. But we
should definitely look at how it could help.
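
A rough sketch of how it would be used, assuming a RabbitMQ version that
ships force_boot:

# on the node chosen to come up first, while rabbitmq-server is stopped:
rabbitmqctl force_boot        # boot unconditionally, skip the 30s table wait
service rabbitmq-server start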

>
> B. Turn RabbitMQ into a cloned instance and use pause_minority instead of
> autoheal [8]

Indeed, there are cases when MQ's autoheal can do nothing with existing
partitions and the cluster remains partitioned forever, for example:

Masters: [ node-1 ]
Slaves: [ node-2 node-3 ]
root@node-1:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-1' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2']}]},
{running_nodes,['rabbit@node-1']},
{cluster_name,<<"rabbit@node-2">>},
{partitions,[]}]
...done.
root@node-2:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-2' ...
[{nodes,[{disc,['rabbit@node-2']}]}]
...done.
root@node-3:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-3' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},
{running_nodes,['rabbit@node-3']},
{cluster_name,<<"rabbit@node-2">>},
{partitions,[]}]

So we should test the pause_minority value as well.
But I strongly believe we should make MQ a multi-state clone to support
many masters; see the related blueprint [7].

[7] https://blueprints.launchpad.net/fuel/+spec/rabbitmq-pacemaker-multimaster-clone
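
For reference, switching the partition handling mode itself is a one-line
change in rabbitmq.config (fragment only; the rest of the rabbit
application settings are omitted):

[
  {rabbit, [
    {cluster_partition_handling, pause_minority}
  ]}
].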

> This works like MySQL-wss. It lets the RabbitMQ cluster itself deal with
> partitions in a manner similar to the Pacemaker quorum mechanism. When there
> is a network partition, instances in the minority partition pause
> themselves automatically. Pacemaker does not have to track who the
> RabbitMQ master is, who lives longest, who to promote... It just starts all
> the clones, done. This leads to a huge change in the RabbitMQ resource agent,
> and the stability and other impacts are yet to be tested.

Well, we should not mix up queue masters and the multi-state clone master of
the MQ resource in Pacemaker.
As I said, the Pacemaker RA has nothing to do with queue masters. And we
introduced this "master" mostly in order to support the full cluster
reassemble case - there must be a node promoted, and the other nodes should
join it.

>
> C. Creating a "force_load" file
> After reading the RabbitMQ source code, I find that what solution A actually
> does is just create an empty file named "force_load" in the mnesia database
> dir; mnesia then thinks this node was the last one shut down last time and
> boots itself as the master. This implementation has stayed the same from
> v3.1.4 to the latest RabbitMQ master branch. I think we can make use of this
> little trick. The change is adding just one line in the
> "try_to_start_rmq_app()" function.
>
> touch "${MNESIA_FILES}/force_load" && \
>   chown rabbitmq:rabbitmq "${MNESIA_FILES}/force_load"

This is a very good point, thank you.
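
For the record, a minimal sketch of how that could look inside the RA (the
function body is heavily simplified; only the added lines matter):

try_to_start_rmq_app() {
    # ... existing pre-start steps of the RA omitted ...

    # Assumption: an empty force_load file makes mnesia treat this node as
    # the last one shut down, so it boots without the 30s wait for tables.
    touch "${MNESIA_FILES}/force_load" && \
        chown rabbitmq:rabbitmq "${MNESIA_FILES}/force_load"

    # ... then start the RabbitMQ application as before ...
}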

>
> [4] http://www.rabbitmq.com/ha.html
> [5] https://review.openstack.org/#/c/169291/
> [6] https://www.rabbitmq.com/clustering.html
> [7] http://www.rabbitmq.com/partitions.html#recovering
> [8] http://www.rabbitmq.com/partitions.html#automatic-handling
>
> Maybe you have better ideas on this. Please share your thoughts.

Thank you for the thorough feedback! This was really great work.

>
> ----
> Best wishes!
> Zhou Zheng Sheng / ???  Software Engineer
> Beijing AWcloud Software Co., Ltd.
>


-- 
Best regards,
Bogdan Dobrelya,
Skype #bogdando_at_yahoo.com
Irc #bogdando


