<div dir="ltr"><br><br><div class="gmail_quote">On Thu, May 7, 2015 at 5:01 PM Andrew Beekhof <<a href="mailto:abeekhof@redhat.com">abeekhof@redhat.com</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

> On 5 May 2015, at 1:19 pm, Zhou Zheng Sheng / 周征晟 <<a href="mailto:zhengsheng@awcloud.com" target="_blank">zhengsheng@awcloud.com</a>> wrote:<br>

><br>

> Thank you Andrew.<br>

><br>

> on 2015/05/05 08:03, Andrew Beekhof wrote:<br>

>>> On 28 Apr 2015, at 11:15 pm, Bogdan Dobrelya <<a href="mailto:bdobrelia@mirantis.com" target="_blank">bdobrelia@mirantis.com</a>> wrote:<br>

>>><br>

>>>> Hello,<br>

>>> Hello, Zhou<br>

>>><br>

>>>> I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long after<br>

>>>> power failure. I have a running HA environment, then I reset power of<br>

>>>> all the machines at the same time. I observe that after reboot it<br>

>>>> usually takes 10 minutes for the RabbitMQ cluster to appear running in<br>

>>>> master-slave mode in pacemaker. If I power off all the 3 controllers and<br>

>>>> only start 2 of them, the downtime sometimes can be as long as 20 minutes.<br>

>>> Yes, this is a known issue [0]. Note, there were many bugfixes, like<br>

>>> [1],[2],[3], merged for MQ OCF script, so you may want to try to<br>

>>> backport them as well by the following guide [4]<br>

>>><br>

>>> [0] <a href="https://bugs.launchpad.net/fuel/+bug/1432603" target="_blank">https://bugs.launchpad.net/fuel/+bug/1432603</a><br>

>>> [1] <a href="https://review.openstack.org/#/c/175460/" target="_blank">https://review.openstack.org/#/c/175460/</a><br>

>>> [2] <a href="https://review.openstack.org/#/c/175457/" target="_blank">https://review.openstack.org/#/c/175457/</a><br>

>>> [3] <a href="https://review.openstack.org/#/c/175371/" target="_blank">https://review.openstack.org/#/c/175371/</a><br>

>>> [4] <a href="https://review.openstack.org/#/c/170476/" target="_blank">https://review.openstack.org/#/c/170476/</a><br>

>> Is there a reason you’re using a custom OCF script instead of the upstream[a] one?<br>

>> Please have a chat with David (the maintainer, in CC) if there is something you believe is wrong with it.<br>

>><br>

>> [a] <a href="https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster" target="_blank">https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster</a><br>

><br>

> I'm using the OCF script from the Fuel project, specifically from the<br>

> "6.0" stable branch [alpha].<br>

<br>

Ah, I’m still learning who is who... I thought you were part of that project :-)<br>

<br>

><br>

> Comparing with upstream OCF code, the main difference is that Fuel<br>

> RabbitMQ OCF is a master-slave resource. Fuel RabbitMQ OCF does more<br>

> bookkeeping, for example, blocking client access when RabbitMQ cluster<br>

> is not ready. I believe the upstream OCF should be OK to use as well<br>

> after I read the code, but it might not fit into the Fuel project. As<br>

> far as I have tested, the Fuel OCF script is good except that sometimes the full<br>

> reassemble time is long, and as I found out, it is mostly because the<br>

> Fuel MySQL Galera OCF script keeps pacemaker from promoting RabbitMQ<br>

> resource, as I mentioned in the previous emails.<br>

><br>

> Maybe Vladimir and Sergey can give us more insight on why Fuel needs a<br>

> master-slave RabbitMQ.<br>

<br>

That would be good to know.<br>

Browsing the agent, promote seems to be a no-op if rabbit is already running.<br>

<br></blockquote><div><br></div><div>As for the master/slave question: the reason is how the OCF script is structured to deal with RabbitMQ's poor ability to handle itself in some scenarios. Hopefully the state transition diagram [5] is enough to clarify what's going on.</div><div><br></div><div>[5] <a href="http://goo.gl/PPNrw7">http://goo.gl/PPNrw7</a></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

> I see Vladimir and Sergey worked on the original<br>

> Fuel blueprint "RabbitMQ cluster" [beta].<br>

><br>

> [alpha]<br>

> <a href="https://github.com/stackforge/fuel-library/blob/stable/6.0/deployment/puppet/nova/files/ocf/rabbitmq" target="_blank">https://github.com/stackforge/fuel-library/blob/stable/6.0/deployment/puppet/nova/files/ocf/rabbitmq</a><br>

> [beta]<br>

> <a href="https://blueprints.launchpad.net/fuel/+spec/rabbitmq-cluster-controlled-by-pacemaker" target="_blank">https://blueprints.launchpad.net/fuel/+spec/rabbitmq-cluster-controlled-by-pacemaker</a><br>

><br>

>>>> I did a little investigation and found some possible causes.<br>

>>>><br>

>>>> 1. MySQL Recovery Takes Too Long [1] and Blocking RabbitMQ Clustering in<br>

>>>> Pacemaker<br>

>>>><br>

>>>> The pacemaker resource p_mysql start timeout is set to 475s. Sometimes<br>

>>>> MySQL-wss fails to start after power failure, and pacemaker would wait<br>

>>>> 475s before retrying the start. The problem is that pacemaker divides<br>

>>>> resource state transitions into batches. Since RabbitMQ is master-slave<br>

>>>> resource, I assume that starting all the slaves and promoting master are<br>

>>>> put into two different batches. If, unfortunately, starting all the RabbitMQ<br>

>>>> slaves is put in the same batch as starting MySQL, even if RabbitMQ<br>

>>>> slaves and all other resources are ready, pacemaker will not continue<br>

>>>> but just wait for MySQL timeout.<br>

>>> Could you please elaborate on what the same/different batches are for MQ<br>

>>> and DB? Note, there are MQ clustering logic flow charts available here<br>

>>> [5] and we're planning to release a dedicated technical bulletin for this.<br>

>>><br>

>>> [5] <a href="http://goo.gl/PPNrw7" target="_blank">http://goo.gl/PPNrw7</a><br>

>>><br>

>>>> I can reproduce this by hard powering off all the controllers and starting<br>

>>>> them again. It's more likely to trigger MySQL failure in this way. Then<br>

>>>> I observe that if there is one cloned mysql instance not starting, the<br>

>>>> whole pacemaker cluster gets stuck and does not emit any log. On the<br>

>>>> host of the failed instance, I can see a mysql resource agent process<br>

>>>> calling the sleep command. If I kill that process, the pacemaker comes<br>

>>>> back alive and RabbitMQ master gets promoted. In fact this long timeout<br>

>>>> is blocking every resource from state transition in pacemaker.<br>

>>>><br>

>>>> This may be a known problem of pacemaker and there are some discussions<br>

>>>> in Linux-HA mailing list [2]. It might not be fixed in the near future.<br>

>>>> It seems that in general it is bad to have long timeouts in state transition<br>

>>>> actions (start/stop/promote/demote). There may be another way to<br>

>>>> implement the MySQL-wss resource agent: use a short start timeout and<br>

>>>> monitor the wss cluster state using the monitor action.<br>

>>> This is very interesting, thank you! I believe all commands for MySQL RA<br>

>>> OCF script should be as well wrapped with timeout -SIGTERM or -SIGKILL<br>

>>> as we did for MQ RA OCF. And there should not be any sleep calls. I<br>

>>> created a bug for this [6].<br>

>>><br>

>>> [6] <a href="https://bugs.launchpad.net/fuel/+bug/1449542" target="_blank">https://bugs.launchpad.net/fuel/+bug/1449542</a><br>

>>><br>
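Editorial sketch of the timeout wrapping suggested above, assuming GNU coreutils `timeout` is installed. The helper name `run_with_timeout` and the 5 second kill grace period are hypothetical and are not taken from the Fuel OCF scripts:<br>

```shell
# Hypothetical helper, not copied from the Fuel MQ or MySQL resource agents.
# Runs a command under coreutils `timeout`: SIGTERM is sent after $1 seconds,
# and SIGKILL follows 5 seconds later if the command is still alive.
run_with_timeout() {
    limit="$1"
    shift
    timeout -s TERM -k 5 "$limit" "$@"
}

# Example: the call below gives up after one second instead of blocking
# the caller for the full five seconds.
run_with_timeout 1 sleep 5 || echo "command did not finish in time"
```

With a wrapper like this, a hung command returns control to the agent after a bounded time, so the agent can report a failure rather than sit in a long sleep.<br>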

>>>> I also find a fix to improve MySQL start timeout [3]. It shortens the<br>

>>>> timeout to 300s. At the time of sending this email, I cannot find it in<br>

>>>> stable/6.0 branch. Maybe the maintainer needs to cherry-pick it to<br>

>>>> stable/6.0 ?<br>

>>>><br>

>>>> [1] <a href="https://bugs.launchpad.net/fuel/+bug/1441885" target="_blank">https://bugs.launchpad.net/fuel/+bug/1441885</a><br>

>>>> [2] <a href="http://lists.linux-ha.org/pipermail/linux-ha/2014-March/047989.html" target="_blank">http://lists.linux-ha.org/pipermail/linux-ha/2014-March/047989.html</a><br>

>>>> [3] <a href="https://review.openstack.org/#/c/171333/" target="_blank">https://review.openstack.org/#/c/171333/</a><br>

>>>><br>

>>>><br>

>>>> 2. RabbitMQ Resource Agent Breaks Existing Cluster<br>

>>>><br>

>>>> Reading the code of the RabbitMQ resource agent, I find it does the<br>

>>>> following to start RabbitMQ master-slave cluster.<br>

>>>> On all the controllers:<br>

>>>> (1) Start Erlang beam process<br>

>>>> (2) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)<br>

>>>> (3) Stop RabbitMQ App but do not stop the beam process<br>

>>>><br>

>>>> Then in pacemaker, all the RabbitMQ instances are in slave state. After<br>

>>>> pacemaker determines the master, it does the following.<br>

>>>> On the to-be-master host:<br>

>>>> (4) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)<br>

>>>> On the slave hosts:<br>

>>>> (5) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)<br>

>>>> (6) Join RabbitMQ cluster of the master host<br>

>>>><br>

>>> Yes, something like that. As I mentioned, there were several bug fixes<br>

>>> in the 6.1 dev, and you can also check the MQ clustering flow charts.<br>

>>><br>

>>>> As far as I can understand, this process is to make sure the master<br>

>>>> determined by pacemaker is the same as the master determined in RabbitMQ<br>

>>>> cluster. If there is no existing cluster, it's fine. If it is run after<br>

>>><br>

>>> Not exactly. There is no master in mirrored MQ cluster. We define the<br>

>>> rabbit_hosts configuration option from Oslo.messaging. This ensures all<br>

>>> queue masters will be spread across all of the MQ nodes in the long run. And<br>

>>> we use a master abstraction only for the Pacemaker RA clustering layer.<br>

>>> Here, a "master" is the MQ node that the rest of the MQ nodes join.<br>

>>><br>

>>>> power failure and recovery, it introduces a new problem.<br>

>>> We do erase the node master attribute in CIB for such cases. This should<br>

>>> not bring problems into the master election logic.<br>

>>><br>

>>>> After power recovery, if some of the RabbitMQ instances reach step (2)<br>

>>>> roughly at the same time (within 30s which is hard coded in RabbitMQ) as<br>

>>>> the original RabbitMQ master instance, they form the original cluster<br>

>>>> again and then shut down. The other instances have to wait for 30s<br>

>>>> before they report a failure waiting for tables, and are reset to a<br>

>>>> standalone cluster.<br>

>>>><br>

>>>> In RabbitMQ documentation [4], it is also mentioned that if we shut down<br>

>>>> RabbitMQ master, a new master is elected from the rest of slaves. If we<br>

>>> (Note, the RabbitMQ documentation mentions *queue* masters and slaves,<br>

>>> which are not the case for the Pacemaker RA clustering abstraction layer.)<br>

>>><br>

>>>> continue to shut down nodes in step (3), we reach a point that the last<br>

>>>> node is the RabbitMQ master, and pacemaker is not aware of it. I can see<br>

>>>> there is code that keeps a "rabbit-start-time" attribute in<br>

>>>> pacemaker to record the longest-lived instance to help pacemaker<br>

>>>> determine the master, but it does not cover the case mentioned above.<br>

>>> We made an assumption that the node with the highest MQ uptime should<br>

>>> know the most about recent cluster state, so other nodes must join it.<br>

>>> RA OCF does not work with queue masters directly.<br>

>>><br>
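Editorial sketch of the "highest uptime wins" assumption described above. This is not the Fuel agent's code: the function name `pick_oldest_node` and the input format (one "node start_epoch" pair per line) are hypothetical, chosen only to illustrate the selection rule:<br>

```shell
# Hypothetical illustration, not taken from the Fuel RabbitMQ OCF script.
# Reads lines of the form "node start_epoch" on stdin and prints the node
# with the smallest start epoch, i.e. the longest-running instance.
# Ties are broken by node name so the result is deterministic.
pick_oldest_node() {
    sort -k2,2n -k1,1 | awk 'NR==1 {print $1}'
}

# Example: node-2 started first, so it is the one the others should join.
printf 'node-1 300\nnode-2 100\nnode-3 200\n' | pick_oldest_node
```

The rule only compares start times, so it says nothing about which node was the last one running before a full outage, which is the gap discussed in this thread.<br>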

>>>> A recent patch [5] checks the existing "rabbit-master" attribute, but it<br>

>>>> does not cover the above case either.<br>

>>>><br>

>>>> So in step (4), pacemaker determines a different master which was a<br>

>>>> RabbitMQ slave last time. It would wait for its original RabbitMQ master<br>

>>>> for 30s and fail, then it gets reset to a standalone cluster. Here we<br>

>>>> get some different clusters, so in step (5) and (6), it is likely to<br>

>>>> report error in log saying timeout waiting for tables or fail to merge<br>

>>>> mnesia database schema, then those instances get reset. You can<br>

>>>> easily re-produce the case by hard resetting power of all the controllers.<br>

>>>><br>

>>>> As you can see, if you are unlucky, there would be several "30s timeout<br>

>>>> and reset" before you finally get a healthy RabbitMQ cluster.<br>

>>> The full MQ cluster reassemble logic is far from perfect,<br>

>>> indeed. This might erase all mnesia files, hence any custom entities,<br>

>>> like users or vhosts, would be removed as well. Note, we do not<br>

>>> configure durable queues for OpenStack so there is nothing to care about<br>

>>> here - the full cluster downtime assumes there will be no AMQP messages<br>

>>> stored at all.<br>

>>><br>

>>>> I find three possible solutions.<br>

>>>> A. Using rabbitmqctl force_boot option [6]<br>

>>>> It skips waiting for 30s and resetting the cluster, and just assumes the<br>

>>>> current node is the master and continues to operate. This is feasible<br>

>>>> because the original RabbitMQ master discards its local state and<br>

>>>> sync with the new master after it joins a new cluster [7]. So we can be<br>

>>>> sure that after step (4) and (6), the pacemaker determined master<br>

>>>> instance is started unconditionally, and it will be the same as RabbitMQ<br>

>>>> master, and all operations run without 30s timeout. I find this option<br>

>>>> is only available in newer RabbitMQ releases, and updating RabbitMQ might<br>

>>>> introduce other compatibility problems.<br>

>>> Yes, this option is only supported in the newest RabbitMQ versions. But we<br>

>>> definitely should look at how this could help.<br>

>>><br>

>>>> B. Turn RabbitMQ into cloned instance and use pause_minority instead of<br>

>>>> autoheal [8]<br>

>>> Indeed, there are cases when MQ's autoheal can do nothing with existing<br>

>>> partitions and remains partitioned forever, for example:<br>

>>><br>

>>> Masters: [ node-1 ]<br>

>>> Slaves: [ node-2 node-3 ]<br>

>>> root@node-1:~# rabbitmqctl cluster_status<br>

>>> Cluster status of node 'rabbit@node-1' ...<br>

>>> [{nodes,[{disc,['rabbit@node-1','rabbit@node-2']}]},<br>

>>> {running_nodes,['rabbit@node-1']},<br>

>>> {cluster_name,<<"rabbit@node-2">>},<br>

>>> {partitions,[]}]<br>

>>> ...done.<br>

>>> root@node-2:~# rabbitmqctl cluster_status<br>

>>> Cluster status of node 'rabbit@node-2' ...<br>

>>> [{nodes,[{disc,['rabbit@node-2']}]}]<br>

>>> ...done.<br>

>>> root@node-3:~# rabbitmqctl cluster_status<br>

>>> Cluster status of node 'rabbit@node-3' ...<br>

>>> [{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},<br>

>>> {running_nodes,['rabbit@node-3']},<br>

>>> {cluster_name,<<"rabbit@node-2">>},<br>

>>> {partitions,[]}]<br>

>>><br>

>>> So we should test the pause-minority value as well.<br>

>>> But I strongly believe we should make MQ multi-state clone to support<br>

>>> many masters, related bp [7]<br>

>>><br>

>>> [7]<br>

>>> <a href="https://blueprints.launchpad.net/fuel/+spec/rabbitmq-pacemaker-multimaster-clone" target="_blank">https://blueprints.launchpad.net/fuel/+spec/rabbitmq-pacemaker-multimaster-clone</a><br>

>>><br>

>>>> This works like MySQL-wss. It lets the RabbitMQ cluster itself deal with<br>

>>>> partitions in a manner similar to the pacemaker quorum mechanism. When there<br>

>>>> is a network partition, instances in the minority partition pause<br>

>>>> themselves automatically. Pacemaker does not have to track who is the<br>

>>>> RabbitMQ master, who lives longest, who to promote... It just starts all<br>

>>>> the clones, done. This leads to a huge change in the RabbitMQ resource agent,<br>

>>>> and the stability and other impacts remain to be tested.<br>

>>> Well, we should not confuse the queue masters with the multi-clone master for MQ<br>

>>> resource in the pacemaker.<br>

>>> As I said, pacemaker RA has nothing to do with queue masters. And we<br>

>>> introduced this "master" mostly in order to support the full cluster<br>

>>> reassemble case - there must be a node promoted and other nodes should join.<br>

>>><br>

>>>> C. Creating a "force_load" file<br>

>>>> After reading RabbitMQ source code, I find that the actual thing it does<br>

>>>> in solution A is just creating an empty file named "force_load" in<br>

>>>> mnesia database dir, then mnesia thinks it was the last node shut down<br>

>>>> last time and boots itself as the master. This implementation has stayed<br>

>>>> the same from v3.1.4 to the latest RabbitMQ master branch. I think we<br>

>>>> can make use of this little trick. The change is adding just one line in<br>

>>>> "try_to_start_rmq_app()" function.<br>

>>>><br>

>>>> touch "${MNESIA_FILES}/force_load" && \<br>

>>>> chown rabbitmq:rabbitmq "${MNESIA_FILES}/force_load"<br>

>>> This is a very good point, thank you.<br>

>>><br>
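Editorial sketch of the "force_load" trick from solution C, wrapped in a small helper so the directory check and ownership change are explicit. The function name `mark_force_load` and its arguments are hypothetical. In the agent the directory would be the `${MNESIA_FILES}` value quoted above and the owner `rabbitmq:rabbitmq`:<br>

```shell
# Hypothetical helper, not part of the Fuel RabbitMQ OCF script.
# Creates an empty "force_load" marker in the given mnesia directory.
#   $1  mnesia directory (must already exist)
#   $2  optional owner in user:group form, passed to chown when given
mark_force_load() {
    mnesia_dir="$1"
    owner="${2:-}"
    [ -d "$mnesia_dir" ] || return 1
    touch "${mnesia_dir}/force_load" || return 1
    if [ -n "$owner" ]; then
        chown "$owner" "${mnesia_dir}/force_load" || return 1
    fi
    return 0
}

# Example on a scratch directory; no ownership change is requested here.
scratch=$(mktemp -d)
mark_force_load "$scratch" || echo "could not create marker"
ls "$scratch"
rm -rf "$scratch"
```

Refusing to run when the directory is missing avoids silently creating the marker in the wrong place if the mnesia path is not what the agent expects.<br>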

>>>> [4] <a href="http://www.rabbitmq.com/ha.html" target="_blank">http://www.rabbitmq.com/ha.html</a><br>

>>>> [5] <a href="https://review.openstack.org/#/c/169291/" target="_blank">https://review.openstack.org/#/c/169291/</a><br>

>>>> [6] <a href="https://www.rabbitmq.com/clustering.html" target="_blank">https://www.rabbitmq.com/clustering.html</a><br>

>>>> [7] <a href="http://www.rabbitmq.com/partitions.html#recovering" target="_blank">http://www.rabbitmq.com/partitions.html#recovering</a><br>

>>>> [8] <a href="http://www.rabbitmq.com/partitions.html#automatic-handling" target="_blank">http://www.rabbitmq.com/partitions.html#automatic-handling</a><br>

>>>><br>

>>>> Maybe you have better ideas on this. Please share your thoughts.<br>

>>> Thank you for the thorough feedback! This was a really great job.<br>

>>><br>

>>>> ----<br>

>>>> Best wishes!<br>

>>>> Zhou Zheng Sheng / 周征晟  Software Engineer<br>

>>>> Beijing AWcloud Software Co., Ltd.<br>

>>>><br>

>>><br>

>>> --<br>

>>> Best regards,<br>

>>> Bogdan Dobrelya,<br>

>>> Skype #<a href="http://bogdando_at_yahoo.com" target="_blank">bogdando_at_yahoo.com</a><br>

>>> Irc #bogdando<br>

>>><br>

>>> __________________________________________________________________________<br>

>>> OpenStack Development Mailing List (not for usage questions)<br>

>>> Unsubscribe: <a href="http://OpenStack-dev-request@lists.openstack.org?subject:unsubscribe" target="_blank">OpenStack-dev-request@lists.openstack.org?subject:unsubscribe</a><br>

>>> <a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>

>><br>


<br>

<br>


</blockquote></div></div>