[openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
Andrew Woodward
xarses at gmail.com
Tue May 19 20:05:15 UTC 2015
On Thu, May 7, 2015 at 5:01 PM Andrew Beekhof <abeekhof at redhat.com> wrote:
>
> > On 5 May 2015, at 1:19 pm, Zhou Zheng Sheng / 周征晟 <
> zhengsheng at awcloud.com> wrote:
> >
> > Thank you Andrew.
> >
> > on 2015/05/05 08:03, Andrew Beekhof wrote:
> >>> On 28 Apr 2015, at 11:15 pm, Bogdan Dobrelya <bdobrelia at mirantis.com>
> wrote:
> >>>
> >>>> Hello,
> >>> Hello, Zhou
> >>>
> >>>> I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long
> >>>> after a power failure. I have a running HA environment, then I reset
> >>>> power of all the machines at the same time. I observe that after reboot
> >>>> it usually takes 10 minutes for the RabbitMQ cluster to appear running in
> >>>> master-slave mode in pacemaker. If I power off all 3 controllers and
> >>>> only start 2 of them, the downtime can sometimes be as long as 20
> >>>> minutes.
> >>> Yes, this is a known issue [0]. Note, there were many bugfixes, like
> >>> [1],[2],[3], merged for the MQ OCF script, so you may want to try to
> >>> backport them as well, following the guide [4]
> >>>
> >>> [0] https://bugs.launchpad.net/fuel/+bug/1432603
> >>> [1] https://review.openstack.org/#/c/175460/
> >>> [2] https://review.openstack.org/#/c/175457/
> >>> [3] https://review.openstack.org/#/c/175371/
> >>> [4] https://review.openstack.org/#/c/170476/
> >> Is there a reason you’re using a custom OCF script instead of the
> upstream[a] one?
> >> Please have a chat with David (the maintainer, in CC) if there is
> something you believe is wrong with it.
> >>
> >> [a]
> https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster
> >
> > I'm using the OCF script from the Fuel project, specifically from the
> > "6.0" stable branch [alpha].
>
> Ah, I’m still learning who is who... I thought you were part of that
> project :-)
>
> >
> > Comparing with the upstream OCF code, the main difference is that the Fuel
> > RabbitMQ OCF is a master-slave resource. The Fuel RabbitMQ OCF does more
> > bookkeeping, for example, blocking client access when the RabbitMQ cluster
> > is not ready. I believe the upstream OCF should be OK to use as well
> > after reading the code, but it might not fit into the Fuel project. As far
> > as I have tested, the Fuel OCF script is good, except that sometimes the
> > full reassemble time is long, and as I found out, it is mostly because the
> > Fuel MySQL Galera OCF script keeps pacemaker from promoting the RabbitMQ
> > resource, as I mentioned in the previous emails.
> >
> > Maybe Vladimir and Sergey can give us more insight on why Fuel needs a
> > master-slave RabbitMQ.
>
> That would be good to know.
> Browsing the agent, promote seems to be a no-op if rabbit is already
> running.
>
>
As to the master/slave question: it's due to how the OCF script is structured
to deal with RabbitMQ's poor ability to handle itself in some scenarios.
Hopefully the state transition diagram [5] is enough to clarify what's
going on.
[5] http://goo.gl/PPNrw7
> > I see Vladimir and Sergey worked on the original
> > Fuel blueprint "RabbitMQ cluster" [beta].
> >
> > [alpha]
> >
> https://github.com/stackforge/fuel-library/blob/stable/6.0/deployment/puppet/nova/files/ocf/rabbitmq
> > [beta]
> >
> https://blueprints.launchpad.net/fuel/+spec/rabbitmq-cluster-controlled-by-pacemaker
> >
> >>>> I did a little investigation and found out there are some possible
> >>>> causes.
> >>>>
> >>>> 1. MySQL Recovery Takes Too Long [1] and Blocks RabbitMQ Clustering in
> >>>> Pacemaker
> >>>>
> >>>> The pacemaker resource p_mysql start timeout is set to 475s. Sometimes
> >>>> MySQL-wss fails to start after a power failure, and pacemaker would wait
> >>>> 475s before retrying to start it. The problem is that pacemaker divides
> >>>> resource state transitions into batches. Since RabbitMQ is a master-slave
> >>>> resource, I assume that starting all the slaves and promoting the master
> >>>> are put into two different batches. If, unfortunately, starting all the
> >>>> RabbitMQ slaves is put in the same batch as starting MySQL, then even if
> >>>> the RabbitMQ slaves and all other resources are ready, pacemaker will not
> >>>> continue but just wait for the MySQL timeout.
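> >>>>
> >>>> For illustration, a rough sketch of how the start timeout could be
> >>>> inspected and shortened with the standard pacemaker tooling (resource
> >>>> name p_mysql as above; whether crmsh or pcs is available depends on the
> >>>> deployment):
> >>>>
> >>>> # show the primitive definition, including the configured op timeouts
> >>>> crm configure show p_mysql
> >>>> # or shorten the start timeout, e.g. to 300s, using pcs
> >>>> pcs resource update p_mysql op start timeout=300s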
> >>> Could you please elaborate on what the same/different batches are for MQ
> >>> and DB? Note, there is an MQ clustering logic flow chart available here
> >>> [5], and we're planning to release a dedicated technical bulletin for
> >>> this.
> >>>
> >>> [5] http://goo.gl/PPNrw7
> >>>
> >>>> I can reproduce this by hard powering off all the controllers and
> >>>> starting them again. It's more likely to trigger a MySQL failure this
> >>>> way. Then I observe that if there is one cloned MySQL instance not
> >>>> starting, the whole pacemaker cluster gets stuck and does not emit any
> >>>> log. On the host of the failed instance, I can see a MySQL resource
> >>>> agent process calling the sleep command. If I kill that process,
> >>>> pacemaker comes back alive and the RabbitMQ master gets promoted. In
> >>>> fact this long timeout blocks every resource from state transition in
> >>>> pacemaker.
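> >>>>
> >>>> A rough sketch of the manual workaround (the process names are what I
> >>>> saw on my nodes and may differ):
> >>>>
> >>>> # find the MySQL resource agent shell and the sleep it is stuck in
> >>>> ps -ef | grep -E 'ocf.*mysql|mysql-wss' | grep -v grep
> >>>> # kill the sleeping child found above to unblock the transition
> >>>> kill <PID-of-the-sleep>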
> >>>>
> >>>> This may be a known problem of pacemaker and there are some discussions
> >>>> in the Linux-HA mailing list [2]. It might not be fixed in the near
> >>>> future. It seems that, in general, it's bad to have long timeouts in
> >>>> state transition actions (start/stop/promote/demote). There may be
> >>>> another way to implement the MySQL-wss resource agent: use a short start
> >>>> timeout and monitor the wss cluster state in the monitor action.
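> >>>>
> >>>> A minimal sketch of such a monitor action (the credentials file path is
> >>>> illustrative; the OCF return code variables come from ocf-shellfuncs):
> >>>>
> >>>> mysql_monitor() {
> >>>>     local state
> >>>>     state=$(mysql --defaults-file=/root/.my.cnf -Nse \
> >>>>         "SHOW STATUS LIKE 'wsrep_local_state_comment'" | awk '{print $2}')
> >>>>     if [ "$state" = "Synced" ]; then
> >>>>         return $OCF_SUCCESS
> >>>>     elif [ -n "$state" ]; then
> >>>>         return $OCF_ERR_GENERIC   # running but not yet synced
> >>>>     else
> >>>>         return $OCF_NOT_RUNNING
> >>>>     fi
> >>>> }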
> >>> This is very interesting, thank you! I believe all commands in the MySQL
> >>> RA OCF script should be wrapped with timeout -SIGTERM or -SIGKILL as
> >>> well, as we did for the MQ RA OCF. And there should not be any sleep
> >>> calls. I created a bug for this [6].
> >>>
> >>> [6] https://bugs.launchpad.net/fuel/+bug/1449542
> >>>
> >>>> I also found a fix to improve the MySQL start timeout [3]. It shortens
> >>>> the timeout to 300s. At the time of sending this email, I cannot find it
> >>>> in the stable/6.0 branch. Maybe the maintainer needs to cherry-pick it to
> >>>> stable/6.0?
> >>>>
> >>>> [1] https://bugs.launchpad.net/fuel/+bug/1441885
> >>>> [2]
> http://lists.linux-ha.org/pipermail/linux-ha/2014-March/047989.html
> >>>> [3] https://review.openstack.org/#/c/171333/
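> >>>>
> >>>> For reference, backporting [3] would look roughly like this (a sketch;
> >>>> the Gerrit patchset number and repository URL are assumptions):
> >>>>
> >>>> git clone https://github.com/stackforge/fuel-library && cd fuel-library
> >>>> git checkout -b backport-171333 origin/stable/6.0
> >>>> git fetch https://review.openstack.org/stackforge/fuel-library \
> >>>>     refs/changes/33/171333/<patchset> && git cherry-pick FETCH_HEAD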
> >>>>
> >>>>
> >>>> 2. RabbitMQ Resource Agent Breaks Existing Cluster
> >>>>
> >>>> Reading the code of the RabbitMQ resource agent, I find it does the
> >>>> following to start the RabbitMQ master-slave cluster (a rough
> >>>> rabbitmqctl sketch follows the steps).
> >>>> On all the controllers:
> >>>> (1) Start Erlang beam process
> >>>> (2) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
> >>>> (3) Stop RabbitMQ App but do not stop the beam process
> >>>>
> >>>> Then in pacemaker, all the RabbitMQ instances are in slave state. After
> >>>> pacemaker determines the master, it does the following.
> >>>> On the to-be-master host:
> >>>> (4) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
> >>>> On the slave hosts:
> >>>> (5) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
> >>>> (6) Join RabbitMQ cluster of the master host
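> >>>>
> >>>> In terms of plain rabbitmqctl calls, the sequence is roughly the
> >>>> following (a simplified sketch, not the actual RA code; note that
> >>>> join_cluster must run while the app is stopped, which the agent handles
> >>>> with more ceremony):
> >>>>
> >>>> # start action, on every controller
> >>>> rabbitmq-server -detached
> >>>> rabbitmqctl start_app || { rabbitmqctl force_reset; rabbitmqctl start_app; }
> >>>> rabbitmqctl stop_app
> >>>> # promote action, on the node pacemaker chose as master
> >>>> rabbitmqctl start_app
> >>>> # on the slave hosts afterwards: start the app, then (re)join the master
> >>>> rabbitmqctl start_app
> >>>> rabbitmqctl stop_app && rabbitmqctl join_cluster rabbit@<master-host> && \
> >>>>     rabbitmqctl start_app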
> >>>>
> >>> Yes, something like that. As I mentioned, there were several bug fixes
> >>> in the 6.1 dev, and you can also check the MQ clustering flow charts.
> >>>
> >>>> As far as I can understand, this process is to make sure the master
> >>>> determined by pacemaker is the same as the master determined in the
> >>>> RabbitMQ cluster. If there is no existing cluster, it's fine. If it is
> >>>> run after
> >>>
> >>> Not exactly. There is no master in a mirrored MQ cluster. We define the
> >>> rabbit_hosts configuration option from Oslo.messaging, which ensures all
> >>> queue masters will be spread around all of the MQ nodes in the long run.
> >>> And we use a master abstraction only for the Pacemaker RA clustering
> >>> layer. Here, a "master" is the MQ node that the rest of the MQ nodes join.
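> >>>
> >>> For example, the OpenStack services end up with something like this in
> >>> their oslo.messaging configuration (addresses are illustrative):
> >>>
> >>> grep -E '^rabbit_(hosts|ha_queues)' /etc/nova/nova.conf
> >>> # rabbit_hosts=192.168.0.2:5672,192.168.0.3:5672,192.168.0.4:5672
> >>> # rabbit_ha_queues=True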
> >>>
> >>>> power failure and recovery, it introduces a new problem.
> >>> We do erase the node master attribute in CIB for such cases. This should
> >>> not bring problems into the master election logic.
> >>>
> >>>> After power recovery, if some of the RabbitMQ instances reach step (2)
> >>>> roughly at the same time (within 30s, which is hard coded in RabbitMQ)
> >>>> as the original RabbitMQ master instance, they form the original cluster
> >>>> again and then shut down. The other instances have to wait for 30s
> >>>> before they report a failure waiting for tables, and are reset to
> >>>> standalone clusters.
> >>>>
> >>>> In the RabbitMQ documentation [4], it is also mentioned that if we shut
> >>>> down the RabbitMQ master, a new master is elected from the rest of the
> >>>> slaves. If we
> >>> (Note, the RabbitMQ documentation talks about *queue* masters and slaves,
> >>> which are not the same thing as the master in the Pacemaker RA clustering
> >>> abstraction layer.)
> >>>
> >>>> continue to shut down nodes in step (3), we reach a point where the last
> >>>> node standing is the RabbitMQ master, and pacemaker is not aware of it.
> >>>> I can see there is code for bookkeeping a "rabbit-start-time" attribute
> >>>> in pacemaker to record the longest-lived instance and help pacemaker
> >>>> determine the master, but it does not cover the case mentioned above.
> >>> We made an assumption that the node with the highest MQ uptime should
> >>> know the most about the recent cluster state, so other nodes must join
> >>> it. The RA OCF does not work with queue masters directly.
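> >>>
> >>> For the record, that bookkeeping can be inspected per node with
> >>> crm_attribute (attribute name as mentioned above; add "-l reboot" if the
> >>> RA stores it with reboot lifetime):
> >>>
> >>> crm_attribute --node node-1 --name rabbit-start-time --query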
> >>>
> >>>> A recent patch [5] checks the existing "rabbit-master" attribute, but it
> >>>> does not cover the above case either.
> >>>>
> >>>> So in step (4), pacemaker determines a different master, one that was a
> >>>> RabbitMQ slave last time. It waits for its original RabbitMQ master for
> >>>> 30s and fails, then gets reset to a standalone cluster. Here we end up
> >>>> with several different clusters, so in steps (5) and (6) it is likely to
> >>>> report errors in the log saying timeout waiting for tables, or fail to
> >>>> merge the mnesia database schema, and then those instances get reset.
> >>>> You can easily reproduce the case by hard resetting the power of all the
> >>>> controllers.
> >>>>
> >>>> As you can see, if you are unlucky, there will be several rounds of "30s
> >>>> timeout and reset" before you finally get a healthy RabbitMQ cluster.
> >>> The full MQ cluster reassemble logic is far from perfect, indeed. It
> >>> might erase all mnesia files, hence any custom entities, like users or
> >>> vhosts, would be removed as well. Note, we do not configure durable
> >>> queues for OpenStack, so there is nothing to care about here - the full
> >>> cluster downtime assumes there will be no AMQP messages stored at all.
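> >>>
> >>> This is easy to confirm on a live cluster - no OpenStack queue should
> >>> show up as durable:
> >>>
> >>> rabbitmqctl list_queues name durable | grep -w true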
> >>>
> >>>> I find three possible solutions.
> >>>> A. Using rabbitmqctl force_boot option [6]
> >>>> It skips the 30s wait and the cluster reset, and just assumes the
> >>>> current node is the master and continues to operate. This is feasible
> >>>> because the original RabbitMQ master would discard its local state and
> >>>> sync with the new master after it joins a new cluster [7]. So we can be
> >>>> sure that after steps (4) and (6), the pacemaker-determined master
> >>>> instance is started unconditionally, it will be the same as the RabbitMQ
> >>>> master, and all operations run without the 30s timeout. I find this
> >>>> option is only available in newer RabbitMQ releases, and updating
> >>>> RabbitMQ might introduce other compatibility problems.
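> >>>>
> >>>> For example, on a node that refuses to come up because its former peers
> >>>> are gone (a sketch, assuming a force_boot-capable RabbitMQ):
> >>>>
> >>>> # run while the rabbit node is stopped; the next start then boots from
> >>>> # local state instead of waiting 30s for the old cluster members
> >>>> rabbitmqctl force_boot
> >>>> service rabbitmq-server start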
> >>> Yes, this option is only supported in the newest RabbitMQ versions. But
> >>> we definitely should look at how this could help.
> >>>
> >>>> B. Turn RabbitMQ into a cloned instance and use pause_minority instead
> >>>> of autoheal [8]
> >>> Indeed, there are cases when MQ's autoheal can do nothing with existing
> >>> partitions and the cluster remains partitioned forever, for example:
> >>>
> >>> Masters: [ node-1 ]
> >>> Slaves: [ node-2 node-3 ]
> >>> root@node-1:~# rabbitmqctl cluster_status
> >>> Cluster status of node 'rabbit@node-1' ...
> >>> [{nodes,[{disc,['rabbit@node-1','rabbit@node-2']}]},
> >>> {running_nodes,['rabbit@node-1']},
> >>> {cluster_name,<<"rabbit@node-2">>},
> >>> {partitions,[]}]
> >>> ...done.
> >>> root@node-2:~# rabbitmqctl cluster_status
> >>> Cluster status of node 'rabbit@node-2' ...
> >>> [{nodes,[{disc,['rabbit@node-2']}]}]
> >>> ...done.
> >>> root@node-3:~# rabbitmqctl cluster_status
> >>> Cluster status of node 'rabbit@node-3' ...
> >>> [{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},
> >>> {running_nodes,['rabbit@node-3']},
> >>> {cluster_name,<<"rabbit@node-2">>},
> >>> {partitions,[]}]
> >>>
> >>> So we should test the pause-minority value as well.
> >>> But I strongly believe we should make the MQ resource a multi-state clone
> >>> to support many masters; related bp [7]
> >>>
> >>> [7]
> >>>
> https://blueprints.launchpad.net/fuel/+spec/rabbitmq-pacemaker-multimaster-clone
> >>>
> >>>> This works like MySQL-wss. It lets the RabbitMQ cluster itself deal with
> >>>> partitions in a manner similar to the pacemaker quorum mechanism. When
> >>>> there is a network partition, instances in the minority partition pause
> >>>> themselves automatically. Pacemaker does not have to track who the
> >>>> RabbitMQ master is, who lives longest, whom to promote... It just starts
> >>>> all the clones, done. This leads to a huge change in the RabbitMQ
> >>>> resource agent, and the stability and other impacts are yet to be tested.
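> >>>>
> >>>> Concretely, this option would boil down to something like the following
> >>>> (a sketch only; the resource and clone names are illustrative, and the
> >>>> rabbitmq.config line replaces the current autoheal setting):
> >>>>
> >>>> # /etc/rabbitmq/rabbitmq.config:
> >>>> #   {rabbit, [{cluster_partition_handling, pause_minority}]}.
> >>>> # and in pacemaker, a plain clone instead of the master/slave resource:
> >>>> crm configure clone clone_p_rabbitmq-server p_rabbitmq-server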
> >>> Well, we should not mix up queue masters and the multi-state clone master
> >>> for the MQ resource in pacemaker.
> >>> As I said, the pacemaker RA has nothing to do with queue masters. And we
> >>> introduced this "master" mostly in order to support the full cluster
> >>> reassemble case - there must be a node promoted, and the other nodes
> >>> should join it.
> >>>
> >>>> C. Creating a "force_load" file
> >>>> After reading the RabbitMQ source code, I find that what solution A
> >>>> actually does is just create an empty file named "force_load" in the
> >>>> mnesia database dir; mnesia then thinks this node was the last one shut
> >>>> down last time and boots itself as the master. This implementation has
> >>>> stayed the same from v3.1.4 to the latest RabbitMQ master branch. I
> >>>> think we can make use of this little trick. The change is adding just
> >>>> one line in the "try_to_start_rmq_app()" function:
> >>>>
> >>>> touch "${MNESIA_FILES}/force_load" && \
> >>>> chown rabbitmq:rabbitmq "${MNESIA_FILES}/force_load"
> >>> This is a very good point, thank you.
> >>>
> >>>> [4] http://www.rabbitmq.com/ha.html
> >>>> [5] https://review.openstack.org/#/c/169291/
> >>>> [6] https://www.rabbitmq.com/clustering.html
> >>>> [7] http://www.rabbitmq.com/partitions.html#recovering
> >>>> [8] http://www.rabbitmq.com/partitions.html#automatic-handling
> >>>>
> >>>> Maybe you have better ideas on this. Please share your thoughts.
> >>> Thank you for the thorough feedback! This was a really great job.
> >>>
> >>>> ----
> >>>> Best wishes!
> >>>> Zhou Zheng Sheng / 周征晟  Software Engineer
> >>>> Beijing AWcloud Software Co., Ltd.
> >>>>
> >>>
> >>> --
> >>> Best regards,
> >>> Bogdan Dobrelya,
> >>> Skype #bogdando_at_yahoo.com
> >>> Irc #bogdando
> >>>
> >>>
> >>
> >>
>
>
>