[openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

Zhou Zheng Sheng / 周征晟 zhengsheng at awcloud.com
Mon May 4 02:14:53 UTC 2015


Hello Sergii,

Thank you for the great explanation of the Galera OCF script. I replied to
your question inline.

on 2015/05/03 04:49, Sergii Golovatiuk wrote:
> Hi Zhou,
>
> Galera OCF script is a bit special. Since MySQL keeps the most
> important data, we should find the most recent data on all nodes across
> the cluster. check_if_galera_pc is specially designed for that. Every
> server registers the latest status from grastate.dat to CIB. Once all
> nodes are registered, the one with the most recent data will be
> selected as the Primary Component. All others should join that node. 5
> minutes is the time for all nodes to appear and register their position
> from grastate.dat to CIB. Usually, it happens much faster, though there
> are cases when a node is stuck in fsck or GRUB, or hit by a power
> outage, or similar. If all nodes are registered, there shouldn't be the
> 5 minute penalty timeout. If one node is stuck (but at least present in
> CIB), then all other nodes will wait for 5 minutes and then assemble
> the cluster without it.
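(If I understand the registration step correctly, it is roughly equivalent to
the sketch below. The attribute name and the grastate.dat path are only
illustrative; the real mysql-wss OCF script may use different names.

    # read the last committed seqno from the Galera state file
    SEQNO=$(awk '/^seqno:/ {print $2}' /var/lib/mysql/grastate.dat)
    # publish it to the CIB as a transient node attribute
    crm_attribute --node "$(hostname)" --lifetime reboot \
        --name mysql-seqno --update "${SEQNO}"

check_if_galera_pc can then compare the attribute on every registered node and
pick the one with the most recent position as the Primary Component.)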
>
> Concerning dependencies, I agree that RabbitMQ may start in parallel
> with the Galera cluster assembly procedure. It makes no sense to start
> other services earlier, as they depend on Galera and RabbitMQ.
>
> Also, I have a quick question for you. Shutting down all three
> controllers is a unique case, like a power outage of a whole
> datacenter (DC). In this case, a 5 minute delay is very small compared
> to the DC recovery procedure. A reboot of one controller is a more
> optimistic scenario. What is the special case where you restart all 3-5
> at once?

Sorry, I am not sure what "3-5" refers to. Is the question about why we want
to make the full reassemble time short, and why this case is important to us?

We have some small customers forming a long tail in the local market. They
have neither dedicated datacenter rooms nor dual power supplies. Some of
them even shut down all the machines when they go home, and start all of
the machines when they come back to work. Considering data privacy, they
are not willing to put their virtual machines on a public cloud. Usually,
this kind of customer does not have the IT skills to troubleshoot a full
reassemble process. We want to make this process as simple as turning on
all the machines at roughly the same time and waiting several minutes, so
they don't have to call our service team.

>
> Also, I would like to say a big thank for digging it out. It's very
> useful to use your findings in our next steps.
>
>
> --
> Best regards,
> Sergii Golovatiuk,
> Skype #golserge
> IRC #holser
>
> On Wed, Apr 29, 2015 at 9:38 AM, Zhou Zheng Sheng / 周征晟
> <zhengsheng at awcloud.com> wrote:
>
>     Hi!
>
>     Thank you very much Vladimir and Bogdan! Thanks for the fast
>     response and rich information.
>
>     I backported the MySQL and RabbitMQ OCF patches from stable/6.0 and
>     tested again. A full reassemble takes about 5 minutes now, which is a
>     big improvement. Adding the "force_load" trick I mentioned in the
>     previous email, it takes about 4 minutes.
>
>     I get that there is not really a RabbitMQ master instance, because
>     queue masters are spread across all the RabbitMQ instances. The
>     pacemaker master is an abstract one. However, there is still an mnesia
>     node from which other mnesia nodes sync the table schema. The exception
>     "timeout_waiting_for_tables" in the log is actually reported by mnesia.
>     By default, it places a mark on the last alive mnesia node, and other
>     nodes have to sync tables from it
>     (http://www.erlang.org/doc/apps/mnesia/Mnesia_chap7.html#id78477).
>     RabbitMQ clustering inherits this behavior, and the last RabbitMQ
>     instance shut down must be the first instance to start. Otherwise it
>     produces "timeout_waiting_for_tables"
>     (http://www.rabbitmq.com/clustering.html#transcript, search for "the
>     last node to go down").
>
>     The 1 minute difference is because, without "force_load", the abstract
>     master determined by pacemaker during a promote action may not be the
>     last RabbitMQ instance shut down in the last "start" action. So there
>     is a chance for "rabbitmqctl start_app" to wait 30s and trigger the
>     RabbitMQ exception "timeout_waiting_for_tables". We may see a table
>     timeout and an mnesia reset once during a reassemble process on some of
>     the RabbitMQ instances, but it only introduces 30s of wait, which is
>     acceptable for me.
>
>     I also inspected the RabbitMQ resource agent code in the latest master
>     branch. There are timeout wrappers and other improvements, which are
>     great. It does not change the master promotion process much, so it may
>     still run into the problems I described.
>
>     Please see the inline reply below.
>
>     on 2015/04/28 21:15, Bogdan Dobrelya wrote:
>     >> Hello,
>     > Hello, Zhou
>     >
>     >> I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long
>     >> after a power failure. I have a running HA environment, then I reset the
>     >> power of all the machines at the same time. I observe that after reboot
>     >> it usually takes 10 minutes for the RabbitMQ cluster to appear running
>     >> in master-slave mode in pacemaker. If I power off all the 3 controllers
>     >> and only start 2 of them, the downtime can sometimes be as long as 20
>     >> minutes.
>     > Yes, this is a known issue [0]. Note, there were many bugfixes, like
>     > [1],[2],[3], merged for MQ OCF script, so you may want to try to
>     > backport them as well by the following guide [4]
>     >
>     > [0] https://bugs.launchpad.net/fuel/+bug/1432603
>     > [1] https://review.openstack.org/#/c/175460/
>     > [2] https://review.openstack.org/#/c/175457/
>     > [3] https://review.openstack.org/#/c/175371/
>     > [4] https://review.openstack.org/#/c/170476/
>     >
>     >> I did a little investigation and found some possible causes.
>     >>
>     >> 1. MySQL Recovery Takes Too Long [1] and Blocks RabbitMQ Clustering in
>     >> Pacemaker
>     >>
>     >> The pacemaker resource p_mysql start timeout is set to 475s. Sometimes
>     >> MySQL-wss fails to start after a power failure, and pacemaker would
>     >> wait 475s before retrying to start it. The problem is that pacemaker
>     >> divides resource state transitions into batches. Since RabbitMQ is a
>     >> master-slave resource, I assume that starting all the slaves and
>     >> promoting the master are put into two different batches. If,
>     >> unfortunately, starting all the RabbitMQ slaves is put in the same
>     >> batch as starting MySQL, then even if the RabbitMQ slaves and all other
>     >> resources are ready, pacemaker will not continue but will just wait for
>     >> the MySQL timeout.
>     > Could you please elaborate on what the same/different batches for MQ
>     > and DB are? Note, there are MQ clustering logic flow charts available
>     > here [5], and we're planning to release a dedicated technical bulletin
>     > for this.
>     >
>     > [5] http://goo.gl/PPNrw7
>
>     "Batch" is a pacemaker concept I found when I was reading its
>     documentation and code. There is a "batch-limit: 30" in the output of
>     "pcs property list --all". The official pacemaker documentation explains
>     it as "The number of jobs that the TE is allowed to execute in
>     parallel." From my understanding, pacemaker maintains cluster states,
>     and when we start/stop/promote/demote a resource, it triggers a state
>     transition. Pacemaker puts as many transition jobs as possible into a
>     batch and processes them in parallel.
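(For reference, this is where the value can be seen; just a quick way to
inspect or experiment with it, not a recommendation to change it:

    pcs property list --all | grep batch-limit
    # batch-limit: 30
    # purely for experimenting with the batching behaviour:
    pcs property set batch-limit=60
)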
>
>     The problem is that pacemaker can only promote a resource after it
>     detects the resource is started. During a full reassemble, in the first
>     transition batch, pacemaker starts all the resources, including MySQL
>     and RabbitMQ. Pacemaker issues the resource agent "start" invocations in
>     parallel and reaps the results.
>
>     For a multi-state resource agent like RabbitMQ, pacemaker needs the
>     start result reported in the first batch; then the transition engine and
>     policy engine decide whether to retry starting or to promote, and put
>     this new transition job into a new batch.
>
>     I see improvements to put individual commands inside a timeout wrapper
>     in the RabbitMQ resource agent, and a bug created yesterday to do the
>     same for mysql-wss. This should help, but there is a loop in the
>     mysql-wss function "check_if_galera_pc". It checks the MySQL state every
>     10s until timeout. From the pacemaker point of view, the resource agent
>     invocation takes as long as 300s once it enters this function. So even
>     if the other resource agent invocations return, as long as the MySQL
>     resource agent does not return, the current batch is not done yet, and
>     pacemaker does not start the next batch. The MySQL resource agent has a
>     start timeout set to 300s (previously 475s). During these 300s, the
>     cluster does not respond to any state transition calls for any of the
>     resources. It looks as if pacemaker gets stuck from the user's point of
>     view.
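(To illustrate why this blocks the whole batch, the wait has roughly the shape
below. This is a simplified sketch, not the actual mysql-wss code, and
"check_galera_position" is only a placeholder for the real CIB comparison:

    wait_for_primary_component() {
        local waited=0
        while [ "${waited}" -lt 300 ]; do
            # placeholder for the real check against positions registered in CIB
            check_galera_position && return 0
            sleep 10
            waited=$((waited + 10))
        done
        return 1
    }

As long as the "start" action sits in this loop, pacemaker sees one
long-running resource agent invocation and cannot finish the current batch.)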
>     >> I can reproduce this by hard powering off all the controllers and
>     >> starting them again. It's more likely to trigger a MySQL failure this
>     >> way. Then I observe that if there is one cloned mysql instance not
>     >> starting, the whole pacemaker cluster gets stuck and does not emit any
>     >> log. On the host of the failed instance, I can see a mysql resource
>     >> agent process calling the sleep command. If I kill that process, the
>     >> pacemaker comes back alive and the RabbitMQ master gets promoted. In
>     >> fact this long timeout is blocking every resource from state
>     >> transition in pacemaker.
>     >>
>     >> This may be a known problem of pacemaker, and there are some
>     >> discussions on the Linux-HA mailing list [2]. It might not be fixed in
>     >> the near future. It seems that in general it's bad to have long
>     >> timeouts in state transition actions (start/stop/promote/demote).
>     >> There may be another way to implement the MySQL-wss resource agent:
>     >> use a short start timeout and monitor the wss cluster state using the
>     >> monitor action.
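(What I have in mind is something like the sketch below: let "start" return
quickly, and let "monitor" report whether the node has reached the Primary
Component. This is only an idea, not code from the actual mysql-wss agent:

    galera_monitor() {
        local status
        # Galera reports "Primary" here once the node is part of the
        # Primary Component
        status=$(mysql -N -s -e "SHOW STATUS LIKE 'wsrep_cluster_status'" \
                 2>/dev/null | awk '{print $2}')
        [ "${status}" = "Primary" ] && return "${OCF_SUCCESS:-0}"
        return "${OCF_NOT_RUNNING:-7}"
    }
)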
>     > This is very interesting, thank you! I believe all commands in the
>     > MySQL RA OCF script should be wrapped with timeout -SIGTERM or -SIGKILL
>     > as well, as we did for the MQ RA OCF. And there should not be any sleep
>     > calls. I created a bug for this [6].
>     >
>     > [6] https://bugs.launchpad.net/fuel/+bug/1449542
>
>     Thank you! We might not avoid all the sleep calls, but I agree most of
>     the commands can be put in a timeout wrapper to prevent unexpected
>     stalls.
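(For example, something along these lines with the coreutils timeout command;
the wrapped command and the 30s value are only illustrative:

    if ! timeout -s KILL 30 mysql -e "SELECT 1" >/dev/null 2>&1; then
        ocf_log err "MySQL did not answer within 30 seconds"
        return "${OCF_ERR_GENERIC:-1}"
    fi
)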
>     >> I also found a fix to improve the MySQL start timeout [3]. It shortens
>     >> the timeout to 300s. At the time I am sending this email, I cannot
>     >> find it in the stable/6.0 branch. Maybe the maintainer needs to
>     >> cherry-pick it to stable/6.0?
>     >>
>     >> [1] https://bugs.launchpad.net/fuel/+bug/1441885
>     >> [2]
>     http://lists.linux-ha.org/pipermail/linux-ha/2014-March/047989.html
>     >> [3] https://review.openstack.org/#/c/171333/
>     >>
>     >>
>     >> 2. RabbitMQ Resource Agent Breaks Existing Cluster
>     >>
>     >> Reading the code of the RabbitMQ resource agent, I find it does the
>     >> following to start the RabbitMQ master-slave cluster.
>     >> On all the controllers:
>     >> (1) Start the Erlang beam process
>     >> (2) Start the RabbitMQ app (if it fails, reset the mnesia DB and
>     >>     cluster state)
>     >> (3) Stop the RabbitMQ app but do not stop the beam process
>     >>
>     >> Then in pacemaker, all the RabbitMQ instances are in slave state.
>     >> After pacemaker determines the master, it does the following.
>     >> On the to-be-master host:
>     >> (4) Start the RabbitMQ app (if it fails, reset the mnesia DB and
>     >>     cluster state)
>     >> On the slave hosts:
>     >> (5) Start the RabbitMQ app (if it fails, reset the mnesia DB and
>     >>     cluster state)
>     >> (6) Join the RabbitMQ cluster of the master host
>     >>
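(For readers not familiar with rabbitmqctl, steps (5) and (6) boil down to the
standard join sequence below; the resource agent wraps it with extra checks,
resets and timeouts, and "node-1" is just an example master host name:

    rabbitmqctl stop_app
    rabbitmqctl join_cluster rabbit@node-1
    rabbitmqctl start_app
)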
>     > Yes, something like that. As I mentioned, there were several bug
>     fixes
>     > in the 6.1 dev, and you can also check the MQ clustering flow
>     charts.
>     >
>     >> As far as I can understand, this process is to make sure the master
>     >> determined by pacemaker is the same as the master determined in the
>     >> RabbitMQ cluster. If there is no existing cluster, it's fine. If it is
>     >> run after
>     >
>     > Not exactly. There is no master in a mirrored MQ cluster. We define the
>     > rabbit_hosts configuration option from Oslo.messaging, which ensures
>     > all queue masters will be spread across all of the MQ nodes in the long
>     > run. And we use a master abstraction only for the Pacemaker RA
>     > clustering layer. Here, a "master" is the MQ node which the rest of the
>     > MQ nodes join.
>     >
>     Thank you very much for explaining this. I just had a look at the output
>     of "rabbitmqctl list_queues name slave_pids synchronised_slave_pids",
>     and it is indeed as you said.
>     >> power failure and recovery, it introduces a new problem.
>     > We do erase the node master attribute in CIB for such cases.
>     This should
>     > not bring problems into the master election logic.
>
>     The problem is described at the front of the mail.
>     >> After power recovery, if some of the RabbitMQ instances reach step (2)
>     >> roughly at the same time (within 30s, which is hard-coded in RabbitMQ)
>     >> as the original RabbitMQ master instance, they form the original
>     >> cluster again and then shut down. The other instances would have to
>     >> wait for 30s before they report a failure waiting for tables, and then
>     >> be reset to standalone clusters.
>     >>
>     >> In the RabbitMQ documentation [4], it is also mentioned that if we
>     >> shut down the RabbitMQ master, a new master is elected from the rest
>     >> of the slaves. If we
>     > (Note, the RabbitMQ documentation mentions *queue* masters and slaves,
>     > which are not the case for the Pacemaker RA clustering abstraction
>     > layer.)
>     Thank you for clarifying this.
>     >> continue to shut down nodes in step (3), we reach a point where the
>     >> last node is the RabbitMQ master, and pacemaker is not aware of it. I
>     >> can see there is code bookkeeping a "rabbit-start-time" attribute in
>     >> pacemaker to record the longest-lived instance to help pacemaker
>     >> determine the master, but it does not cover the case mentioned above.
>     > We made an assumption that the node with the highest MQ uptime should
>     > know the most about the recent cluster state, so other nodes must join
>     > it. The RA OCF does not work with queue masters directly.
>     OK. However, I still observed the "timeout_waiting_for_tables" exception
>     in the RabbitMQ log. It's not related to queue masters though, sorry. It
>     should be related to the order of shutting down and starting up RabbitMQ
>     (actually the underlying mnesia). So the previous statement should be
>     changed to the following.
>
>     If we continue to shut down nodes in step (3), we reach a point where
>     the RabbitMQ instances being taken down remember that their mnesia
>     tables should sync from other instances on the next boot, the last
>     RabbitMQ instance considers itself the table sync source, and pacemaker
>     is not aware of any of this.
>
>     I can see there is code bookkeeping a "rabbit-start-time" attribute in
>     pacemaker to record the longest-lived instance to help pacemaker
>     determine the master, but it does not cover the case mentioned above. So
>     chances are the pacemaker master is not the last instance shut down, and
>     it then runs into "timeout_waiting_for_tables" during promotion.
>
>     >> A recent patch [5] checks the existing "rabbit-master" attribute, but
>     >> it does not cover the above case either.
>     >>
>     >> So in step (4), pacemaker may determine a different master, one which
>     >> was a RabbitMQ slave last time. It would wait for its original RabbitMQ
>     >> master for 30s and fail, then get reset to a standalone cluster. Here
>     >> we end up with several different clusters, so in steps (5) and (6), it
>     >> is likely to report errors in the log about timeouts waiting for
>     >> tables or failures to merge the mnesia database schema, and then those
>     >> instances get reset. You can easily reproduce this case by hard
>     >> resetting the power of all the controllers.
>     >>
>     >> As you can see, if you are unlucky, there would be several "30s
>     >> timeout and reset" rounds before you finally get a healthy RabbitMQ
>     >> cluster.
>     > The full MQ cluster reassemble logic is far from the perfect state,
>     > indeed. This might erase all mnesia files, hence any custom entities,
>     > like users or vhosts, would be removed as well. Note, we do not
>     > configure durable queues for Openstack so there is nothing to care
>     > about here - the full cluster downtime assumes there will be no AMQP
>     > messages stored at all.
>
>     I also noticed we don't have durable queues; that's why I think the
>     "force_load" trick and "rabbitmqctl force_boot" are OK.
>     >> I find three possible solutions.
>     >> A. Using the rabbitmqctl force_boot option [6]
>     >> It skips the 30s wait and the cluster reset, and just assumes the
>     >> current node is the master and continues to operate. This is feasible
>     >> because the original RabbitMQ master would discard its local state and
>     >> sync with the new master after it joins a new cluster [7]. So we can
>     >> be sure that after steps (4) and (6), the pacemaker-determined master
>     >> instance is started unconditionally, it will be the same as the
>     >> RabbitMQ master, and all operations run without the 30s timeout. I
>     >> find this option is only available in newer RabbitMQ releases, and
>     >> updating RabbitMQ might introduce other compatibility problems.
>     > Yes, this option is only supported by the newest RabbitMQ versions.
>     > But we definitely should look at how this could help.
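(For reference, the usage is simply the following, run while the RabbitMQ app
is stopped. As far as I know the sub-command appeared around RabbitMQ 3.3, so
the exact minimum version should be double-checked:

    rabbitmqctl stop_app        # if the app is currently running
    rabbitmqctl force_boot      # boot unconditionally next time, do not wait
                                # for the last node to go down
    rabbitmqctl start_app
)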
>     >
>     >> B. Turning RabbitMQ into a cloned instance and using pause_minority
>     >> instead of autoheal [8]
>     > Indeed, there are cases when MQ's autoheal can do nothing with existing
>     > partitions and remains partitioned forever, for example:
>     >
>     > Masters: [ node-1 ]
>     > Slaves: [ node-2 node-3 ]
>     > root@node-1:~# rabbitmqctl cluster_status
>     > Cluster status of node 'rabbit@node-1' ...
>     > [{nodes,[{disc,['rabbit@node-1','rabbit@node-2']}]},
>     > {running_nodes,['rabbit@node-1']},
>     > {cluster_name,<<"rabbit@node-2">>},
>     > {partitions,[]}]
>     > ...done.
>     > root@node-2:~# rabbitmqctl cluster_status
>     > Cluster status of node 'rabbit@node-2' ...
>     > [{nodes,[{disc,['rabbit@node-2']}]}]
>     > ...done.
>     > root@node-3:~# rabbitmqctl cluster_status
>     > Cluster status of node 'rabbit@node-3' ...
>     > [{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},
>     > {running_nodes,['rabbit@node-3']},
>     > {cluster_name,<<"rabbit@node-2">>},
>     > {partitions,[]}]
>     This is terrible. It looks like a RabbitMQ bug. I am not sure whether
>     the "force_load" trick and "rabbitmqctl force_boot" would introduce new
>     problems in such a case.
>     > So we should test the pause-minority value as well.
>     > But I strongly believe we should make MQ a multi-state clone to
>     > support many masters; see the related bp [7].
>     >
>     > [7] https://blueprints.launchpad.net/fuel/+spec/rabbitmq-pacemaker-multimaster-clone
>     Looks good. It seems enabling pause-minority does not conflict with a
>     multi-master RabbitMQ cluster; maybe we can take this into consideration
>     when doing this BP. It would also be nice to have RAM nodes to improve
>     performance.
>
>     >> This works like MySQL-wss. It lets the RabbitMQ cluster itself deal
>     >> with partitions in a manner similar to the pacemaker quorum mechanism.
>     >> When there is a network partition, instances in the minority partition
>     >> pause themselves automatically. Pacemaker does not have to track who
>     >> the RabbitMQ master is, who lives longest, whom to promote... It just
>     >> starts all the clones, done. This leads to a huge change in the
>     >> RabbitMQ resource agent, and the stability and other impacts are still
>     >> to be tested.
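(For completeness, switching the partition handling mode itself is a one-line
setting in rabbitmq.config. A sketch of what the deployment would have to
render, using the classic Erlang-term format; in reality the existing config
has many more settings, this only shows the relevant option:

    cat > /etc/rabbitmq/rabbitmq.config <<'EOF'
    [{rabbit, [{cluster_partition_handling, pause_minority}]}].
    EOF

Today we use "autoheal", so this would replace that value.)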
>     > Well, we should not mix up the queue masters and the multi-state clone
>     > master for the MQ resource in pacemaker.
>     > As I said, the pacemaker RA has nothing to do with queue masters. And
>     > we introduced this "master" mostly in order to support the full cluster
>     > reassemble case - there must be a promoted node, and the other nodes
>     > should join it.
>     >
>     >> C. Creating a "force_load" file
>     >> After reading the RabbitMQ source code, I find that the actual thing
>     >> solution A does is just creating an empty file named "force_load" in
>     >> the mnesia database dir; then mnesia thinks it was the last node shut
>     >> down last time and boots itself as the master. This implementation has
>     >> stayed the same from v3.1.4 to the latest RabbitMQ master branch. I
>     >> think we can make use of this little trick. The change is adding just
>     >> one line in the "try_to_start_rmq_app()" function.
>     >>
>     >> touch "${MNESIA_FILES}/force_load" && \
>     >>   chown rabbitmq:rabbitmq "${MNESIA_FILES}/force_load"
>     > This is a very good point, thank you.
>     >
>     >> [4] http://www.rabbitmq.com/ha.html
>     >> [5] https://review.openstack.org/#/c/169291/
>     >> [6] https://www.rabbitmq.com/clustering.html
>     >> [7] http://www.rabbitmq.com/partitions.html#recovering
>     >> [8] http://www.rabbitmq.com/partitions.html#automatic-handling
>     >>
>     >> Maybe you have better ideas on this. Please share your thoughts.
>     > Thank you for the thorough feedback! This was a really great job.
>     Thank you for such a good explanation. I was not clear about the queue
>     master and confused it with the mnesia sync source.
>     >> ----
>     >> Best wishes!
>     >> Zhou Zheng Sheng / 周征晟  Software Engineer
>     >> Beijing AWcloud Software Co., Ltd.
>     >>
>
>
>
>
>