<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hello,<br>
<br>
I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long
after a power failure. I have a running HA environment; when I reset
the power of all the machines at the same time, I observe that after
reboot it usually takes 10 minutes for the RabbitMQ cluster to appear
running in master-slave mode in Pacemaker. If I power off all 3
controllers and only start 2 of them, the downtime can sometimes be as
long as 20 minutes.<br>
<br>
I did a little investigation and found some possible causes.<br>
<br>
1. MySQL Recovery Takes Too Long [1], Blocking RabbitMQ Clustering in
Pacemaker<br>
<br>
The Pacemaker resource p_mysql start timeout is set to 475s. Sometimes
MySQL-wss fails to start after a power failure, and Pacemaker waits
475s before retrying to start it. The problem is that Pacemaker divides
resource state transitions into batches. Since RabbitMQ is a
master-slave resource, I assume that starting all the slaves and
promoting the master are put into two different batches. If,
unfortunately, starting the RabbitMQ slaves lands in the same batch as
the MySQL start, then even if the RabbitMQ slaves and all other
resources are ready, Pacemaker will not continue but just waits for the
MySQL timeout.<br>
<br>
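For reference, you can inspect this timeout on a controller with the
crm shell (the definition below is only an illustrative shape, not the
exact Fuel output):<br>
<br>
crm configure show p_mysql<br>
# illustrative output:<br>
# primitive p_mysql ocf:fuel:mysql-wss \<br>
#   op start timeout=475 interval=0 ...<br>
<br>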
I can reproduce this by hard powering off all the controllers and
starting them again; this makes a MySQL failure more likely. I then
observe that if one cloned MySQL instance fails to start, the whole
Pacemaker cluster gets stuck and does not emit any log. On the host of
the failed instance, I can see a MySQL resource agent process calling
the sleep command. If I kill that process, Pacemaker comes back alive
and the RabbitMQ master gets promoted. In effect, this long timeout
blocks every resource in Pacemaker from state transitions.<br>
<br>
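You can spot the stuck state in the process table; a rough diagnostic
sketch (the agent's process name is an assumption from my environment,
and killing the sleep is a workaround, not a fix):<br>
<br>
ps axf | grep -A1 mysql-wss # look for a sleep child under the agent<br>
kill $SLEEP_PID # the PID of that sleep; Pacemaker then moves on<br>
<br>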
This may be a known problem in Pacemaker, and there are some
discussions on the Linux-HA mailing list [2]. It might not be fixed in
the near future. In general, it seems bad to have long timeouts on
state transition actions (start/stop/promote/demote). There may be
another way to implement the MySQL-wss resource agent: use a short
start timeout and check the wss cluster state in the monitor action
instead, as sketched below.<br>
<br>
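A minimal sketch of that idea (not the real MySQL-wss agent; it assumes
the standard OCF return-code variables are sourced, that the Galera
wsrep_ready status reflects wss cluster membership, and it omits
credentials):<br>
<br>
mysql_monitor() {<br>
    # wsrep_ready turns ON only after this node joins the Galera cluster<br>
    ready=$(mysql --batch --skip-column-names \<br>
        -e "SHOW STATUS LIKE 'wsrep_ready'" | awk '{print $2}')<br>
    if [ "$ready" = "ON" ]; then<br>
        return $OCF_SUCCESS # running and clustered<br>
    fi<br>
    return $OCF_NOT_RUNNING # not ready yet; no long start timeout needed<br>
}<br>
<br>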
I also found a fix that improves the MySQL start timeout [3]. It
shortens the timeout to 300s. At the time of sending this email, I
cannot find it in the stable/6.0 branch. Maybe the maintainer needs to
cherry-pick it to stable/6.0?<br>
<br>
[1] <a class="moz-txt-link-freetext" href="https://bugs.launchpad.net/fuel/+bug/1441885">https://bugs.launchpad.net/fuel/+bug/1441885</a><br>
[2]
<a class="moz-txt-link-freetext" href="http://lists.linux-ha.org/pipermail/linux-ha/2014-March/047989.html">http://lists.linux-ha.org/pipermail/linux-ha/2014-March/047989.html</a><br>
[3] <a class="moz-txt-link-freetext" href="https://review.openstack.org/#/c/171333/">https://review.openstack.org/#/c/171333/</a><br>
<br>
<br>
2. RabbitMQ Resource Agent Breaks the Existing Cluster<br>
<br>
Reading the code of the RabbitMQ resource agent, I find it does the
following to start the RabbitMQ master-slave cluster (a simplified
rabbitmqctl sketch follows the list).<br>
On all the controllers:<br>
(1) Start the Erlang beam process<br>
(2) Start the RabbitMQ app (if this fails, reset the Mnesia DB and
cluster state)<br>
(3) Stop the RabbitMQ app but do not stop the beam process<br>
<br>
Then, in Pacemaker, all the RabbitMQ instances are in slave state.
After Pacemaker determines the master, it does the following.<br>
On the to-be-master host:<br>
(4) Start the RabbitMQ app (if this fails, reset the Mnesia DB and
cluster state)<br>
On the slave hosts:<br>
(5) Start the RabbitMQ app (if this fails, reset the Mnesia DB and
cluster state)<br>
(6) Join the RabbitMQ cluster of the master host<br>
<br>
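Roughly the same flow in plain rabbitmqctl terms (a simplified sketch,
not the agent's exact code; note that rabbitmqctl itself requires the
app to be stopped while joining, and the node name is an example):<br>
<br>
# on every controller:<br>
service rabbitmq-server start # (1)+(2) boot the beam process and the app<br>
rabbitmqctl stop_app # (3) stop the app, keep beam running<br>
# after Pacemaker picks the master, on the master:<br>
rabbitmqctl start_app # (4)<br>
# on each slave:<br>
rabbitmqctl join_cluster rabbit@node-1 # (6) app must be stopped here<br>
rabbitmqctl start_app # (5)<br>
<br>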
As far as I understand, this process is meant to ensure that the master
determined by Pacemaker is the same as the master elected within the
RabbitMQ cluster. If there is no existing cluster, this is fine. If it
runs after a power failure and recovery, it introduces a new
problem.<br>
<br>
After power recovery, if some of the RabbitMQ instances reach step (2)
at roughly the same time as the original RabbitMQ master instance
(within 30s, a limit hard-coded in RabbitMQ), they re-form the original
cluster and then shut down again. Each of the other instances has to
wait 30s before it reports a failure waiting for tables and is reset to
a standalone cluster.<br>
<br>
The RabbitMQ documentation [4] also mentions that if we shut down the
RabbitMQ master, a new master is elected from the remaining slaves. If
we continue shutting down nodes in step (3), we reach a point where the
last node standing is the RabbitMQ master, and Pacemaker is not aware
of it. I can see there is code that bookkeeps a "rabbit-start-time"
attribute in Pacemaker to record the longest-lived instance and help
Pacemaker determine the master, but it does not cover the case
mentioned above. A recent patch [5] checks the existing "rabbit-master"
attribute, but it does not cover the above case either.<br>
<br>
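You can watch this bookkeeping from any controller; a small sketch (the
attribute name is the one mentioned above; transient attributes may
need --lifetime reboot, and node-1 is an example):<br>
<br>
crm_mon -A1 | grep rabbit # one-shot status including node attributes<br>
crm_attribute --node node-1 --name rabbit-start-time --query<br>
<br>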
So in step (4), Pacemaker may determine a different master, one that
was a RabbitMQ slave last time. It waits 30s for its original RabbitMQ
master, fails, and then gets reset to a standalone cluster. Now we have
several different clusters, so in steps (5) and (6) the instances are
likely to log errors about timeouts waiting for tables, or fail to
merge the Mnesia database schema, and then those instances get reset.
You can easily reproduce this case by hard resetting the power of all
the controllers.<br>
<br>
As you can see, if you are unlucky, there can be several rounds of "30s
timeout and reset" before you finally get a healthy RabbitMQ
cluster.<br>
<br>
I see three possible solutions.<br>
A. Use the rabbitmqctl force_boot option [6]<br>
It skips the 30s wait and the cluster reset; the node simply assumes it
is the master and continues to operate. This is feasible because the
original RabbitMQ master discards its local state and syncs with the
new master after it joins a new cluster [7]. So we can be sure that
after steps (4) and (6), the Pacemaker-determined master instance is
started unconditionally, it will be the same as the RabbitMQ master,
and all operations run without the 30s timeout. However, this option is
only available in newer RabbitMQ releases, and updating RabbitMQ might
introduce other compatibility problems.<br>
<br>
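In command form the idea is simple (a sketch; it assumes a RabbitMQ
release that ships force_boot, and would run on the node Pacemaker
picked as master, before its app starts):<br>
<br>
rabbitmqctl force_boot # boot even if this node was not the last one down<br>
rabbitmqctl start_app<br>
<br>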
B. Turn RabbitMQ into a cloned resource and use pause_minority instead
of autoheal [8]<br>
This works like MySQL-wss. It lets the RabbitMQ cluster itself deal
with partitions, in a manner similar to the Pacemaker quorum mechanism.
When there is a network partition, instances in the minority partition
pause themselves automatically. Pacemaker does not have to track who
the RabbitMQ master is, who has lived longest, or whom to promote; it
just starts all the clones, done. However, this means a huge change to
the RabbitMQ resource agent, and the stability and other impacts remain
to be tested.<br>
<br>
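For completeness, the corresponding rabbitmq.config stanza would look
like this (cluster_partition_handling is the real RabbitMQ setting [8];
the surrounding file content is omitted in this sketch):<br>
<br>
[<br>
  {rabbit, [{cluster_partition_handling, pause_minority}]}<br>
].<br>
<br>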
C. Create a "force_load" file<br>
After reading the RabbitMQ source code, I find that all solution A
actually does is create an empty file named "force_load" in the Mnesia
database directory; Mnesia then assumes this node was the last one shut
down and boots itself as the master. This implementation has stayed the
same from v3.1.4 to the latest RabbitMQ master branch. I think we can
make use of this little trick. The change is adding just one line to
the "try_to_start_rmq_app()" function.<br>
<br>
touch "${MNESIA_FILES}/force_load" && \<br>
chown rabbitmq:rabbitmq "${MNESIA_FILES}/force_load"<br>
<br>
[4] <a class="moz-txt-link-freetext" href="http://www.rabbitmq.com/ha.html">http://www.rabbitmq.com/ha.html</a><br>
[5] <a class="moz-txt-link-freetext" href="https://review.openstack.org/#/c/169291/">https://review.openstack.org/#/c/169291/</a><br>
[6] <a class="moz-txt-link-freetext" href="https://www.rabbitmq.com/clustering.html">https://www.rabbitmq.com/clustering.html</a><br>
[7] <a class="moz-txt-link-freetext" href="http://www.rabbitmq.com/partitions.html#recovering">http://www.rabbitmq.com/partitions.html#recovering</a><br>
[8] <a class="moz-txt-link-freetext" href="http://www.rabbitmq.com/partitions.html#automatic-handling">http://www.rabbitmq.com/partitions.html#automatic-handling</a><br>
<br>
Maybe you have better ideas on this. Please share your thoughts.<br>
<br>
----<br>
Best wishes!<br>
Zhou Zheng Sheng / 周征晟 Software Engineer<br>
Beijing AWcloud Software Co., Ltd.<br>
</body>
</html>