<html>

  <head>

    <meta content="text/html; charset=UTF-8" http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    Hello Sergii,<br>

    <br>

    Thank you for the great explanation of the Galera OCF script. I have

    replied to your question inline.<br>

    <br>

    <div class="moz-cite-prefix">on 2015/05/03 04:49, Sergii Golovatiuk

      wrote:<br>

    </div>

    <blockquote

cite="mid:CA+HkNVtfCADs3ATihHXszUKoDD6Xi5XgGEf-dHLVqu3Mzi8irQ@mail.gmail.com"

      type="cite">

      <div dir="ltr">

        <div>

          <div>

            <div>

              <div>Hi Zhou,<br>

                <br>

              </div>

              The Galera OCF script is a bit special. Since MySQL keeps the

              most important data, we should find the most recent data on

              all nodes across the cluster. check_if_galera_pc is

              specially designed for that. Every server registers its

              latest status from grastate.dat in the CIB. Once all nodes are

              registered, the one with the most recent data is

              selected as the Primary Component. All the others should join

              that node. The 5 minutes is the time allowed for all nodes to

              appear and register their position from grastate.dat in the

              CIB. Usually it takes much less time, though there are cases

              when a node is stuck on fsck, GRUB, a power outlet, or

              something else. If all nodes are registered, there should be

              no 5 minute penalty timeout. If one node is stuck (but at

              least present in the CIB), then all the other nodes wait for

              5 minutes and then assemble the cluster without it.<br>

              <br>

            </div>

            Concerning dependencies, I agree that RabbitMQ may start in

            parallel with the Galera cluster assembly procedure. It makes

            no sense to start the other services, as they depend on

            Galera and RabbitMQ.<br>

            <br>

          </div>

          Also, I have a quick question for you. Shutting down all three

          controllers is an unusual case, like a power outage across a

          whole datacenter (DC). In that case, a 5 minute delay is very

          small compared to the DC recovery procedure. A reboot of one

          controller is a more optimistic scenario. What is the special

          case that requires restarting all 3-5 at once?<br>

        </div>

      </div>

    </blockquote>
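    <br>

    To check my understanding of the selection step, here is a minimal
    sketch. It is not the actual check_if_galera_pc code: the helper name
    pick_most_recent, the "node=seqno" argument format and the numbers are
    all made up for illustration, and I assume the registered position is
    the integer seqno from grastate.dat, where the highest value means the
    most recent data.<br>

    <pre>
```shell
# Hypothetical helper, not part of the real OCF script.
# Arguments are "node=seqno" pairs; prints the node with the highest seqno.
pick_most_recent() {
    printf '%s\n' "$@" | sort -t '=' -k 2,2n | tail -n 1 | cut -d '=' -f 1
}

# Made-up positions registered by three nodes.
pick_most_recent node-1=1520 node-2=1534 node-3=1498   # prints node-2
```
</pre>

    If I read your explanation correctly, once all three positions are
    registered the choice can be made at once, and the 5 minute wait only
    matters while a node is still missing.<br>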

    <br>

    Sorry, I am not sure what "3-5" refers to. Is the

    question why we want to shorten the full reassemble time,

    and why this case is important for us?<br>

    <br>

    We have some small customers forming a long tail in the local market.

    They have neither dedicated datacenter facilities nor dual power

    supplies. Some of them even shut down all the machines when they go

    home, and start them all again when they come to work. Out of concern

    for data privacy, they are not willing to put their virtual machines

    on a public cloud. Usually, this kind of customer doesn't have the IT

    skills to troubleshoot a full reassemble process. We want to make

    this process as simple as turning on all the machines at roughly the

    same time and waiting several minutes, so they don't need to call our

    service team.<br>

    <br>

    <blockquote

cite="mid:CA+HkNVtfCADs3ATihHXszUKoDD6Xi5XgGEf-dHLVqu3Mzi8irQ@mail.gmail.com"

      type="cite">

      <div dir="ltr">

        <div><br>

        </div>

        Also, I would like to say a big thank you for digging this out.

        Your findings will be very useful in our next steps.<br>


      </div>

      <div class="gmail_extra"><br clear="all">

        <div>

          <div class="gmail_signature">

            <div dir="ltr">--<br>

              Best regards,<br>

              Sergii Golovatiuk,<br>

              Skype #golserge<br>

              IRC #holser<br>

            </div>

          </div>

        </div>

        <br>

        <div class="gmail_quote">On Wed, Apr 29, 2015 at 9:38 AM, Zhou

          Zheng Sheng / 周征晟 <span dir="ltr">&lt;<a

              moz-do-not-send="true"

              href="mailto:zhengsheng@awcloud.com" target="_blank">zhengsheng@awcloud.com</a>&gt;</span>

          wrote:<br>

          <blockquote class="gmail_quote" style="margin:0 0 0

            .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi!<br>

            <br>

            Thank you very much Vladimir and Bogdan! Thanks for the fast

            response and<br>

            rich information.<br>

            <br>

            I backported the MySQL and RabbitMQ OCF patches from stable/6.0

            and tested<br>

            again. A full reassemble now takes about 5 minutes, which is a

            big improvement. Adding<br>

            the "force_load" trick I mentioned in the previous email, it

            takes about<br>

            4 minutes.<br>

            <br>

            I get that there is not really a RabbitMQ master instance

            because queue<br>

            masters are spread across all the RabbitMQ instances. The

            pacemaker master is<br>

            an abstract one. However, there is still an mnesia node from

            which other<br>

            mnesia nodes sync the table schema. The exception<br>

            "timeout_waiting_for_tables" in the log is actually reported by

            mnesia. By<br>

            default, it places a mark on the last alive mnesia node, and

            other nodes<br>

            have to sync tables from it<br>

            (<a moz-do-not-send="true"

              href="http://www.erlang.org/doc/apps/mnesia/Mnesia_chap7.html#id78477"

              target="_blank">http://www.erlang.org/doc/apps/mnesia/Mnesia_chap7.html#id78477</a>).<br>

            RabbitMQ clustering inherits the behavior, and the last

            RabbitMQ<br>

            instance shutdown must be the first instance to start.

            Otherwise it<br>

            produces "timeout_waiting_for_tables"<br>

            (<a moz-do-not-send="true"

              href="http://www.rabbitmq.com/clustering.html#transcript"

              target="_blank">http://www.rabbitmq.com/clustering.html#transcript</a>

            search for "the last<br>

            node to go down").<br>

            <br>

            The 1 minute difference is because without "force_load", the

            abstract<br>

            master determined by pacemaker during a promote action may

            not be the<br>

            last RabbitMQ instance shut down in the last "start" action.

            So there is<br>

            a chance that "rabbitmqctl start_app" waits 30s and triggers

            the RabbitMQ<br>

            exception "timeout_waiting_for_tables". We may see a

            table timeout<br>

            and an mnesia reset once during a reassemble process on

            some of the<br>

            RabbitMQ instances, but it only introduces 30s of waiting,

            which is<br>

            acceptable to me.<br>

            <br>

            I also inspected the RabbitMQ resource agent code in the latest

            master branch.<br>

            There are a timeout wrapper and other improvements, which are

            great. They<br>

            do not change the master promotion process much, so it may

            still run<br>

            into the problems I described.<br>

            <br>

            Please see the inline reply below.<br>

            <div>

              <div class="h5"><br>

                on 2015/04/28/ 21:15, Bogdan Dobrelya wrote:<br>

                >> Hello,<br>

                > Hello, Zhou<br>

                ><br>

                >> I am using Fuel 6.0.1 and find that the RabbitMQ

                recovery time is long after<br>

                >> a power failure. I have a running HA environment,

                then I reset the power of<br>

                >> all the machines at the same time. I observe

                that after reboot it<br>

                >> usually takes 10 minutes for the RabbitMQ cluster

                to appear running in<br>

                >> master-slave mode in pacemaker. If I power off

                all 3 controllers and<br>

                >> only start 2 of them, the downtime sometimes

                can be as long as 20 minutes.<br>

                > Yes, this is a known issue [0]. Note, there were

                many bugfixes, like<br>

                > [1],[2],[3], merged for MQ OCF script, so you may

                want to try to<br>

                > backport them as well by the following guide [4]<br>

                ><br>

                > [0] <a moz-do-not-send="true"

                  href="https://bugs.launchpad.net/fuel/+bug/1432603"

                  target="_blank">https://bugs.launchpad.net/fuel/+bug/1432603</a><br>

                > [1] <a moz-do-not-send="true"

                  href="https://review.openstack.org/#/c/175460/"

                  target="_blank">https://review.openstack.org/#/c/175460/</a><br>

                > [2] <a moz-do-not-send="true"

                  href="https://review.openstack.org/#/c/175457/"

                  target="_blank">https://review.openstack.org/#/c/175457/</a><br>

                > [3] <a moz-do-not-send="true"

                  href="https://review.openstack.org/#/c/175371/"

                  target="_blank">https://review.openstack.org/#/c/175371/</a><br>

                > [4] <a moz-do-not-send="true"

                  href="https://review.openstack.org/#/c/170476/"

                  target="_blank">https://review.openstack.org/#/c/170476/</a><br>

                ><br>

                >> I did a little investigation and found

                some possible causes.<br>

                >><br>

                >> 1. MySQL Recovery Takes Too Long [1] and

                Blocking RabbitMQ Clustering in<br>

                >> Pacemaker<br>

                >><br>

                >> The pacemaker resource p_mysql start timeout is

                set to 475s. Sometimes<br>

                >> MySQL-wss fails to start after power failure,

                and pacemaker would wait<br>

                >> 475s before retry starting it. The problem is

                that pacemaker divides<br>

                >> resource state transitions into batches. Since

                RabbitMQ is master-slave<br>

                >> resource, I assume that starting all the slaves

                and promoting master are<br>

                >> put into two different batches. If

                unfortunately starting all RabbitMQ<br>

                >> slaves are put in the same batch as MySQL

                starting, even if RabbitMQ<br>

                >> slaves and all other resources are ready,

                pacemaker will not continue<br>

                >> but just wait for MySQL timeout.<br>

                > Could you please elaborate on what the

                same/different batches are for MQ<br>

                > and DB? Note, there are MQ clustering logic flow

                charts available here<br>

                > [5] and we're planning to release a dedicated

                technical bulletin for this.<br>

                ><br>

                > [5] <a moz-do-not-send="true"

                  href="http://goo.gl/PPNrw7" target="_blank">http://goo.gl/PPNrw7</a><br>

                <br>

              </div>

            </div>

            Batch is a pacemaker concept I found when I was reading its<br>

            documentation and code. There is a "batch-limit: 30" in the

            output of<br>

            "pcs property list --all". The official pacemaker

            documentation<br>

            explains it as "The number of jobs that the TE is

            allowed to<br>

            execute in parallel." From my understanding, pacemaker

            maintains the cluster<br>

            state, and when we start/stop/promote/demote a resource, it

            triggers a<br>

            state transition. Pacemaker puts as many transition jobs as

            possible<br>

            into a batch, and processes them in parallel.<br>

            <br>

            The problem is that pacemaker can only promote a resource

            after it<br>

            detects that the resource is started. During a full reassemble,

            in the first<br>

            transition batch, pacemaker starts all the resources

            including MySQL and<br>

            RabbitMQ. Pacemaker issues the resource agent "start"

            invocations in parallel<br>

            and reaps the results.<br>

            <br>

            For a multi-state resource agent like RabbitMQ, pacemaker

            needs the<br>

            start result reported in the first batch; then the transition

            engine and<br>

            policy engine decide whether to retry starting or to promote,

            and put<br>

            this new transition job into a new batch.<br>

            <br>

            I see improvements that put individual commands inside a

            timeout wrapper<br>

            in the RabbitMQ resource agent, and a bug created yesterday to

            do the same<br>

            for mysql-wss. This should help, but there is a loop in the

            mysql-wss<br>

            function "check_if_galera_pc". It checks the MySQL state every

            10s until<br>

            timeout. From the pacemaker point of view, the resource

            agent invocation<br>

            takes as long as 300s once it enters this function. So even

            if the other<br>

            resource agent invocations return, as long as the MySQL resource

            agent does<br>

            not return, the current batch is not done yet, and

            pacemaker does not<br>

            start the next batch. The MySQL resource agent has a start

            timeout set to<br>

            300s (previously 475s). During these 300s, the cluster does

            not respond<br>

            to any state transition calls for any of the resources. It

            looks as if<br>

            pacemaker gets stuck from the user point of view.<br>
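            <br>

            To illustrate the shape of such a loop, here is a minimal
            sketch. It is not the real mysql-wss code; the function name
            poll_until and its arguments are hypothetical.<br>

            <pre>
```shell
# Hypothetical helper: run a check command every INTERVAL seconds until it
# succeeds or TIMEOUT seconds of sleeping have accumulated.
# Usage: poll_until TIMEOUT INTERVAL command [args...]
poll_until() {
    timeout_s=$1
    interval_s=$2
    shift 2
    elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1    # gave up: the check never succeeded
        fi
        sleep "$interval_s"
        elapsed=$((elapsed + interval_s))
    done
    return 0            # the check succeeded
}
```
</pre>

            With a 300s limit and a 10s interval, a check that never
            succeeds keeps the caller busy for the whole 300s, which is
            the stall I describe above.<br>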

            <span class="">>> I can re-produce this by hard

              powering off all the controllers and start<br>

              >> them again. It's more likely to trigger MySQL

              failure in this way. Then<br>

              >> I observe that if there is one cloned mysql

              instance not starting, the<br>

              >> whole pacemaker cluster gets stuck and does not

              emit any log. On the<br>

              >> host of the failed instance, I can see a mysql

              resource agent process<br>

              >> calling the sleep command. If I kill that

              process, the pacemaker comes<br>

              >> back alive and RabbitMQ master gets promoted. In

              fact this long timeout<br>

              >> is blocking every resource from state transition

              in pacemaker.<br>

              >><br>

              >> This may be a known problem of pacemaker and there

              are some discussions<br>

              >> on the Linux-HA mailing list [2]. It might not be

              fixed in the near future.<br>

              >> It seems that in general it's bad to have a long

              timeout in state transition<br>

              >> actions (start/stop/promote/demote). There may be

              another way to<br>

              >> implement MySQL-wss resource agent to use a short

              start timeout and<br>

              >> monitor the wss cluster state using monitor

              action.<br>

              > This is very interesting, thank you! I believe all

              commands for MySQL RA<br>

              > OCF script should be as well wrapped with timeout

              -SIGTERM or -SIGKILL<br>

              > as we did for MQ RA OCF. And there should not be any

              sleep calls. I<br>

              > created a bug for this [6].<br>

              ><br>

              > [6] <a moz-do-not-send="true"

                href="https://bugs.launchpad.net/fuel/+bug/1449542"

                target="_blank">https://bugs.launchpad.net/fuel/+bug/1449542</a><br>

              <br>

            </span>Thank you! We might not be able to avoid all the sleep

            calls, but I agree most of<br>

            the commands can be put in a timeout wrapper to prevent

            unexpected stalls.<br>
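            <br>

            A minimal sketch of such a wrapper, assuming the GNU coreutils
            "timeout" command is available on the controllers.
            run_with_limit is a hypothetical name, not a function from the
            resource agents.<br>

            <pre>
```shell
# Hypothetical wrapper: run a command and send it SIGTERM if it is still
# running after LIMIT seconds.
# Usage: run_with_limit LIMIT command [args...]
run_with_limit() {
    limit_s=$1
    shift
    timeout -s TERM "$limit_s" "$@"
}
```
</pre>

            GNU timeout exits with status 124 when the limit is hit, so the
            caller can tell a stall from an ordinary failure of the
            command.<br>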

            <div>

              <div class="h5">>> I also find a fix to improve

                MySQL start timeout [3]. It shortens the<br>

                >> timeout to 300s. At the time I am sending this

                email, I can not find it in<br>

                >> stable/6.0 branch. Maybe the maintainer needs

                to cherry-pick it to<br>

                >> stable/6.0 ?<br>

                >><br>

                >> [1] <a moz-do-not-send="true"

                  href="https://bugs.launchpad.net/fuel/+bug/1441885"

                  target="_blank">https://bugs.launchpad.net/fuel/+bug/1441885</a><br>

                >> [2] <a moz-do-not-send="true"

href="http://lists.linux-ha.org/pipermail/linux-ha/2014-March/047989.html"

                  target="_blank">http://lists.linux-ha.org/pipermail/linux-ha/2014-March/047989.html</a><br>

                >> [3] <a moz-do-not-send="true"

                  href="https://review.openstack.org/#/c/171333/"

                  target="_blank">https://review.openstack.org/#/c/171333/</a><br>

                >><br>

                >><br>

                >> 2. RabbitMQ Resource Agent Breaks Existing

                Cluster<br>

                >><br>

                >> Reading the code of the RabbitMQ resource agent, I

                find it does the<br>

                >> following to start RabbitMQ master-slave

                cluster.<br>

                >> On all the controllers:<br>

                >> (1) Start Erlang beam process<br>

                >> (2) Start RabbitMQ App (If failed, reset mnesia

                DB and cluster state)<br>

                >> (3) Stop RabbitMQ App but do not stop the beam

                process<br>

                >><br>

                >> Then in pacemaker, all the RabbitMQ instances

                are in slave state. After<br>

                >> pacemaker determines the master, it does the

                following.<br>

                >> On the to-be-master host:<br>

                >> (4) Start RabbitMQ App (If failed, reset mnesia

                DB and cluster state)<br>

                >> On the slaves hosts:<br>

                >> (5) Start RabbitMQ App (If failed, reset mnesia

                DB and cluster state)<br>

                >> (6) Join RabbitMQ cluster of the master host<br>

                >><br>

                > Yes, something like that. As I mentioned, there

                were several bug fixes<br>

                > in the 6.1 dev, and you can also check the MQ

                clustering flow charts.<br>

                ><br>

                >> As far as I can understand, this process is to

                make sure the master<br>

                >> determined by pacemaker is the same as the

                master determined in RabbitMQ<br>

                >> cluster. If there is no existing cluster, it's

                fine. If it is run<br>

                > after<br>

                ><br>

                > Not exactly. There is no master in mirrored MQ

                cluster. We define the<br>

                > rabbit_hosts configuration option from

                Oslo.messaging. What ensures all<br>

                > queue masters will be spread around all of MQ nodes

                in a long run. And<br>

                > we use a master abstraction only for the Pacemaker

                RA clustering layer.<br>

                > Here, a "master" is the MQ node what joins the rest

                of the MQ nodes.<br>

                ><br>

              </div>

            </div>

            Thank you very much for explaining this. I just had a look at

            the output<br>

            of "rabbitmqctl list_queues name slave_pids

            synchronised_slave_pids", and<br>

            it is indeed as you said.<br>

            <span class="">>> power failure and recovery, it

              introduces a new problem.<br>

              > We do erase the node master attribute in CIB for such

              cases. This should<br>

              > not bring problems into the master election logic.<br>

              <br>

            </span>The problem is described at the beginning of this mail.<br>

            <span class="">>> After power recovery, if some of the

              RabbitMQ instances reach step (2)<br>

              >> roughly at the same time (within 30s which is

              hard coded in RabbitMQ) as<br>

              >> the original RabbitMQ master instance, they form

              the original cluster<br>

              >> again and then shut down. The other instances

              would have to wait for 30s<br>

              >> before they report failure waiting for tables, and

              are reset to a<br>

              >> standalone cluster.<br>

              >><br>

              >> In RabbitMQ documentation [4], it is also

              mentioned that if we shutdown<br>

              >> RabbitMQ master, a new master is elected from the

              rest of slaves. If we<br>

              > (Note, the RabbitMQ documentation mentions *queue*

              masters and slaves,<br>

              > which are not the case for the Pacemaker RA

              clustering abstraction layer.)<br>

            </span>Thank you for clarifying this.<br>

            <span class="">>> continue to shutdown nodes in step

              (3), we reach a point that the last<br>

              >> node is the RabbitMQ master, and pacemaker is not

              aware of it. I can see<br>

              >> there is code to bookkeeping a

              "rabbit-start-time" attribute in<br>

              >> pacemaker to record the most long lived instance

              to help pacemaker<br>

              >> determine the master, but it does not cover the

              case mentioned above.<br>

              > We made an assumption what the node with the highest

              MQ uptime should<br>

              > know the most about recent cluster state, so other

              nodes must join it.<br>

              > RA OCF does not work with queue masters directly.<br>

            </span>OK. However, I still observed the

            "timeout_waiting_for_tables" exception in<br>

            the RabbitMQ log. It's not related to the queue master though,

            sorry. It should<br>

            be related to the order of shutting down and starting up

            RabbitMQ<br>

            (actually the underlying mnesia). So the previous statement

            should be<br>

            changed to the following.<br>

            <br>

            If we continue to shut down nodes in step (3), the RabbitMQ

            instances that are taken down earlier remember that their

            mnesia tables should be synced from other instances on the

            next boot, while the last RabbitMQ instance considers itself

            the table sync source, and pacemaker is not aware of any of

            this.<br>

            <br>

            I can see there is code that keeps a "rabbit-start-time"

            attribute in pacemaker to record the longest-lived

            instance to help pacemaker determine the master, but it does

            not cover the case mentioned above. So chances are that the

            pacemaker master is not the last instance shut down, and it

            then runs into "timeout_waiting_for_tables" during a

            promotion.<br>

            <span class=""><br>

              >> A<br>

              >> recent patch [5] checks existing "rabbit-master"

              attribute but it<br>

              >> neither cover the above case.<br>

              >><br>

              >> So in step (4), pacemaker determines a different

              master which was a<br>

              >> RabbitMQ slave last time. It would wait for its

              original RabbitMQ master<br>

              >> for 30s and fail, then it gets reset to a

              standalone cluster. Here we<br>

              >> get some different clusters, so in step (5) and

              (6), it is likely to<br>

              >> report error in log saying timeout waiting for

              tables or fail to merge<br>

              >> mnesia database schema, then those instances

              get reset. You can<br>

              >> easily re-produce the case by hard resetting

              power of all the controllers.<br>

              >><br>

              >> As you can see, if you are unlucky, there would

              be several "30s timeout<br>

              >> and reset" before you finally get a healthy

              RabbitMQ cluster.<br>

              > The full MQ cluster reassemble logic is far from the

              perfect state,<br>

              > indeed. This might erase all mnesia files, hence any

              custom entities,<br>

              > like users or vhosts, would be removed as well. Note,

              we do not<br>

              > configure durable queues for Openstack so there is

              nothing to care about<br>

              > here - the full cluster downtime assumes there will

              be no AMQP messages<br>

              > stored at all.<br>

              <br>

            </span>I also noticed that we don't have durable queues, which

            is why I think the<br>

            "force_load" trick and "rabbitmqctl force_boot" are OK.<br>

            <div>

              <div class="h5">>> I find three possible solutions.<br>

                >> A. Using rabbitmqctl force_boot option [6]<br>

                >> It skips waiting for 30s and resetting

                the cluster, and just assumes the<br>

                >> current node is the master and continues to

                operate. This is feasible<br>

                >> because the original RabbitMQ master would

                discard its local state and<br>

                >> sync with the new master after it joins a new

                cluster [7]. So we can be<br>

                >> sure that after step (4) and (6), the pacemaker

                determined master<br>

                >> instance is started unconditionally, and it

                will be the same as RabbitMQ<br>

                >> master, and all operations run without 30s

                timeout. I find this option<br>

                >> is only available in newer RabbitMQ release,

                and updating RabbitMQ might<br>

                >> introduce other compatibility problems.<br>

                > Yes, this option is only supported for newest

                RabbitMQ versions. But we<br>

                > definitely should look how this could help.<br>

                ><br>

                >> B. Turn RabbitMQ into cloned instance and use

                pause_minority instead of<br>

                >> autoheal [8]<br>

                > Indeed, there are cases when MQ's autoheal can do

                nothing with existing<br>

                > partitions and remains partitioned for ever, for

                example:<br>

                ><br>

                > Masters: [ node-1 ]<br>

                > Slaves: [ node-2 node-3 ]<br>

                > root@node-1:~# rabbitmqctl cluster_status<br>

                > Cluster status of node 'rabbit@node-1' ...<br>

                >

                [{nodes,[{disc,['rabbit@node-1','rabbit@node-2']}]},<br>

                > {running_nodes,['rabbit@node-1']},<br>

                > {cluster_name,&lt;&lt;"rabbit@node-2"&gt;&gt;},<br>

                > {partitions,[]}]<br>

                > ...done.<br>

                > root@node-2:~# rabbitmqctl cluster_status<br>

                > Cluster status of node 'rabbit@node-2' ...<br>

                > [{nodes,[{disc,['rabbit@node-2']}]}]<br>

                > ...done.<br>

                > root@node-3:~# rabbitmqctl cluster_status<br>

                > Cluster status of node 'rabbit@node-3' ...<br>

                >

                [{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},<br>

                > {running_nodes,['rabbit@node-3']},<br>

                > {cluster_name,&lt;&lt;"rabbit@node-2"&gt;&gt;},<br>

                > {partitions,[]}]<br>

              </div>

            </div>

            This is terrible. It looks like a RabbitMQ bug. I am not sure

            whether the "force_load"<br>

            trick and "rabbitmqctl force_boot" introduce new problems

            in such a case.<br>

            <span class="">> So we should test the pause-minority

              value as well.<br>

              > But I strongly believe we should make MQ multi-state

              clone to support<br>

              > many masters, related bp [7]<br>

              ><br>

              > [7]<br>

              > <a moz-do-not-send="true"

href="https://blueprints.launchpad.net/fuel/+spec/rabbitmq-pacemaker-multimaster-clone"

                target="_blank">https://blueprints.launchpad.net/fuel/+spec/rabbitmq-pacemaker-multimaster-clone</a><br>

            </span>Looks good. It seems enabling pause-minority does not

            conflict with a<br>

            multi-master RabbitMQ cluster, so maybe we can take this into<br>

            consideration when doing this BP. It would be nice to have RAM

            nodes to improve<br>

            performance.<br>

            <div>

              <div class="h5"><br>

                >> This works like MySQL-wss. It let RabbitMQ

                cluster itself deal with<br>

                >> partition in a manner similar to pacemaker

                quorum mechanism. When there<br>

                >> is network partition, instances in the minority

                partition pauses<br>

                >> themselves automatically. Pacemaker does not

                have to track who is the<br>

                >> RabbitMQ master, who lives longest, who to

                promote... It just starts all<br>

                >> the clones, done. This leads to huge change in

                RabbitMQ resource agent,<br>

                >> and the stability and other impact is to be

                tested.<br>

                > Well, we should not mess the queue masters and

                multi-clone master for MQ<br>

                > resource in the pacemaker.<br>

                > As I said, pacemaker RA has nothing to do with

                queue masters. And we<br>

                > introduced this "master" mostly in order to support

                the full cluster<br>

                > reassemble case - there must be a node promoted and

                other nodes should join.<br>

                ><br>

                >> C. Creating a "force_load" file<br>

                >> After reading RabbitMQ source code, I find that

                the actual thing it does<br>

                >> in solution A is just creating an empty file

                named "force_load" in<br>

                >> mnesia database dir, then mnesia thinks it is

                the last node shut down in<br>

                >> the last time and boot itself as the master.

                This implementation keeps<br>

                >> the same from v3.1.4 to the latest RabbitMQ

                master branch. I think we<br>

                >> can make use of this little trick. The change

                is adding just one line in<br>

                >> "try_to_start_rmq_app()" function.<br>

                >><br>

                >> touch "${MNESIA_FILES}/force_load" && \<br>

                >>   chown rabbitmq:rabbitmq

                "${MNESIA_FILES}/force_load"<br>

                > This is a very good point, thank you.<br>

                ><br>

                >> [4] <a moz-do-not-send="true"

                  href="http://www.rabbitmq.com/ha.html" target="_blank">http://www.rabbitmq.com/ha.html</a><br>

                >> [5] <a moz-do-not-send="true"

                  href="https://review.openstack.org/#/c/169291/"

                  target="_blank">https://review.openstack.org/#/c/169291/</a><br>

                >> [6] <a moz-do-not-send="true"

                  href="https://www.rabbitmq.com/clustering.html"

                  target="_blank">https://www.rabbitmq.com/clustering.html</a><br>

                >> [7] <a moz-do-not-send="true"

                  href="http://www.rabbitmq.com/partitions.html#recovering"

                  target="_blank">http://www.rabbitmq.com/partitions.html#recovering</a><br>

                >> [8] <a moz-do-not-send="true"

                  href="http://www.rabbitmq.com/partitions.html#automatic-handling"

                  target="_blank">http://www.rabbitmq.com/partitions.html#automatic-handling</a><br>

                >><br>

                >> Maybe you have better ideas on this. Please

                share your thoughts.<br>

                > Thank you for a thorough feedback! This was a

                really great job.<br>

              </div>

            </div>

            Thank you for such a good explanation. I was not clear about

            the queue master<br>

            and mistook it for the mnesia sync source.<br>

            <div class="HOEnZb">

              <div class="h5">>> ----<br>

                >> Best wishes!<br>

                >> Zhou Zheng Sheng / 周征晟  Software Engineer<br>

                >> Beijing AWcloud Software Co., Ltd.<br>

                >><br>

                <br>

                <br>

              </div>

            </div>

          </blockquote>

        </div>

        <br>

      </div>

      <br>

      <fieldset class="mimeAttachmentHeader"></fieldset>

      <br>

      <pre wrap="">__________________________________________________________________________

OpenStack Development Mailing List (not for usage questions)

Unsubscribe: <a class="moz-txt-link-abbreviated" href="mailto:OpenStack-dev-request@lists.openstack.org?subject:unsubscribe">OpenStack-dev-request@lists.openstack.org?subject:unsubscribe</a>

<a class="moz-txt-link-freetext" href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a></pre>

    </blockquote>

  </body>

</html>