Open Stack

Tue May 5 04:31:39 UTC 2015

Thank you, Bogdan, for clarifying the pacemaker promotion process for me.

on 2015/05/05 10:32, Andrew Beekhof wrote:
>> On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 <zhengsheng at awcloud.com> wrote:
> [snip]
>
>> Batch is a pacemaker concept I found when I was reading its
>> documentation and code. There is a "batch-limit: 30" in the output of
>> "pcs property list --all". The pacemaker official documentation
>> explanation is that it's "The number of jobs that the TE is allowed to
>> execute in parallel." From my understanding, pacemaker maintains cluster
>> states, and when we start/stop/promote/demote a resource, it triggers a
>> state transition. Pacemaker puts as many transition jobs as possible
>> into a batch, and processes them in parallel.
> Technically it calculates an ordered graph of actions that need to be performed for a set of related resources.
> You can see an example of the kinds of graphs it produces at:
>
>    http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html
>
> There is a more complex one which includes promotion and demotion on the next page.
>
> The number of actions that can run at any one time is therefore limited by
> - the value of batch-limit (the total number of in-flight actions)
> - the number of resources that do not have ordering constraints between them (eg. rsc{1,2,3} in the above example)  
>
> So in the above example, if batch-limit >= 3, the monitor_0 actions will still all execute in parallel.
> If batch-limit == 2, one of them will be deferred until the others complete.
>
> Processing of the graph stops the moment any action returns a value that was not expected.
> If that happens, we wait for currently in-flight actions to complete, re-calculate a new graph based on the new information and start again.
So can I infer the following? In a big cluster with many resources,
chances are that some resource agent actions return unexpected values,
and if any in-flight action has a long timeout, it blocks pacemaker
from re-calculating a new transition graph. I see the current
batch-limit is 30; I tried increasing it to 100, but that did not
help. I'm sure that the cloned MySQL Galera resource is not related to
the master-slave RabbitMQ resource. I can't find any dependency, order
or rule connecting them in the cluster deployed by Fuel [1].
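
The limits Andrew describes above (batch-limit plus ordering
constraints) can be illustrated with a small Python sketch. It is a toy
model, not pacemaker's transition engine: the function name, the
dependency dictionary and the round-based execution are assumptions made
for illustration. Real pacemaker would start a deferred action as soon
as any in-flight action completes, rather than waiting for a whole
round.

```python
def run_graph(actions, deps, batch_limit):
    """Return the rounds in which actions would run (toy model).

    actions: list of action names, in a stable order.
    deps: dict mapping an action to the set of actions it must wait for.
    batch_limit: maximum number of actions in flight at once.
    """
    if batch_limit < 1:
        raise ValueError("batch_limit must be at least 1")
    done = set()
    pending = list(actions)
    rounds = []
    while pending:
        # An action is ready once everything it is ordered after is done.
        ready = [a for a in pending if deps.get(a, set()) <= done]
        if not ready:
            raise ValueError("ordering constraints contain a cycle")
        batch = ready[:batch_limit]  # never exceed the in-flight limit
        rounds.append(batch)
        done.update(batch)
        pending = [a for a in pending if a not in done]
    return rounds


# Three unrelated probes, as in the rsc{1,2,3} example.
probes = ["rsc1_monitor_0", "rsc2_monitor_0", "rsc3_monitor_0"]
print(run_graph(probes, {}, batch_limit=3))  # all three in one round
print(run_graph(probes, {}, batch_limit=2))  # one probe is deferred
```

In this model, raising batch_limit beyond the number of unordered
actions changes nothing, which matches the observation that going from
30 to 100 did not help.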

Is there anything I can do to make sure all the resource actions return
expected values during a full reassembly? Is it because node-1 and
node-2 happen to boot faster than node-3 and form a cluster, so that
when node-3 joins, it triggers a new state transition? Or maybe because
some resources are already started, so pacemaker needs to stop them
first? Would setting default-resource-stickiness to 1 help?

I also tried the "crm history XXX" commands in a live, healthy cluster,
but didn't find much information. I can see there are many log entries
like "run_graph: Transition 7108 ...". Next I'll inspect the pacemaker
log to see which resource action returns the unexpected value or what
triggers a new state transition.
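
As a starting point for that log inspection, a short Python sketch can
pull the transition numbers out of the log. It assumes only that the
entries contain the fragment "run_graph: Transition <number>" quoted
above; the rest of each sample line is made up for illustration and is
not real pacemaker output.

```python
import re

# Matches the fragment seen in the log, e.g. "run_graph: Transition 7108".
TRANSITION_RE = re.compile(r"run_graph:\s+Transition\s+(\d+)")


def transition_ids(lines):
    """Return the transition numbers found in an iterable of log lines."""
    ids = []
    for line in lines:
        match = TRANSITION_RE.search(line)
        if match:
            ids.append(int(match.group(1)))
    return ids


# Hypothetical sample lines; only the "run_graph: Transition N" part
# comes from the log entries mentioned in this message.
sample = [
    "run_graph: Transition 7108 ...",
    "some unrelated log line",
    "run_graph: Transition 7109 ...",
]
print(transition_ids(sample))
```

Passing an open file object instead of the sample list works the same
way, since the function only iterates over lines.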

[1] http://paste.openstack.org/show/214919/

>> The problem is that pacemaker can only promote a resource after it
>> detects the resource is started.
> First we do a non-recurring monitor (*_monitor_0) to check what state the resource is in.
> We can’t assume it's off because a) we might have crashed, b) the admin might have accidentally configured it to start at boot or c) the admin may have asked us to re-check everything.
>
>> During a full reassemble, in the first
>> transition batch, pacemaker starts all the resources including MySQL and
>> RabbitMQ. Pacemaker issues resource agent "start" invocation in parallel
>> and reaps the results.
>>
>> For a multi-state resource agent like RabbitMQ, pacemaker needs the
>> start result reported in the first batch, then transition engine and
>> policy engine decide if it has to retry starting or promote, and put
>> this new transition job into a new batch.
> Also important to know, the order of actions is:
>
> 1. any necessary demotions
> 2. any necessary stops
> 3. any necessary starts
> 4. any necessary promotions
>
>
>
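
The action order listed above (demote, stop, start, promote) can be
expressed as a simple sort key. This is only a sketch of the ordering
rule: the resource names and the (resource, action) tuple layout are
hypothetical, and pacemaker itself computes a full action graph rather
than a flat sorted list.

```python
# Rank of each action type, following the order given in the list above.
ACTION_ORDER = {"demote": 0, "stop": 1, "start": 2, "promote": 3}


def order_actions(actions):
    """Sort (resource, action) pairs into demote/stop/start/promote order.

    sorted() is stable, so pairs with the same action type keep their
    original relative order.
    """
    return sorted(actions, key=lambda pair: ACTION_ORDER[pair[1]])


# Hypothetical pending actions for two resources.
pending = [
    ("rabbitmq", "promote"),
    ("mysql", "start"),
    ("rabbitmq", "start"),
    ("rabbitmq", "demote"),
    ("mysql", "stop"),
]
for resource, action in order_actions(pending):
    print(action, resource)
```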
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

-- 
Best wishes!
Zhou Zheng Sheng / 周征晟  Software Engineer
Beijing AWcloud Software Co., Ltd.

[openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
