Open Stack

Tue May 5 06:25:14 UTC 2015

> On 5 May 2015, at 2:31 pm, Zhou Zheng Sheng / 周征晟 <zhengsheng at awcloud.com> wrote:
> 
> Thank you Bogdan for clarifying the pacemaker promotion process for me.
> 
> on 2015/05/05 10:32, Andrew Beekhof wrote:
>>> On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 <zhengsheng at awcloud.com> wrote:
>> [snip]
>> 
>>> Batch is a pacemaker concept I found when I was reading its
>>> documentation and code. There is a "batch-limit: 30" in the output of
>>> "pcs property list --all". The pacemaker official documentation
>>> explanation is that it's "The number of jobs that the TE is allowed to
>>> execute in parallel." From my understanding, pacemaker maintains cluster
>>> states, and when we start/stop/promote/demote a resource, it triggers a
>>> state transition. Pacemaker puts as many transition jobs as possible
>>> into a batch and processes them in parallel.
>> Technically it calculates an ordered graph of actions that need to be performed for a set of related resources.
>> You can see an example of the kinds of graphs it produces at:
>> 
>>   http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html
>> 
>> There is a more complex one which includes promotion and demotion on the next page.
>> 
>> The number of actions that can run at any one time is therefore limited by
>> - the value of batch-limit (the total number of in-flight actions)
>> - the number of resources that do not have ordering constraints between them (e.g. rsc{1,2,3} in the above example)
>> 
>> So in the above example, if batch-limit >= 3, the monitor_0 actions will still all execute in parallel.
>> If batch-limit == 2, one of them will be deferred until the others complete.
>> 
>> Processing of the graph stops the moment any action returns a value that was not expected.
>> If that happens, we wait for currently in-flight actions to complete, re-calculate a new graph based on the new information and start again.
> So can I infer the following statement? In a big cluster with many
> resources, chances are some resource agent actions return unexpected
> values,

The size of the cluster shouldn’t increase the chance of this happening unless you’ve set the timeouts too aggressively.

> and if any of the in-flight action timeout is long, it would
> block pacemaker from re-calculating a new transition graph?

Yes, but it’s actually an argument for making the timeouts longer, not shorter.
Setting the timeouts too aggressively actually increases downtime because of all the extra delays and recovery it induces.
So set them to be long enough that there is unquestionably a problem if you hit them.

But we absolutely recognise that starting/stopping a database can take a very long time comparatively and that it shouldn’t block recovery of other unrelated services.
I would expect a fix for this to land in Pacemaker 1.1.14.

> I see the
> current batch-limit is 30 and I tried to increase it to 100, but it did
> not help.

Correct.  It only puts an upper limit on the number of in-flight actions; actions still need to wait for all their dependencies to complete before executing.
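
To illustrate why raising batch-limit did not help, here is a minimal Python sketch of the idea, not Pacemaker's actual scheduler. It assumes every action takes exactly one "round" to complete, and the function name run_transition and the action names are hypothetical. Each round it runs up to batch_limit actions whose dependencies have all completed.

```python
def run_transition(deps, batch_limit):
    """Count the rounds needed to run every action in `deps`.

    `deps` maps each action name to the set of actions it must wait for.
    Assumption: every action takes exactly one round to complete.
    """
    done = set()
    pending = set(deps)
    rounds = 0
    while pending:
        # An action is ready only when all of its dependencies are done.
        ready = sorted(a for a in pending if deps[a] <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        # batch_limit only caps how many ready actions run at once.
        batch = ready[:batch_limit]
        done.update(batch)
        pending.difference_update(batch)
        rounds += 1
    return rounds


# Three probes with no ordering constraints between them.
independent = {
    "rsc1_monitor_0": set(),
    "rsc2_monitor_0": set(),
    "rsc3_monitor_0": set(),
}

# Three actions that must run strictly one after another.
chain = {
    "a_start": set(),
    "b_start": {"a_start"},
    "c_start": {"b_start"},
}
```

With these inputs, run_transition(independent, 3) finishes in one round and run_transition(independent, 2) needs two, matching the batch-limit example quoted earlier. The chain needs three rounds whether batch_limit is 30 or 100, because the ordering, not the limit, is the bottleneck.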

> I'm sure that the cloned MySQL Galera resource is not related to
> master-slave RabbitMQ resource. I don't find any dependency, order or
> rule connecting them in the cluster deployed by Fuel [1].

In general it should not have needed to wait, but if you send me a crm_report covering the period you’re talking about I’ll be able to comment specifically about the behaviour you saw.

> 
> Is there anything I can do to make sure all the resource actions return
> expected values in a full reassembling?

In general, if we say ‘start’, do your best to start or return ‘0’ if you already were started.
Likewise for stop.
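
As a rough illustration of that contract, here is a minimal Python sketch of idempotent start and stop handlers. Real resource agents are usually shell scripts; ToyService and its methods are hypothetical stand-ins for whatever the agent manages. The return values 0 (OCF_SUCCESS) and 1 (OCF_ERR_GENERIC) are the standard OCF exit codes.

```python
OCF_SUCCESS = 0      # standard OCF "success" exit code
OCF_ERR_GENERIC = 1  # standard OCF "generic error" exit code


class ToyService:
    """Hypothetical stand-in for the service a resource agent manages."""

    def __init__(self):
        self.running = False

    def launch(self):
        self.running = True

    def shutdown(self):
        self.running = False


def start(service):
    # Already started: report success instead of failing or restarting.
    if service.running:
        return OCF_SUCCESS
    service.launch()
    return OCF_SUCCESS if service.running else OCF_ERR_GENERIC


def stop(service):
    # Already stopped: report success as well.
    if not service.running:
        return OCF_SUCCESS
    service.shutdown()
    return OCF_SUCCESS if not service.running else OCF_ERR_GENERIC
```

Calling start twice in a row returns 0 both times, and the same holds for stop, so a repeated request never produces an unexpected return value that would abort the transition.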

Otherwise it’s really specific to your agent.
For example, an IP resource just needs to add itself to an interface. It can’t do much differently; if it times out, then the system must be very, very busy.

The only other thing I would say is:
- avoid blocking calls where possible
- have empathy for the machine (do as little as is needed)

> Is it because node-1 and node-2
> happen to boot faster than node-3 and form a cluster, when node-3 joins,
> it triggers a new state transition? Or maybe because some resources are
> already started, so pacemaker needs to stop them first?

We only stop them if they shouldn’t yet be running (i.e. a colocation or ordering dependency has not yet been started either).

> Does setting
> default-resource-stickiness to 1 help?

From 0 or INFINITY?

> 
> I also tried "crm history XXX" commands in a live and correct cluster,

I’m not familiar with that tool anymore.

> but didn't find much information. I can see there are many log entries
> like "run_graph: Transition 7108 ...". Next I'll inspect the pacemaker
> log to see which resource action returns the unexpected value or which
> thing triggers new state transition.
> 
> [1] http://paste.openstack.org/show/214919/

I’d not recommend mixing the two CLI tools.

> 
>>> The problem is that pacemaker can only promote a resource after it
>>> detects the resource is started.
>> First we do a non-recurring monitor (*_monitor_0) to check what state the resource is in.
>> We can’t assume it’s off because a) we might have crashed, b) the admin might have accidentally configured it to start at boot or c) the admin may have asked us to re-check everything.
>> 
>>> During a full reassemble, in the first
>>> transition batch, pacemaker starts all the resources including MySQL and
>>> RabbitMQ. Pacemaker issues resource agent "start" invocation in parallel
>>> and reaps the results.
>>> 
>>> For a multi-state resource agent like RabbitMQ, pacemaker needs the
>>> start result reported in the first batch, then transition engine and
>>> policy engine decide if it has to retry starting or promote, and put
>>> this new transition job into a new batch.
>> Also important to know, the order of actions is:
>> 
>> 1. any necessary demotions
>> 2. any necessary stops
>> 3. any necessary starts
>> 4. any necessary promotions
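
The phase order listed above can be sketched in a few lines of Python. This only illustrates the demote, stop, start, promote sequence; it is not how Pacemaker computes its graph, which also honours ordering and colocation constraints. The resource names and the (resource, operation) tuple format are hypothetical.

```python
# Phase order taken from the list above.
PHASE_ORDER = {"demote": 0, "stop": 1, "start": 2, "promote": 3}


def order_actions(actions):
    """Sort (resource, operation) pairs by phase.

    sorted() is stable, so actions in the same phase keep their input order.
    """
    return sorted(actions, key=lambda action: PHASE_ORDER[action[1]])


requested = [
    ("p_rabbitmq", "promote"),
    ("p_mysql", "start"),
    ("p_rabbitmq", "stop"),
    ("p_rabbitmq", "demote"),
]
ordered = order_actions(requested)
```

Here ordered lists the demote first, then the stop, then the start, and the promote last.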
>> 
>> 
>> 
>> __________________________________________________________________________
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> 
> -- 
> Best wishes!
> Zhou Zheng Sheng / 周征晟  Software Engineer
> Beijing AWcloud Software Co., Ltd.
> 

[openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
