[openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

Andrew Beekhof abeekhof at redhat.com
Fri May 8 00:51:04 UTC 2015


> On 5 May 2015, at 9:30 pm, Zhou Zheng Sheng / 周征晟 <zhengsheng at awcloud.com> wrote:
> 
> Thank you Andrew. Sorry for misspelling your name in the previous email.
> 
> on 2015/05/05 14:25, Andrew Beekhof wrote:
>>> On 5 May 2015, at 2:31 pm, Zhou Zheng Sheng / 周征晟 <zhengsheng at awcloud.com> wrote:
>>> 
>>> Thank you Bogdan for clarifying the pacemaker promotion process for me.
>>> 
>>> on 2015/05/05 10:32, Andrew Beekhof wrote:
>>>>> On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 <zhengsheng at awcloud.com> wrote:
>>>> [snip]
>>>> 
>>>>> Batch is a pacemaker concept I found when I was reading its
>>>>> documentation and code. There is a "batch-limit: 30" in the output of
>>>>> "pcs property list --all". The official pacemaker documentation
>>>>> explains it as "The number of jobs that the TE is allowed to
>>>>> execute in parallel." From my understanding, pacemaker maintains cluster
>>>>> state, and when we start/stop/promote/demote a resource, it triggers a
>>>>> state transition. Pacemaker puts as many transition jobs as possible
>>>>> into a batch, and processes them in parallel.
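>>>>> 
>>>>> For reference, the property can be inspected and raised like this
>>>>> (pcs syntax; the exact form may vary between pcs versions):
>>>>> 
>>>>>    pcs property list --all | grep batch-limit
>>>>>    pcs property set batch-limit=100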
>>>> Technically it calculates an ordered graph of actions that need to be performed for a set of related resources.
>>>> You can see an example of the kinds of graphs it produces at:
>>>> 
>>>>  http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html
>>>> 
>>>> There is a more complex one which includes promotion and demotion on the next page.
>>>> 
>>>> The number of actions that can run at any one time is therefore limited by
>>>> - the value of batch-limit (the total number of in-flight actions)
>>>> - the number of resources that do not have ordering constraints between them (e.g. rsc{1,2,3} in the above example)
>>>> 
>>>> So in the above example, if batch-limit >= 3, the monitor_0 actions will still all execute in parallel.
>>>> If batch-limit == 2, one of them will be deferred until the others complete.
>>>> 
>>>> Processing of the graph stops the moment any action returns a value that was not expected.
>>>> If that happens, we wait for currently in-flight actions to complete, re-calculate a new graph based on the new information and start again.
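>>>> 
>>>> If you want to see the graph pacemaker computes for your own cluster,
>>>> crm_simulate can dump it (graphviz is only needed for the rendering
>>>> step):
>>>> 
>>>>    crm_simulate --live-check --save-dotfile=transition.dot
>>>>    dot -Tsvg transition.dot -o transition.svg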
>>> So can I infer the following statement? In a big cluster with many
>>> resources, chances are some resource agent actions return unexpected
>>> values,
>> The size of the cluster shouldn’t increase the chance of this happening unless you’ve set the timeouts too aggressively.
> 
> If there are many types of resource agents, and any one of them is not
> well written, it might cause trouble, right?

Yes, but really only for the things that depend on it.

For example if resources B, C, D, E all depend (in some way) on A, then their startup is going to be delayed.
But F, G, H and J will be able to start while we wait around for A to time out.
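
To make that concrete, that pattern would come from ordering constraints
along these lines (the resource names here are hypothetical):

   pcs constraint order start A then start B
   pcs constraint order start A then start C
   pcs constraint order start A then start D
   pcs constraint order start A then start E
   # F, G, H and J have no constraints involving A, so they start freely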

> 
>>> and if any in-flight action's timeout is long, it would
>>> block pacemaker from re-calculating a new transition graph?
>> Yes, but it's actually an argument for making the timeouts longer, not shorter.
>> Setting the timeouts too aggressively actually increases downtime because of all the extra delays and recovery it induces.
>> So set them to be long enough that there is unquestionably a problem if you hit them.
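>> 
>> As a sketch, with pcs that means something like this (the resource name
>> and timeout values are illustrative only):
>> 
>>    pcs resource update p_rabbitmq-server op start timeout=300s op stop timeout=120s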
>> 
>> But we absolutely recognise that starting/stopping a database can take a very long time comparatively and that it shouldn’t block recovery of other unrelated services.
>> I would expect to see this land in Pacemaker 1.1.14.
> 
> It will be great to see this in Pacemaker 1.1.14. From my experience
> using Pacemaker, I think customized resource agents are possibly the
> weakest part.

This is why we encourage people wanting new agents to get involved with the upstream resource-agents project :-)

> This feature should improve the handling for resource
> action timeouts.
> 
>>> I see the
>>> current batch-limit is 30 and I tried increasing it to 100, but it did not
>>> help.
>> Correct.  It only puts an upper limit on the number of in-flight actions; actions still need to wait for everything they depend on to complete before executing.
>> 
>>> I'm sure that the cloned MySQL Galera resource is not related to the
>>> master-slave RabbitMQ resource. I don't find any dependency, order or
>>> rule connecting them in the cluster deployed by Fuel [1].
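>>> (For anyone wanting to check the same thing, the full constraint set
>>> can be listed with, for example, "pcs constraint show --full".)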
>> In general it should not have needed to wait, but if you send me a crm_report covering the period you’re talking about I’ll be able to comment specifically about the behaviour you saw.
> 
> You are very nice, thank you. I uploaded the file generated by
> crm_report to google drive.
> 
> https://drive.google.com/file/d/0B_vDkYRYHPSIZ29NdzV3NXotYU0/view?usp=sharing

Hmmm... there are no logs included here for some reason.
I suspect it's a bug on my part; can you apply this patch to report.collector on the machine you’re running crm_report from and retry?

   https://github.com/ClusterLabs/pacemaker/commit/96427ec


> 
>>> Is there anything I can do to make sure all the resource actions return
>>> expected values in a full reassembly?
>> In general, if we say ‘start’, do your best to start, or return ‘0’ if you were already started.
>> Likewise for stop.
>> 
>> Otherwise it's really specific to your agent.
>> For example an IP resource just needs to add itself to an interface - it can't do much differently; if it times out then the system must be very, very busy.
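>> 
>> A minimal sketch of that idempotent start/stop pattern in a shell OCF
>> agent ("my_monitor" and the service commands are placeholders; a real
>> agent would source ocf-shellfuncs for the OCF_* return codes):
>> 
>>    my_start() {
>>        if my_monitor; then
>>            return $OCF_SUCCESS    # already running: report success
>>        fi
>>        start_my_service
>>        my_monitor && return $OCF_SUCCESS || return $OCF_ERR_GENERIC
>>    }
>> 
>>    my_stop() {
>>        if ! my_monitor; then
>>            return $OCF_SUCCESS    # already stopped: report success
>>        fi
>>        stop_my_service
>>        my_monitor && return $OCF_ERR_GENERIC || return $OCF_SUCCESS
>>    }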
>> 
>> The only other thing I would say is:
>> - avoid blocking calls where possible
>> - have empathy for the machine (do as little as is needed)
>> 
> 
> +1 for the empathy :)
>>> Is it because node-1 and node-2
>>> happen to boot faster than node-3 and form a cluster, so that when node-3 joins,
>>> it triggers a new state transition? Or maybe because some resources are
>>> already started, so pacemaker needs to stop them first?
>> We only stop them if they shouldn’t yet be running (i.e. a colocation or ordering dependency is not yet started also).
>> 
>> 
>>> Does setting
>>> default-resource-stickiness to 1 help?
>> From 0 or INFINITY?
> 
> From 0 to 1. Is it enough for preventing the resource from being moved
> when some nodes recovered from power failure?

From 0 it would help.
But potentially consider INFINITY if the only circumstance in which you want something moved is when the node is unavailable (either because it's dead or in standby mode).
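
As a sketch, with pcs that would be something like the following (the
exact syntax varies a little between pcs versions):

   pcs resource defaults resource-stickiness=INFINITY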

> 
>>> I also tried "crm history XXX" commands in a live and healthy cluster,
>> I’m not familiar with that tool anymore.
>> 
>>> but didn't find much information. I can see there are many log entries
>>> like "run_graph: Transition 7108 ...". Next I'll inspect the pacemaker
>>> log to see which resource action returns an unexpected value or what
>>> triggers a new state transition.
>>> 
>>> [1] http://paste.openstack.org/show/214919/
>> I’d not recommend mixing the two CLI tools.
>> 
>>>>> The problem is that pacemaker can only promote a resource after it
>>>>> detects the resource is started.
>>>> First we do a non-recurring monitor (*_monitor_0) to check what state the resource is in.
>>>> We can’t assume it's off because a) we might have crashed, b) the admin might have accidentally configured it to start at boot, or c) the admin may have asked us to re-check everything.
>>>> 
>>>>> During a full reassembly, in the first
>>>>> transition batch, pacemaker starts all the resources including MySQL and
>>>>> RabbitMQ. Pacemaker issues the resource agent "start" invocations in parallel
>>>>> and reaps the results.
>>>>> 
>>>>> For a multi-state resource agent like RabbitMQ, pacemaker needs the
>>>>> start result reported in the first batch, then the transition engine and
>>>>> policy engine decide whether it has to retry starting or promote, and put
>>>>> this new transition job into a new batch.
>>>> Also important to know, the order of actions is:
>>>> 
>>>> 1. any necessary demotions
>>>> 2. any necessary stops
>>>> 3. any necessary starts
>>>> 4. any necessary promotions
>>>> 
>>> -- 
>>> Best wishes!
>>> Zhou Zheng Sheng / 周征晟  Software Engineer
>>> Beijing AWcloud Software Co., Ltd.