[openstack-dev] [Heat] Short term scaling strategies for large Heat stacks

Steve Baker sbaker at redhat.com
Mon Jun 2 21:37:25 UTC 2014


On 31/05/14 07:01, Zane Bitter wrote:
> On 29/05/14 19:52, Clint Byrum wrote:
>
>> update-failure-recovery
>> =======================
>>
>> This is a blueprint I believe Zane is working on to land in Juno. It will
>> allow us to retry a failed create or update action. Combined with the
>> separate controller/compute node strategy, this may be our best option,
>> but it is unclear whether that code will be available soon or not. The
>> chunking is definitely required, because with 500 compute nodes, if
>> node #250 fails, the remaining 249 nodes that are IN_PROGRESS will be
>> cancelled, which makes the impact of a transient failure quite extreme.
>> Also without chunking, we'll suffer from some of the performance
>> problems we've seen where a single engine process will have to do all of
>> the work to bring up a stack.
>>
>> Pros: * Uses blessed strategy
>>
>> Cons: * Implementation is not complete
>>       * Still suffers from heavy impact of failure
>>       * Requires chunking to be feasible
>
> I've already started working on this and I'm expecting to have this
> ready some time between the j-1 and j-2 milestones.
>
> I think these two strategies combined could probably get you a long
> way in the short term, though obviously they are not a replacement for
> the convergence strategy in the long term.
>
>
> BTW You missed off another strategy that we have discussed in the
> past, and which I think Steve Baker might(?) be working on: retrying
> failed calls at the client level.
>
As part of the client-plugins blueprint I'm planning on implementing
retry policies on API calls. So where we currently call:

    self.nova().servers.create(**kwargs)

This will soon be:

    self.client().servers.create(**kwargs)

And with a retry policy (assuming the default unique-ish server name is
used):

    self.client_plugin().call_with_retry_policy(
        'cleanup_yr_mess_and_try_again',
        self.client().servers.create, **kwargs)

This should be suitable for handling transient errors on API calls such
as 500s, response timeouts or token expiration. It shouldn't be used for
resources which later come up in an ERROR state; convergence or
update-failure-recovery would be better for that.
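
To make the idea concrete, here is a rough standalone sketch of how a helper
like call_with_retry_policy might behave. It is not the planned Heat
implementation: the TransientError class, the RETRY_POLICIES table, the
cleanup argument and the flaky_create function are all hypothetical names
made up for illustration, and the attempt count and delay are arbitrary.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a transient API failure (500, timeout)."""


# Hypothetical hard-coded policies: name -> (max attempts, delay in seconds).
RETRY_POLICIES = {
    'cleanup_yr_mess_and_try_again': (3, 0.0),
}


def call_with_retry_policy(policy_name, func, *args, cleanup=None, **kwargs):
    """Call func, retrying on TransientError as the named policy allows."""
    max_attempts, delay = RETRY_POLICIES[policy_name]
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the failure
            if cleanup is not None:
                cleanup()  # e.g. delete a half-created server by its name
            time.sleep(delay)


# Usage: a fake create call that fails twice and then succeeds.
calls = []


def flaky_create(name):
    calls.append(name)
    if len(calls) < 3:
        raise TransientError('simulated 500')
    return {'name': name, 'status': 'BUILD'}


server = call_with_retry_policy(
    'cleanup_yr_mess_and_try_again', flaky_create, name='server-abc123')
```

Only the exception type treated as transient is retried here; anything else
propagates immediately, which matches the intent that resources ending up in
an ERROR state are left to convergence or update-failure-recovery.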

These policies can start out simple and hard-coded, but there is
potential for different policies to be specified in heat.conf to cater
for the specific failure modes of a given cloud.

I expect this to be ready some time between the j-1 and j-2 milestones.


