[openstack-dev] [Heat] Short term scaling strategies for large Heat stacks
clint at fewbar.com
Tue Jun 3 00:28:10 UTC 2014
Excerpts from Steve Baker's message of 2014-06-02 14:37:25 -0700:
> On 31/05/14 07:01, Zane Bitter wrote:
> > On 29/05/14 19:52, Clint Byrum wrote:
> >> update-failure-recovery
> >> =======================
> >> This is a blueprint I believe Zane is working on to land in Juno. It will
> >> allow us to retry a failed create or update action. Combined with the
> >> separate controller/compute node strategy, this may be our best option,
> >> but it is unclear whether that code will be available soon or not. The
> >> chunking is definitely required, because with 500 compute nodes, if
> >> node #250 fails, the remaining 249 nodes that are IN_PROGRESS will be
> >> cancelled, which makes the impact of a transient failure quite extreme.
> >> Also without chunking, we'll suffer from some of the performance
> >> problems we've seen where a single engine process will have to do all of
> >> the work to bring up a stack.
> >> Pros: * Uses blessed strategy
> >> Cons: * Implementation is not complete
> >>       * Still suffers from heavy impact of failure
> >>       * Requires chunking to be feasible
> > I've already started working on this and I'm expecting to have this
> > ready some time between the j-1 and j-2 milestones.
> > I think these two strategies combined could probably get you a long
> > way in the short term, though obviously they are not a replacement for
> > the convergence strategy in the long term.
> > BTW You missed off another strategy that we have discussed in the
> > past, and which I think Steve Baker might(?) be working on: retrying
> > failed calls at the client level.
> As part of the client-plugins blueprint I'm planning on implementing
> retry policies on API calls. So when currently we call:
> This will soon be:
> And with a retry policy (assuming the default unique-ish server name is
> self.client().servers.create, **kwargs)
> This should be suitable for handling transient errors on API calls such
> as 500s, response timeouts or token expiration. It shouldn't be used for
> resources which later come up in an ERROR state; convergence or
> update-failure-recovery would be better for that.
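
For illustration only, a retry policy of the kind described above might
look roughly like the Python sketch below. It is not the client-plugins
implementation: call_with_retry, TransientAPIError and FlakyCreate are
names made up for this example, and the fake create call stands in for
a real API call that hits a 500, a response timeout or an expired token.

```python
import time


class TransientAPIError(Exception):
    """Hypothetical stand-in for a 500, a response timeout or an expired token."""


def call_with_retry(func, *args, retries=3, delay=0.0, **kwargs):
    """Call func, retrying up to `retries` more times on TransientAPIError.

    Hypothetical helper, not part of Heat or any OpenStack client.
    """
    attempt = 0
    while True:
        try:
            return func(*args, **kwargs)
        except TransientAPIError:
            if attempt >= retries:
                raise  # give up and surface the last error
            attempt += 1
            time.sleep(delay * attempt)  # simple linear backoff


class FlakyCreate:
    """Fake API call that fails a fixed number of times, then succeeds."""

    def __init__(self, failures=2):
        self.failures = failures
        self.calls = 0

    def __call__(self, **kwargs):
        self.calls += 1
        if self.calls <= self.failures:
            raise TransientAPIError("HTTP 500")
        return {"name": kwargs["name"], "status": "BUILD"}


if __name__ == "__main__":
    create = FlakyCreate(failures=2)
    server = call_with_retry(create, name="server-abc123")
    print(server, "after", create.calls, "calls")
```

A real policy would presumably also have to cope with a create that
failed on the client side but succeeded on the server side, which this
sketch ignores. It also does nothing for a server that is created and
later goes to ERROR, since the call itself succeeded in that case.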
Steve, this is fantastic work and sorely needed. Thank you for working on
it.
Unfortunately, machines in ERROR state are the majority of our problem. IPMI
and PXE can be unreliable in some environments, and sometimes machines
are broken in subtle ways. Also, the odd bug in Neutron, Nova, or Ironic
will cause this.
Convergence is not available to us for the short term, and really
update-failure-recovery is some time off too, so we need more solutions.