[openstack-dev] [Heat] Short term scaling strategies for large Heat stacks

Clint Byrum clint at fewbar.com
Fri May 30 01:09:18 UTC 2014


Excerpts from Mike Spreitzer's message of 2014-05-30 05:42:43 +0530:
> Clint Byrum <clint at fewbar.com> wrote on 05/29/2014 07:52:07 PM:
> 
> > I am writing to get some brainstorming started on how we might mitigate
> > some of the issues we've seen while deploying large stacks on Heat. I am
> > sending this to the dev list because it may involve landing fixes rather
> > than just using different strategies. The problems outlined here are
> > well known and reported as bugs or feature requests, but there may be
> > more that we can do.
> > 
> > ...
> > 
> > Strategies:
> > 
> > ...
> > 
> > update-failure-recovery
> > =======================
> > 
> > This is a blueprint I believe Zane is working on to land in Juno. It will
> > allow us to retry a failed create or update action. Combined with the
> > separate controller/compute node strategy, this may be our best option,
> > but it is unclear whether that code will be available soon or not. The
> > chunking is definitely required, because with 500 compute nodes, if
> > node #250 fails, the remaining 249 nodes that are IN_PROGRESS will be
> > cancelled, which makes the impact of a transient failure quite extreme.
> > Also without chunking, we'll suffer from some of the performance
> > problems we've seen where a single engine process will have to do all of
> > the work to bring up a stack.
> > 
> > Pros: * Uses blessed strategy
> > 
> > Cons: * Implementation is not complete
> >       * Still suffers from heavy impact of failure
> >       * Requires chunking to be feasible
> 
> I like this one.  As I remarked in the convergence discussion, I think the 
> first step there is a DB schema change to separate desired and observed 
> state.  Once that is done, failure on one resource need not wedge a stack; 
> non-dependent resources (like the peer compute nodes) can still be 
> created.

It's not just the observed state that you need in the database to resume.

You also need the parameters and template snippet that have been
successfully applied.
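
To make that concrete, here is a rough sketch of the per-resource record
a resume would need. This is not Heat's actual schema: the class, field,
and function names are all made up for illustration, and the status
strings are just placeholders. The point is only that "observed status"
alone cannot tell you whether a COMPLETE resource still matches what the
template now asks for.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceRecord:
    # Hypothetical record; none of these names come from Heat itself.
    name: str
    desired_snippet: dict                    # what the current template asks for
    desired_params: dict                     # parameters supplied with that template
    observed_status: str = "INIT"            # placeholder status values
    applied_snippet: Optional[dict] = None   # last snippet successfully applied
    applied_params: Optional[dict] = None    # parameters used for that apply


def mark_applied(rec):
    """Record that the desired snippet and parameters were applied successfully."""
    rec.observed_status = "COMPLETE"
    rec.applied_snippet = copy.deepcopy(rec.desired_snippet)
    rec.applied_params = copy.deepcopy(rec.desired_params)


def needs_work(rec):
    """Return True if a resumed action must (re)apply this resource."""
    if rec.observed_status != "COMPLETE":
        return True
    # COMPLETE is not enough: compare what was applied with what is wanted now.
    return (rec.applied_snippet != rec.desired_snippet
            or rec.applied_params != rec.desired_params)


def resources_to_resume(records):
    """Names of the resources a retry would have to touch, in input order."""
    return [r.name for r in records if needs_work(r)]
```

With only the status column, a resource that completed under the old
parameters would look finished even though a retry with new parameters
still has to update it.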


