[openstack-dev] [TripleO][heat] a small experiment with Ansible in TripleO

Clint Byrum clint at fewbar.com
Mon Aug 11 22:16:35 UTC 2014

Excerpts from Zane Bitter's message of 2014-08-11 13:35:44 -0700:
> On 11/08/14 14:49, Clint Byrum wrote:
> > Excerpts from Steven Hardy's message of 2014-08-11 11:40:07 -0700:
> >> On Mon, Aug 11, 2014 at 11:20:50AM -0700, Clint Byrum wrote:
> >>> Excerpts from Zane Bitter's message of 2014-08-11 08:16:56 -0700:
> >>>> On 11/08/14 10:46, Clint Byrum wrote:
> >>>>> Right now we're stuck with an update that just doesn't work. It isn't
> >>>>> just about update-failure-recovery, which is coming along nicely, but
> >>>>> it is also about the lack of signals to control rebuild, poor support
> >>>>> for addressing machines as groups, and unacceptable performance in
> >>>>> large stacks.
> >>>>
> >>>> Are there blueprints/bugs filed for all of these issues?
> >>>>
> >>>
> >>> Convergence addresses the poor performance for large stacks in general.
> >>> We also have this:
> >>>
> >>> https://bugs.launchpad.net/heat/+bug/1306743
> >>>
> >>> Which shows how slow metadata access can get. I have worked on patches
> >>> but haven't been able to complete them. We made big strides but we are
> >>> at a point where 40 nodes polling Heat every 30s is too much for one CPU
> This sounds like the same figure I heard at the design summit; did the 
> DB call optimisation work that Steve Baker did immediately after that 
> not have any effect?

Steve's work got us to 40. From 7.
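For scale, the quoted figure works out to a fairly modest aggregate request rate, which suggests each metadata resolution is expensive server-side (a back-of-the-envelope sketch; the per-request CPU cost is inferred from the numbers in this thread, not measured):

```python
# Back-of-the-envelope: aggregate metadata polling load on heat-engine.
# Figures from the thread: 40 nodes, each polling every 30 seconds,
# is enough to saturate a single CPU.

nodes = 40
poll_interval_s = 30

requests_per_second = nodes / poll_interval_s  # ~1.33 req/s

# If ~1.33 req/s saturates one CPU, each metadata request costs
# roughly this much CPU time (an inference, not a measurement):
cpu_seconds_per_request = 1 / requests_per_second  # ~0.75 CPU-seconds

print(f"{requests_per_second:.2f} req/s, ~{cpu_seconds_per_request:.2f} CPU-s each")
```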

> >>> to handle. When we scaled Heat out onto more CPUs on one box by forking
> >>> we ran into eventlet issues. We also ran into issues because even with
> >>> many processes we can only use one to resolve templates for a single
> >>> stack during update, which was also excessively slow.
> >>
> >> Related to this, and a discussion we had recently at the TripleO meetup is
> >> this spec I raised today:
> >>
> >> https://review.openstack.org/#/c/113296/
> >>
> >> It's following up on the idea that we could potentially address (or at
> >> least mitigate, pending the fully convergence-ified heat) some of these
> >> scalability concerns, if TripleO moves from the one-giant-template model
> >> to a more modular nested-stack/provider model (e.g. what Tomas has been
> >> working on)
> >>
> >> I've not got into enough detail on that yet to be sure if it's achievable
> >> for Juno, but it seems initially to be complex-but-doable.
> >>
> >> I'd welcome feedback on that idea and how it may fit in with the more
> >> granular convergence-engine model.
> >>
> >> Can you link to the eventlet/forking issues bug please?  I thought since
> >> bug #1321303 was fixed that multiple engines and multiple workers should
> >> work OK, and obviously that being true is a precondition to expending
> >> significant effort on the nested stack decoupling plan above.
> >>
> >
> > That was the issue. So we fixed that bug, but we never un-reverted
> > the patch that forks enough engines to use up all the CPU's on a box
> > by default. That would likely help a lot with metadata access speed
> > (we could manually do it in TripleO but we tend to push defaults. :)
> Right, and we decided we wouldn't because it's wrong to do that to 
> people by default. In some cases the optimal running configuration for 
> TripleO will differ from the friendliest out-of-the-box configuration 
> for Heat users in general, and in those cases - of which this is one - 
> TripleO will need to specify the configuration.

Whether or not the default should be to fork 1 process per CPU is a
debate for another time. The point is, we can safely use the forking in
Heat now to perhaps improve performance of metadata polling.
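For reference, opting in per-deployment rather than waiting on the default is a one-line config change (a sketch of a heat.conf fragment; the option name num_engine_workers is assumed here and should be checked against the config reference for the deployed Heat release):

```ini
# heat.conf -- fork multiple heat-engine workers to spread
# metadata-polling load across CPUs, instead of relying on the
# upstream default (option name assumed; verify per release).
[DEFAULT]
num_engine_workers = 8
```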

Chasing that, and other optimizations, has not led us to a place where
we can get to, say, 100 real nodes _today_. We're chasing another way to
get to the scale and capability we need _today_, in much the same way
we did with merge.py. We'll find the way to get it done more elegantly
as time permits.
