[openstack-dev] [heat] operators vs users for choosing convergence engine

Clint Byrum clint at fewbar.com
Tue Feb 3 19:12:00 UTC 2015


Excerpts from Zane Bitter's message of 2015-02-03 10:00:44 -0800:
> On 02/02/15 19:52, Steve Baker wrote:
> > A spec has been raised to add a config option to allow operators to
> > choose whether to use the new convergence engine for stack operations.
> > For some context you should read the spec first [1]
> >
> > Rather than doing this, I would like to propose the following:
> 
> I am strongly, strongly opposed to making this part of the API.
> 
> > * Users can (optionally) choose which engine to use by specifying an
> > engine parameter on stack-create (choice of classic or convergence)
> > * Operators can set a config option which determines which engine to use
> > if the user makes no explicit choice
> > * Heat developers will set the default config option from classic to
> > convergence when convergence is deemed sufficiently mature
> 
> We'd also need a way for operators to prevent users from enabling 
> convergence if they're not ready to support it.
> 

This would be relatively simple to do by providing a list of the
supported stack versions.
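
As a rough illustration only (this is not Heat code, and the names
choose_engine, default_engine and allowed_engines are hypothetical, not
existing Heat options), the combination of a user-supplied engine
parameter, an operator default, and an operator-supplied list of supported
engines could be sketched like this:

```python
from typing import Optional, Sequence

CLASSIC = "classic"
CONVERGENCE = "convergence"


class EngineNotAllowed(ValueError):
    """Raised when the selected engine is not in the operator's list."""


def choose_engine(requested: Optional[str],
                  default_engine: str,
                  allowed_engines: Sequence[str]) -> str:
    """Pick the engine for a new stack.

    requested       -- engine named by the user on stack-create, or None
    default_engine  -- hypothetical operator config: engine used when the
                       user makes no explicit choice
    allowed_engines -- hypothetical operator config: engines this
                       deployment is prepared to support
    """
    # An explicit user choice wins; otherwise fall back to the default.
    engine = requested if requested is not None else default_engine
    # The operator's list is the final gate, so users cannot opt in to
    # an engine the operator is not ready to support.
    if engine not in allowed_engines:
        raise EngineNotAllowed(
            "engine %r is not supported here (allowed: %s)"
            % (engine, ", ".join(allowed_engines)))
    return engine
```

With allowed_engines set to only "classic", a request for "convergence"
is rejected; once the operator adds "convergence" to the list, users can
opt in per stack while the default stays wherever the operator left it.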

> > I realize it is not ideal to expose this kind of internal implementation
> > detail to the user, but choosing convergence _will_ result in different
> > stack behaviour (such as multiple concurrent update operations) so there
> > is an argument for giving the user the choice. Given enough supporting
> > documentation they can choose whether convergence might be worth trying
> > for a given stack (for example, a large stack which receives frequent
> > updates)
> 
> It's supposed to be a strict improvement; we don't need to ask 
> permission. We have made major changes of this type in practically every 
> Heat release. When we switched from creating resources serially to 
> creating them in parallel in Havana we didn't ask permission. We just 
> did it. When we started allowing users to recover from a failed 
> operation in Juno we didn't ask permission. We just did it. We don't 
> need to ask permission to allow concurrent updates. We can just do it.
> 

The visible change in making things parallel was minimal. In talking
about convergence, it's become clear that users can and should expect
something radically different when they issue stack updates. I'd love to
say that convergence could just be bound into the old ways, but doing so
would also remove the benefit of having it.

Also, allowing resume wasn't a new behavior; it was really fixing a bug
(that state was lost on failed operations). Convergence is a pretty
different beast from the current model, and letting users fall back
to the old one means that when things break they can solve their own
problem while the operator and devs figure it out. The operator may know
what is breaking on their side, but they may have very little idea of what
is happening on the end-user's side.

> The only difference here is that we are being a bit smarter and 
> uncoupling our development schedule from the release cycle. There are 15 
> other blueprints, essentially all of which have to be complete before 
> convergence is usable at all. It won't do *anything at all* until we are 
> at least 12 blueprints in. The config option buys us time to land them 
> without the risk of something half-finished appearing in the release 
> (trunk-chasers will also thank us). It has no other legitimate purpose IMO.
> 

The config option only really allows an operator to go forward. If
the users start expecting concurrent updates and resiliency, and then
all their stacks are rolled back to the old engine because #reasons,
this puts pressure on the operator. This will make operators delay the
forward progress onto convergence for as long as possible.

I'm also not entirely sure rolling the config option back to the old
setting would even be possible without breaking any in-progress stacks.

> The goal is IN NO WAY to maintain separate code paths in the long term. 
> The config option is simply a development strategy to allow us to land 
> code without screwing up a release and while maintaining as much test 
> coverage as possible.
> 

Nobody plans to maintain the Keystone v2 domainless implementation forever
either. But letting users consider domains and other v3 options for a while
means that the ecosystem grows more naturally without giving up ground
to instability. Once the v3 adoption rate is high enough, people will
likely look at removing the old code because nobody uses it. In my
opinion OpenStack has been far too eager to deprecate and remove things
that users rely on, but I do think this will happen and should happen
eventually.

> > Operators likely won't feel they have enough knowledge to make the call
> > that a heat install should be switched to using all convergence, and
> > users will never be able to try it until the operators do (or the
> > default switches).
> 
> Hardly anyone should have to make a call. We should flip the default as 
> soon as all of the blueprints have landed (i.e. as soon as it works at 
> all), provided that a release is not imminent. (Realistically, at this 
> point I think we have to say the target is to do it as early as in 
> Lizard as we can.) That means for those chasing trunk they get it as 
> soon as it works at all, and for those using stable releases they get it 
> at the next release, just like every other feature we have ever added.
> 
> As a bonus, trunk-chasing operators who need to can temporarily delay 
> enabling of convergence until a point of their choosing in the release 
> cycle by overriding the default. Anybody in that position likely has 
> enough knowledge to make the right call for them.
> 

The end users need a vote too. Operators will certainly know when
convergence based stacks are costing them less than classic stacks. But
they may not know that the new convergence stacks are breaking users. So
rather than an operator flipping the config option and suddenly breaking
users, if they can flip the default, wait a while, and then inform any
users that are doing things "the old way" that it will be turned off some
day, that lets both sides move forward at a pace that makes sense to them.

> So I believe that all of our stakeholders are catered to by the config 
> option: operators & users who want a stable, tested release; 
> operator/users who want to experiment on the bleeding edge; and 
> operators who chase trunk but whose users require stability.
> 
> The only group that benefits from enshrining the choice in the API - 
> users who want to experiment with the bleeding edge, but who don't 
> control their own OpenStack deployment - doesn't actually exist, and if 
> it did then this would be among the least of their problems.
> 

Last I checked there was at least one limited beta of a public Heat
where the users don't control their cloud directly. Also it's likely in
corporate environments that the Heat users won't control the Heat
service directly. But those users may be ready to try concurrent updates
before the feature has fully stabilized.

Anyway, there's one facet you might have missed, which is that by putting
it in the API you serve the users who can't immediately adapt to changes
in the new engine's behavior, and want to use the old behavior.

> > Finally, there are also some benefits to heat developers. Creating a
> > whole new gate job to test convergence-enabled heat will consume its
> > share of CI resource. I'm hoping to make it possible for some of our
> > functional tests to run against a number of scenarios/environments.
> > Being able to run tests under classic and convergence scenarios in one
> > test run will be a great help (for performance profiling too).
> 
> I think this is the strongest argument in favour. However, I'd like to 
> think it would be possible to run the functional tests twice in the 
> gate, changing the config and restarting the engine in between.
> 
> But if the worst comes to the worst, then although I think it's 
> preferable to use one VM for twice as long vs. two VMs for the same 
> length of time, I don't think the impact on resource utilisation in the 
> gate of choosing one over the other is likely to be huge. And I don't 
> see this situation persisting for a long time. The purpose of running 
> both sets of tests is to buy us time to write a migration tool without 
> having to delay flipping the config switch until it is ready. So we'd 
> likely only have to continue running the legacy tests for one release cycle.
> 

The matrix of config settings should just be assumed to be ludicrous
at this point. So adding one more full iteration in the test matrix
is, I think, out of the question. But testing that interface surface
area hasn't diverged in ways that we don't expect is a more tractable
problem. That said, I see this as a temporary improvement, so I don't
really think it is "the reason" to offer the option.


