[openstack-dev] [Heat] [TripleO] Rolling updates spec re-written. RFC

Clint Byrum clint at fewbar.com
Wed Feb 5 16:39:53 UTC 2014


Excerpts from Zane Bitter's message of 2014-02-04 16:14:09 -0800:
> On 03/02/14 17:09, Clint Byrum wrote:
> > Excerpts from Thomas Herve's message of 2014-02-03 12:46:05 -0800:
> >>> So, I wrote the original rolling updates spec about a year ago, and the
> >>> time has come to get serious about implementation. I went through it and
> >>> basically rewrote the entire thing to reflect the knowledge I have
> >>> gained from a year of working with Heat.
> >>>
> >>> Any and all comments are welcome. I intend to start implementation very
> >>> soon, as this is an important component of the HA story for TripleO:
> >>>
> >>> https://wiki.openstack.org/wiki/Heat/Blueprints/RollingUpdates
> >>
> >> Hi Clint, thanks for pushing this.
> >>
> >> First, I don't think RollingUpdatePattern and CanaryUpdatePattern should be 2 different entities. The second just looks like a parametrization of the first (growth_factor=1?).
> >
> > Perhaps they can just be one. Until I find parameters which would need
> > to mean something different, I'll just use UpdatePattern.
> >
> >>
> >> I then feel that using (abusing?) depends_on for update pattern is a bit weird. Maybe I'm influenced by the CFN design, but the separate UpdatePolicy attribute feels better (although I would probably use a property). I guess my main question is around the meaning of using the update pattern on a server instance. I think I see what you want to do for the group, where child_updating would return a number, but I have no idea what it means for a single resource. Could you detail the operation a bit more in the document?
> >>
> >
> > I would be o-k with adding another keyword. The idea in abusing depends_on
> > is that it changes the core language less. Properties is definitely out
> > for the reasons Christopher brought up, properties is really meant to
> > be for the resource's end target only.
> 
> Agree, -1 for properties - those belong to the resource, and this data 
> belongs to Heat.
> 
> > UpdatePolicy in cfn is a single string, and causes very generic rolling
> 
> Huh?
> 
> http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatepolicy.html
> 
> Not only is it not just a single string (in fact, it looks a lot like 
> the properties you have defined), it's even got another layer of 
> indirection so you can define different types of update policy (rolling 
> vs. canary, anybody?). It's an extremely flexible syntax.
> 

Oops, I relied a little too much on my memory and not enough on the docs
for that one. O-k, I will re-evaluate given actual knowledge of how it
works. :-P
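
For the archives, the cfn syntax Zane is pointing at looks roughly like
this (the AWS docs show it in JSON; rendered as YAML here, and the
attribute names are my reading of that page, so double-check before
relying on them):

    MyServerGroup:
      Type: AWS::AutoScaling::AutoScalingGroup
      Properties:
        # launch configuration, sizes, etc.
      UpdatePolicy:
        AutoScalingRollingUpdate:
          MinInstancesInService: "1"   # keep at least one member serving
          MaxBatchSize: "2"            # replace at most two members per batch
          PauseTime: PT5M              # wait five minutes between batches

So the extra layer of indirection Zane mentions is the
AutoScalingRollingUpdate key, which is presumably where a canary-style
variant could slot in alongside it.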

> BTW, given that we already implemented this in autoscaling, it might be 
> helpful to talk more specifically about what we need to do in addition 
> in order to support the use cases you have in mind.
> 

As Robert mentioned in his mail, autoscaling groups won't allow us to
inject individual credentials. With the ResourceGroup, we can make a
nested stack with a random string generator, so that is solved. Now the
other piece we need is to be able to directly choose machines to take
out of commission, which I think we may have a simple solution for, but
I don't want to derail on that here.
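
To make the credentials piece concrete, here is a rough sketch of the
kind of nested stack I mean (resource names and properties are purely
illustrative, not from the spec):

    # member.yaml -- instantiated once per group member, so each member
    # gets its own generated credential.
    heat_template_version: 2013-05-23

    resources:
      member_password:
        type: OS::Heat::RandomString
        properties:
          length: 32

      member_server:
        type: OS::Nova::Server
        properties:
          image: my-image
          flavor: m1.small
          user_data:
            str_replace:
              template: |
                #!/bin/bash
                echo "PASSWORD=$password" > /etc/member-credentials
              params:
                $password: { get_attr: [member_password, value] }

and then the top-level template wraps that in a group:

    web_servers:
      type: OS::Heat::ResourceGroup
      properties:
        count: 10
        resource_def:
          type: member.yaml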

The update policy used in AutoScalingGroups is also limited to just the
one group, so the whole operation can be handled inside that resource.

> > update behavior. I want this resource to be able to control multiple
> > groups as if they are one in some cases (Such as a case where a user
> > has migrated part of an app to a new type of server, but not all.. so
> > they will want to treat the entire aggregate as one rolling update).
> >
> > I'm o-k with overloading it to allow resource references, but I'd like
> > to hear more people take issue with depends_on before I select that
> > course.
> 
> Resource references in general, and depends_on in particular, feel like 
> very much the wrong abstraction to me. This is a policy, not a resource.
> 
> > To answer your question, using it with a server instance allows
> > rolling updates across non-grouped resources. In the example the
> > rolling_update_dbs does this.
> 
> That's not a great example, because one DB server depends on the other, 
> forcing them into updating serially anyway.
> 

You're right; a better example is a set of (n) resource groups that
serve the same service, where we want to maintain the minimum service
level across the aggregate as a whole.

If it were an order of magnitude harder to do it this way, I'd say sure,
let's just expand on the single-resource rolling update. But I don't
think it will be that much harder to achieve, and then this use case is
solved as well.
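
Roughly what I have in mind, using the depends_on form from the current
draft of the spec (the resource type and property names here are
placeholders, not settled syntax):

    resources:
      update_pattern:
        type: OS::Heat::UpdatePattern
        properties:
          min_in_service: 8
          batch_size: 2

      app_servers_old:
        type: OS::Heat::ResourceGroup
        depends_on: update_pattern
        properties:
          count: 6
          resource_def:
            type: old_server.yaml

      app_servers_new:
        type: OS::Heat::ResourceGroup
        depends_on: update_pattern
        properties:
          count: 4
          resource_def:
            type: new_server.yaml

The point is that the pattern constrains how many members across both
groups can be out of service at once, which an update policy living
inside a single group resource can't express.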

> I have to say that even in general, this whole idea about applying 
> update policies to non-grouped resources doesn't make a whole lot of 
> sense to me. For non-grouped resources you control the resource 
> definitions individually - if you don't want them to update at a 
> particular time, you have the option of just not updating them.
> 

If I have to calculate all the deltas and feed Heat 10 templates, each
with one small delta, I'm writing the same code as I'm proposing for
this rolling update feature, but I'm writing it outside of Heat. That
seems counter-productive for all of the other Heat users who would find
this useful.

> Where you _do_ need it is for scaling groups where every server is based 
> on the same launch config, so you need a way to control the members 
> individually - by batching up operations (done), adding delays (done) 
> or, even better, notifications and callbacks.
> 
> So it seems like doing 'rolling' updates for any random subset of 
> resources is effectively turning Heat into something of a poor-man's 
> workflow service, and IMHO that is probably a mistake.
> 
> What we do need for all resources (not just scaling groups) is a way for 
> the user to say "for this particular resource, notify me when it has 
> updated (but, if possible, before we have taken any destructive actions 
> on it), give me a chance to test it and accept or reject the update". 
> For example, when you resize a server, give the user a chance to confirm 
> or reject the change at the VERIFY_RESIZE step (Trove requires this). Or 
> when you replace a server during an update, give the user a chance to 
> test the new server and either keep it (continue on and delete the old 
> one) or not (roll back). Or when you replace a server in a scaling 
> group, notify the load balancer _or some other thing_ (e.g. OpenShift 
> broker node) that a replacement has been created and wait for it to 
> switch over to the new one before deleting the old one. Or, of course, 
> when you update a server to some new config, give the user a chance to 
> test it out and make sure it works before continuing with the stack 
> update. All of these use cases can, I think, be solved with a single 
> feature.
> 

Yes, this is another thing that we will need for TripleO: we'll want to
be able to notify a compute node that it will be rebooted soon, have it
evacuate or live-migrate everything off of itself, and then call back to
Heat with "ok, reboot me".

> The open questions for me are:
> 1) How do we notify the user that it's time to check on a resource? 
> (Marconi?)

That would be good, and I believe Dmitry from Savanna was looking at
adding AMQP or Marconi support to os-collect-config. We don't have to
wait for that, though; for a first run we can just update the resource's
metadata to drop a wait condition handle in a pre-selected key. Success
would be "go ahead and reboot me", failure would be "I am in an
inconsistent state, don't touch me."
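
Concretely, I'm picturing Heat dropping something like this into the
resource's metadata before acting on it (the key name and shape are
invented for illustration, not part of the spec):

    pre_update:
      action: REBOOT
      handle_url: <pre-signed WaitConditionHandle URL>

os-collect-config would surface that to the node, which evacuates or
live-migrates its guests and then signals the handle URL to report
success or failure, and Heat proceeds or aborts accordingly.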

> 2) How does the user ack/nack? (You're suggesting reusing WaitCondition, 
> and that makes sense to me.)
> 3) How do we break up the operations so the notification occurs at the 
> right time? (With difficulty, but it should be do-able.)
> 4) How does the user indicate for which resources they want to be 
> notified? (Inside an update_policy? Another new directive at the 
> type/properties/depends_on/update_policy level?)

I don't have a good answer for 3 and 4, but your parenthetical
suggestions are pretty close to what I'm thinking too.


