[openstack-dev] [TripleO][Heat] Overcloud software updates and ResourceGroups

Zane Bitter zbitter at redhat.com
Tue Apr 7 23:12:42 UTC 2015

On 07/04/15 05:13, Steven Hardy wrote:
> On Thu, Apr 02, 2015 at 06:31:39PM -0400, Zane Bitter wrote:
>> A few of us have been looking for a way to perform software updates to
>> servers in a TripleO Heat/Puppet-based overcloud that avoids an impedance
>> mismatch with Heat concepts and how Heat runs its workflow. As many talented
>> TripleO-ers who have gone before can probably testify, that's surprisingly
>> difficult to do, but we did come up with an idea that I think might work and
>> which I'd like to get wider feedback on. For clarity, I'm speaking here in
>> the context of the new overcloud-without-mergepy templates.
>> The idea is that we create a SoftwareConfig that, when run, can update some
>> software on the server. (The exact mechanism for the update is not important
>> for this discussion; suffice to say that in principle it could be as simple
>> as "[yum|apt-get] update".) The SoftwareConfig would have at least one
>> input, though it need not do anything with the value.
>> Then each server has that config deployed to it with a SoftwareDeployment at
>> the time it is created. However, it is set to execute only on the UPDATE
>> action. The value of (one of) the input(s) is obtained from a parameter.
>> As a result, we can trigger the software update by simply changing the value
>> of the input parameter, and the regular Heat dependency graph will be
>> respected. The actual input value could be by convention a uuid, a
>> timestamp, a random string, or just about anything so long as it changes.
>> Here's a trivial example of what this deployment might look like:
>>    update_config:
>>      type: OS::Heat::SoftwareConfig
>>      properties:
>>        config: {get_file: do_sw_update.sh}
>>        inputs:
>>          - name: update_after_time
>>            description: Timestamp of the most recent update request
>>    update_deployment:
>>      type: OS::Heat::SoftwareDeployment
>>      properties:
>>        actions:
>>          - UPDATE
>>        config: {get_resource: update_config}
>>        server: {get_resource: my_server}
>>        input_values:
>>          update_after_time: {get_param: update_timestamp}
>> (A possible future enhancement is that if you keep a mapping between
>> previous input values and the system state after the corresponding update,
>> you could even automatically handle rollbacks in the event the user decided
>> to cancel the update.)
>> And now we should be able to trigger an update to all of our servers, in the
>> regular Heat dependency order, by simply (thanks to the fact that parameters
>> now keep their previous values on stack updates unless they're explicitly
>> changed) running a command like:
>>    heat stack-update my_overcloud -f $TMPL -P "update_timestamp=$(date)"
>> (A future goal of Heat is to make specifying the template again optional
>> too... I don't think that change landed yet, but in this case we can always
>> obtain the template from Tuskar, so it's not so bad.)
>> Astute readers may have noticed that this does not actually solve our
>> problem. In reality groups of similar servers are deployed within
>> ResourceGroups and there are no dependencies between the members. So, for
>> example, all of the controller nodes would be updated in parallel, with the
>> likely result that the overcloud could be unavailable for some time even if
>> it is deployed with HA.
>> The good news is that a solution to this problem is already implemented in
>> Heat: rolling updates. For example, the controller node availability problem
>> can be solved by setting a rolling update batch size of 1. The bad news is
>> that rolling updates are implemented only for AutoscalingGroups, not
>> ResourceGroups.
>> Accordingly, I propose that we switch the implementation of
>> overcloud-without-mergepy from ResourceGroups to AutoscalingGroups. This
>> would be a breaking change for overcloud updates (although no worse than the
>> change from merge.py over to overcloud-without-mergepy), but that also means
>> that there'll never be a better time than now to make it.
> I wonder if this is an opportunity to look at how we converge
> AutoScalingGroup and ResourceGroup in Heat?

As long as it's not one of those insoluble opportunities.

> It seems like the main barrier to transparent (non-destructive) substitution
> of AutoScalingGroup for ResourceGroup is the resource naming (e.g. it's a short
> UUID vs an index derived name) - could we add a property to
> AutoScalingGroup which allowed optionally to use index based naming, such
> that switching from ResourceGroup to ASG in a stack-update wouldn't replace
> all the group members?

I would say the main barrier is that you can't ever change a resource's 
type without replacing it, and even the hacky workaround we have 
(abandon/adopt) is not robust enough to actually use. Resource naming 
doesn't even make the list - AutoscalingGroup doesn't care how its 
members are named and always preserves existing names.

> Another possible fudge if moving to ASG is impractical could be to use the
> index in the script applying the update, such that an offset is introduced
> between any updates which may cause service interruption (I know it's a
> kludge, but e.g. sleeping for a time derived from the group index before
> doing the update would be an ugly-but-simple interim workaround for the
> "all updated at once" problem you describe).

If it comes to that, I'd infinitely prefer setting pre-update hooks on
all the resources and controlling the update from an external workflow.
We have to implement that anyway for other things (e.g. a phased initial
deployment). It'd just be nicer if Heat could handle that part for us.
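
As a rough sketch of the hook approach, an environment file along these
lines would set a pre-update hook on individual resources. It assumes the
`hooks` key under `resource_registry: resources:` from the Kilo breakpoints
work. The resource names are hypothetical, and the exact syntax (especially
for members nested inside a group) should be checked against the Heat
version in use.

```yaml
# Hypothetical environment file; resource names are illustrative only.
resource_registry:
  resources:
    controller_0:
      hooks: pre-update   # Heat pauses before updating this resource
    controller_1:
      hooks: pre-update
```

The external workflow would then clear the hooks one resource at a time
(e.g. with `heat hook-clear`, assuming the client in use provides it), so
that only one controller is updating at any moment.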

>> I suspect that some folks (Tomas?) have possibly looked into this in the
>> past... can anybody identify any potential obstacles to the change? Two
>> candidates come to mind:
>> 1) The SoftwareDeployments (plural) resource type. I believe we carefully
>> designed that to work with both ResourceGroup and AutoscalingGroup though.
>> 2) The elision feature (https://review.openstack.org/#/c/128365/). Steve, I
>> think this was only implemented for ResourceGroup? An AutoscalingGroup
>> version of this should be feasible though, or do we have better ideas for
>> how to solve it in that context?
> Yeah, I started looking at an alternative interface to achieve the same
> thing, basically by enabling heat resource-signal to signal an
> AutoScalingGroup directly (instead of only via a ScalingPolicy).
> I need to revive this patch:
> https://review.openstack.org/#/c/143496/
> Then we'd need another patch adding support for a new signal payload which
> specifies a specific resource for removal.  This is related to the
> "replace" interface I described in this thread, only a "remove" variant
> which would enable a functional replacement for the ResourceGroup
> removal_policies+resource_list interface we currently provide.
> http://lists.openstack.org/pipermail/openstack-dev/2014-December/053447.html

Cool, I like this approach of passing data in a signal to remove/replace 
a member. Wearing my downstream hat for a second, if we can make it 
completely backwards compatible then I'd go so far as to say that 
carrying it as a patch in RDO for Kilo, while distasteful, is probably 
less distasteful to me than continuing with ResourceGroup.

> The other gap which occurs to me is that the AutoScalingGroup
> implementation doesn't support the index_var feature of ResourceGroup, and
> I know that some folks are expecting to use that in TripleO, e.g:
> https://review.openstack.org/#/c/169937/

I don't have enough context around that patch to know how it relates to 
the index_var feature, and I don't see that feature being used anywhere 
in tripleo-heat-templates.

What I do know is that anything that relies on an incrementing integer 
index assigned by Heat is likely to run into problems eventually, 
because the elision feature will inevitably cause the list to become 
increasingly sparse over time. So we should probably pick another approach.

> Again, it seems like figuring out a series of incremental steps to stop the
> divergence of the two group resources is what we really need, such that in
> future folks don't get pushed into an either/or choice which results in a
> later breaking change.

I agree that Heat needs to stop maintaining two different 
implementations of the exact same idea and adding new features to one or 
the other, such that there's never one resource type that does all of 
the things you need to do. However, it's not clear to me that there's a 
path to seamlessly migrate both types to the same implementation 
encompassing all of the features without breaking existing users.

So are you saying that you are -1 on converting the 
overcloud-without-mergepy templates to AutoscalingGroup?

I haven't yet heard anyone say that a pre-Kilo breaking change like this 
to overcloud-without-mergepy.yaml would be a problem for them...
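
For concreteness, a converted controller group might look roughly like
the sketch below. It assumes the `rolling_updates` / `max_batch_size`
properties of OS::Heat::AutoScalingGroup, which should be verified against
the Heat release in use. The `controller_count` parameter, the size limits,
and the resource name are assumptions made up for illustration, and the
member's own properties are omitted.

```yaml
  # Illustrative sketch only, not the actual overcloud template.
  controller_group:
    type: OS::Heat::AutoScalingGroup
    properties:
      min_size: 1
      max_size: 3                                      # assumed upper bound
      desired_capacity: {get_param: controller_count}  # hypothetical parameter
      rolling_updates:
        max_batch_size: 1    # update one member at a time
      resource:
        type: OS::TripleO::Controller
```

With a batch size of 1, changing update_timestamp should cause the members
to be updated one after another rather than all in parallel.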

