[openstack-dev] [TripleO][Heat] Overcloud software updates and ResourceGroups

Zane Bitter zbitter at redhat.com
Thu Apr 2 22:31:39 UTC 2015

A few of us have been looking for a way to perform software updates to 
servers in a TripleO Heat/Puppet-based overcloud that avoids an 
impedance mismatch with Heat concepts and how Heat runs its workflow. As 
many talented TripleO-ers who have gone before can probably testify, 
that's surprisingly difficult to do, but we did come up with an idea 
that I think might work and which I'd like to get wider feedback on. For 
clarity, I'm speaking here in the context of the new 
overcloud-without-mergepy templates.

The idea is that we create a SoftwareConfig that, when run, can update 
some software on the server. (The exact mechanism for the update is not 
important for this discussion; suffice to say that in principle it could 
be as simple as "[yum|apt-get] update".) The SoftwareConfig would have 
at least one input, though it need not do anything with the value.

Then each server has that config deployed to it with a 
SoftwareDeployment at the time it is created. However, it is set to 
execute only on the UPDATE action. The value of (one of) the input(s) is 
obtained from a parameter.

As a result, we can trigger the software update by simply changing the 
value of the input parameter, and the regular Heat dependency graph will 
be respected. The actual input value could be by convention a UUID, a 
timestamp, a random string, or just about anything so long as it changes.

Here's a trivial example of what this deployment might look like:

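   update_config:
     type: OS::Heat::SoftwareConfig
     properties:
       config: {get_file: do_sw_update.sh}
       inputs:
         - name: update_after_time
           description: Timestamp of the most recent update request

   # The resource name "update_deployment" is illustrative.
   update_deployment:
     type: OS::Heat::SoftwareDeployment
     properties:
       # Run the config only on stack UPDATE, not on CREATE.
       actions:
         - UPDATE
       config: {get_resource: update_config}
       server: {get_resource: my_server}
       input_values:
         update_after_time: {get_param: update_timestamp}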
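
For completeness, the example above assumes the template also declares 
the update_timestamp parameter. A minimal sketch of that declaration 
might look like the following; the default value and the description 
text are my assumptions, not something we have settled on:

   parameters:
     update_timestamp:
       type: string
       # Assumed default; nothing runs at create time anyway, since
       # the deployment is restricted to the UPDATE action.
       default: ''
       description: Change this value to trigger a software update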

(A possible future enhancement is that if you keep a mapping between 
previous input values and the system state after the corresponding 
update, you could even automatically handle rollbacks in the event the 
user decided to cancel the update.)

Parameters now keep their previous values on stack updates unless 
they're explicitly changed, so we should be able to trigger an update 
to all of our servers, in the regular Heat dependency order, by simply 
running a command like:

   heat stack-update my_overcloud -f $TMPL -P "update_timestamp=$(date)"

(A future goal of Heat is to make specifying the template again optional 
too... I don't think that change has landed yet, but in this case we can 
always obtain the template from Tuskar, so it's not so bad.)

Astute readers may have noticed that this does not actually solve our 
problem. In reality groups of similar servers are deployed within 
ResourceGroups and there are no dependencies between the members. So, 
for example, all of the controller nodes would be updated in parallel, 
with the likely result that the overcloud could be unavailable for some 
time even if it is deployed with HA.

The good news is that a solution to this problem is already implemented 
in Heat: rolling updates. For example, the controller node availability 
problem can be solved by setting a rolling update batch size of 1. The 
bad news is that rolling updates are implemented only for 
AutoscalingGroups, not ResourceGroups.
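
As a rough sketch of what a batch size of 1 could look like, assuming 
the rolling_updates property of OS::Heat::AutoScalingGroup: the 
resource name, the group size of 3, the min_in_service value and the 
nested template name controller.yaml are all hypothetical here, and I 
haven't verified this against the overcloud templates:

   controllers:
     type: OS::Heat::AutoScalingGroup
     properties:
       min_size: 3
       max_size: 3
       desired_capacity: 3
       rolling_updates:
         # Update one member at a time, keeping the others in service.
         max_batch_size: 1
         min_in_service: 2
       resource:
         # Hypothetical nested template containing the server and its
         # SoftwareDeployment from the earlier example.
         type: controller.yaml
         properties:
           update_timestamp: {get_param: update_timestamp}

The intent is that changing update_timestamp changes the member 
definition, so the group would walk through its members one batch at a 
time instead of updating them all in parallel.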

Accordingly, I propose that we switch the implementation of 
overcloud-without-mergepy from ResourceGroups to AutoscalingGroups. This 
would be a breaking change for overcloud updates (although no worse than 
the change from merge.py over to overcloud-without-mergepy), but that 
also means that there'll never be a better time than now to make it.

I suspect that some folks (Tomas?) have possibly looked into this in the 
past... can anybody identify any potential obstacles to the change? Two 
candidates come to mind:

1) The SoftwareDeployments (plural) resource type. I believe we 
carefully designed that to work with both ResourceGroup and 
AutoscalingGroup though.
2) The elision feature (https://review.openstack.org/#/c/128365/). 
Steve, I think this was only implemented for ResourceGroup? An 
AutoscalingGroup version of this should be feasible though, or do we 
have better ideas for how to solve it in that context?

