[openstack-dev] [TripleO][Heat] Overcloud software updates and ResourceGroups
Steven Hardy
shardy at redhat.com
Tue Apr 7 09:13:52 UTC 2015
On Thu, Apr 02, 2015 at 06:31:39PM -0400, Zane Bitter wrote:
> A few of us have been looking for a way to perform software updates to
> servers in a TripleO Heat/Puppet-based overcloud that avoids an impedance
> mismatch with Heat concepts and how Heat runs its workflow. As many talented
> TripleO-ers who have gone before can probably testify, that's surprisingly
> difficult to do, but we did come up with an idea that I think might work and
> which I'd like to get wider feedback on. For clarity, I'm speaking here in
> the context of the new overcloud-without-mergepy templates.
>
> The idea is that we create a SoftwareConfig that, when run, can update some
> software on the server. (The exact mechanism for the update is not important
> for this discussion; suffice to say that in principle it could be as simple
> as "[yum|apt-get] update".) The SoftwareConfig would have at least one
> input, though it need not do anything with the value.
>
> Then each server has that config deployed to it with a SoftwareDeployment at
> the time it is created. However, it is set to execute only on the UPDATE
> action. The value of (one of) the input(s) is obtained from a parameter.
>
> As a result, we can trigger the software update by simply changing the value
> of the input parameter, and the regular Heat dependency graph will be
> respected. The actual input value could be by convention a uuid, a
> timestamp, a random string, or just about anything so long as it changes.
>
> Here's a trivial example of what this deployment might look like:
>
>   update_config:
>     type: OS::Heat::SoftwareConfig
>     properties:
>       config: {get_file: do_sw_update.sh}
>       inputs:
>         - name: update_after_time
>           description: Timestamp of the most recent update request
>
>   update_deployment:
>     type: OS::Heat::SoftwareDeployment
>     properties:
>       actions:
>         - UPDATE
>       config: {get_resource: update_config}
>       server: {get_resource: my_server}
>       input_values:
>         update_after_time: {get_param: update_timestamp}
>
>
> (A possible future enhancement is that if you keep a mapping between
> previous input values and the system state after the corresponding update,
> you could even automatically handle rollbacks in the event the user decided
> to cancel the update.)
>
> And now we should be able to trigger an update to all of our servers, in the
> regular Heat dependency order, by simply (thanks to the fact that parameters
> now keep their previous values on stack updates unless they're explicitly
> changed) running a command like:
>
> heat stack-update my_overcloud -f $TMPL -P "update_timestamp=$(date)"
>
> (A future goal of Heat is to make specifying the template again optional
> too... I don't think that change landed yet, but in this case we can always
> obtain the template from Tuskar, so it's not so bad.)
>
>
> Astute readers may have noticed that this does not actually solve our
> problem. In reality groups of similar servers are deployed within
> ResourceGroups and there are no dependencies between the members. So, for
> example, all of the controller nodes would be updated in parallel, with the
> likely result that the overcloud could be unavailable for some time even if
> it is deployed with HA.
>
> The good news is that a solution to this problem is already implemented in
> Heat: rolling updates. For example, the controller node availability problem
> can be solved by setting a rolling update batch size of 1. The bad news is
> that rolling updates are implemented only for AutoscalingGroups, not
> ResourceGroups.
>
> Accordingly, I propose that we switch the implementation of
> overcloud-without-mergepy from ResourceGroups to AutoscalingGroups. This
> would be a breaking change for overcloud updates (although no worse than the
> change from merge.py over to overcloud-without-mergepy), but that also means
> that there'll never be a better time than now to make it.
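For concreteness, a rough (untested) sketch of the batch-size-1 rolling
update you describe - property names are from memory, so worth checking
against the current OS::Heat::AutoScalingGroup schema, and controller.yaml
is just a stand-in for whatever nested template holds the server and its
update deployment:

  controller_group:
    type: OS::Heat::AutoScalingGroup
    properties:
      desired_capacity: 3
      min_size: 1
      max_size: 3
      rolling_updates:
        # keep at least two controllers in service, update one at a time
        min_in_service: 2
        max_batch_size: 1
        pause_time: 30
      resource:
        type: controller.yaml
        properties:
          update_timestamp: {get_param: update_timestamp}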
I wonder if this is an opportunity to look at how we converge
AutoScalingGroup and ResourceGroup in Heat?

It seems like the main barrier to transparent (non-destructive) substitution
of AutoScalingGroup for ResourceGroup is the resource naming (a short UUID
vs. an index-derived name) - could we add a property to AutoScalingGroup
which optionally enables index-based naming, such that switching from
ResourceGroup to ASG in a stack-update wouldn't replace all the group
members?
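Purely hypothetical syntax (no such property exists today), but something
along these lines:

  controller_group:
    type: OS::Heat::AutoScalingGroup
    properties:
      desired_capacity: 3
      min_size: 1
      max_size: 3
      # hypothetical new property: name members by index (0, 1, 2, ...) as
      # ResourceGroup does today, instead of short UUIDs
      member_naming: index
      resource:
        type: controller.yaml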
Another possible fudge, if moving to ASG is impractical, would be to use the
group index in the script applying the update, so that an offset is
introduced between any updates which might cause service interruption. I
know it's a kludge, but e.g. sleeping for a time derived from the group
index before doing the update would be an ugly-but-simple interim workaround
for the "all updated at once" problem you describe.
> I suspect that some folks (Tomas?) have possibly looked into this in the
> past... can anybody identify any potential obstacles to the change? Two
> candidates come to mind:
>
> 1) The SoftwareDeployments (plural) resource type. I believe we carefully
> designed that to work with both ResourceGroup and AutoscalingGroup though.
> 2) The elision feature (https://review.openstack.org/#/c/128365/). Steve, I
> think this was only implemented for ResourceGroup? An AutoscalingGroup
> version of this should be feasible though, or do we have better ideas for
> how to solve it in that context?
Yeah, I started looking at an alternative interface to achieve the same
thing, basically by enabling heat resource-signal to signal an
AutoScalingGroup directly (instead of only via a ScalingPolicy).
I need to revive this patch:
https://review.openstack.org/#/c/143496/
Then we'd need another patch adding support for a new signal payload which
names a specific resource for removal. This is related to the "replace"
interface I described in this thread, only a "remove" variant, which would
enable a functional replacement for the ResourceGroup
removal_policies+resource_list interface we currently provide:
http://lists.openstack.org/pipermail/openstack-dev/2014-December/053447.html
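That is, something along the lines of (the removal payload format isn't
settled yet, this is just to illustrate the shape of it):

  heat resource-signal my_overcloud controller_group \
    -D '{"remove": ["controller_group.2"]}'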
The other gap which occurs to me is that the AutoScalingGroup
implementation doesn't support the index_var feature of ResourceGroup, and
I know that some folks are expecting to use that in TripleO, e.g.:
https://review.openstack.org/#/c/169937/
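For anyone not familiar with it, index_var usage in a ResourceGroup looks
roughly like this (trivial example using OS::Nova::Server rather than the
real TripleO templates):

  controller_group:
    type: OS::Heat::ResourceGroup
    properties:
      count: 3
      resource_def:
        type: OS::Nova::Server
        properties:
          # %index% (the default index_var) is replaced per member: 0, 1, 2
          name: controller-%index%
          image: {get_param: controller_image}
          flavor: {get_param: controller_flavor}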
Again, it seems like figuring out a series of incremental steps to stop the
divergence of the two group resources is what we really need, such that in
future folks don't get pushed into an either/or choice which results in a
later breaking change.
Steve