[openstack-dev] [heat][tripleo] User Initiated Rollback

Steve Baker sbaker at redhat.com
Thu Dec 3 21:22:39 UTC 2015


On 04/12/15 03:41, Steven Hardy wrote:
> On Thu, Dec 03, 2015 at 08:11:41AM -0500, Dan Prince wrote:
>> On Wed, 2015-12-02 at 16:02 +0000, Steven Hardy wrote:
>>> So, chatting with Giulio today about
>>> https://bugs.launchpad.net/heat/+bug/1521944
>>> has me thinking about $subject.
>>>
>>> The root cause of that issue is essentially a corner case of a
>>> stack-update, combined with some coupling within the Neutron API which
>>> prevents the update traversal from working.
>>>
>>> But it raises the broader question of what a "rollback" actually is, and
>>> how a user can potentially use it to get out of the kind of mess described
>>> in that bug (where, otherwise, your only option is to delete the entire
>>> stack).
>>>
>>> Currently, we treat rollback as a special type of update, where, if an
>>> in-progress update fails, we then try to update again, to the previous
>>> stack definition[1], but as Giulio has discovered, there are times when
>>> that doesn't work, because what you actually want is to recover the
>>> existing resource from the backup stack, not create a new one with the
>>> same properties.
>> Is there more information about this case (a bug perhaps)? Presumably
>> it is an OpenStack resource you are talking about here... like a Nova
>> Server or Neutron Network Port?
> Well the bug is linked above (1521944), but there's no bug specific to
> rollback.
>
> As Zane has pointed out, heat is actually working as desired here, because
> we aren't able to differentiate an attempt to delete a neutron port which
> results in "not allowed, in use" from one which results in "500, I am
> broken".
>
> I was hoping there was some way to make this easier via rollback, but
> increasingly it seems the solution is to avoid telling Heat to do the wrong
> thing in the first place (doing so is the root cause of this issue).
>
> There are a few ways we can do that:
>
> 1. Stop defining default "noop" resources in
> overcloud-resource-registry-puppet.yaml - it makes it too easy to
> accidentally switch to a noop (destructive) implementation on update.
Splitting out the noop stubs into their own environment that only gets 
included on overcloud create would certainly lower the risk of 
customizations being overwritten by stubs. We would just need a strategy 
for when new types are added that need to be stubbed by default.
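
To make the idea concrete, here is a minimal sketch of how a client could
assemble the environment list. Everything in it is an assumption for
illustration: the function name, the "noop-stubs.yaml" file name and the
"create"/"update" action strings are hypothetical and not existing
tripleoclient code. It relies only on the fact that environments passed later
take precedence over earlier ones.

```python
def environments_for(action, user_envs, stub_env="noop-stubs.yaml"):
    """Return the ordered list of environment files for a deploy action.

    stub_env is a hypothetical file holding only the default noop mappings.
    """
    if action not in ("create", "update"):
        raise ValueError("unknown action: %r" % (action,))
    envs = []
    if action == "create":
        # Stubs go first so any user environment listed later overrides them.
        envs.append(stub_env)
    # On update the stubs are left out, so they cannot silently replace a
    # customization that was applied when the stack was created.
    envs.extend(user_envs)
    return envs
```

With this, environments_for("update", ["network-isolation.yaml"]) contains
only the user's file, which is the point of the split. It does not address the
open question of newly added types that need a stub by default.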
> 2. Improve heat stack update preview, so it handles nested stacks, then we
> can easily have a pre-update validation step, which for example checks (and
> warns, loudly) if any resources will be deleted (particularly network and
> server resources..)  I'm working on this ref:
>
> https://bugs.launchpad.net/heat/+bug/1521971
We should definitely do this once pre-update works for nested stacks.
tripleoclient could have a whitelist of resource types which generally
shouldn't be replaced (subnets, ports, servers) and prompt the user with
a list of resources which will be replaced and a N/y question to continue.
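
A rough sketch of what that check might look like follows. It assumes the
update preview is available as a dict with a "replaced" list whose entries
carry "resource_name" and "resource_type" keys; that shape, the function
names and the contents of PROTECTED_TYPES are assumptions for illustration,
not existing tripleoclient code.

```python
# Resource types that generally should not be replaced on update
# (illustrative whitelist; the real list would need discussion).
PROTECTED_TYPES = frozenset({
    "OS::Neutron::Subnet",
    "OS::Neutron::Port",
    "OS::Nova::Server",
})


def risky_replacements(preview, protected=PROTECTED_TYPES):
    """Return (name, type) pairs for protected resources marked as replaced."""
    risky = []
    for res in preview.get("replaced", []):
        if res.get("resource_type") in protected:
            risky.append((res.get("resource_name"), res["resource_type"]))
    return risky


def confirm_update(preview, ask=input):
    """Return True if the update may proceed; the default answer is No."""
    risky = risky_replacements(preview)
    if not risky:
        return True
    listing = "\n".join("  %s (%s)" % pair for pair in risky)
    prompt = ("These resources will be replaced:\n%s\n"
              "Continue? [N/y] " % listing)
    return ask(prompt).strip().lower() in ("y", "yes")
```

Only the "replaced" category is inspected, so in-place updates of a server or
port pass without a prompt. The ask parameter exists so the prompt can be
driven from tests or a --yes style option instead of stdin.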

> 3. Implement a template annotation which allows you to say "don't update"
> for certain resources, such as servers and network ports etc.  Rabi is
> working on this, here's the (old) BP which didn't get implemented but I
> think will help us:
>
> https://github.com/openstack/heat-specs/blob/master/specs/kilo/stack-update-restrict.rst
Yes, a way of declaring a resource as not replaceable would also 
increase safety (in-place updates should be fine though)

>>> Then, looking at convergence, we have a different definition of rollback,
>>> and it's not yet clear to me how this should behave in a similar scenario,
>>> e.g. when the resource we want to roll back to failed to get deleted but
>>> still exists (so, the resource is FAILED, but the underlying resource is
>>> fine)?
>>>
>>> Finally, the interface to rollback - atm you have to know before something
>>> fails that you'd like to enable rollback for a specific update.  This seems
>>> suboptimal, since invariably by the time you know you need rollback, it's
>>> too late.  Can we enable a user-initiated rollback from a FAILED state, via
>>> one of:
>>>
>>>   - Introduce a new heat API that allows an explicit
>>>     heat stack-rollback?
>>>   - (ab)use PATCH to trigger rollback on
>>>     heat stack-update -x --rollback=True?
>>>
>>> The former approach fits better with the current stack.Stack
>>> implementation, because the ROLLBACK stack state already exists.  The
>>> latter has the advantage that it doesn't need a new API so might be
>>> backportable.
>>>
>>> Any thoughts on how we might proceed to make this situation better, and
>>> enable folks to roll back in the least destructive way possible when they
>>> end up in a FAILED state?
>> From a TripleO standpoint I would really like to end up in a place
>> where we think of Heat less as a rollback tool and more as a "make it
>> so" tool. I think there might be a small case for the
>> "infrastructure" side where Heat is creating OpenStack objects for us
>> (servers and ports). We'd like not to destroy/replace these when we
>> update the "infrastructure" pieces of our stack, and if things go badly
>> on an update you just want to stay in the (hopefully still working)
>> previous state.
> Yeah, keeping the infrastructure and software configuration more cleanly
> separated will help, but we still need much better pre-update validation.
>
>> On the configuration side (currently software deployments driving puppet)
>> I would very much like to have Heat be a "make it so" tool that does what
>> we tell it. If I wanted to roll back the configuration I would prefer
>> to simply do another heat stack-update with the previous
>> parameters/manifests/etc. Or perhaps more drastically, delete the
>> entire configuration stack and heat stack-create with the previous one.
>> Puppet is meant to be idempotent, so re-running a previously working
>> manifest might be just what you want. This wouldn't cover all cases
>> for rollback... and there are certainly things where you'd want a
>> custom ad-hoc puppet snippet or bash script to run before you did a
>> follow-up heat stack-update to put things back like they were. For
>> these cases I think workflow tools that help drive our Heat
>> configuration orchestration could work well.
> Yeah, this already works fine for softwareconfig stuff, just not some of
> the infrastructure pieces, such as network/subnet/port, where more care is
> required to avoid doing the wrong thing.
>
> Cheers,
>
> Steve
>
>> Dan
>>
>>> Steve
>>>
>>> [1] https://github.com/openstack/heat/blob/master/heat/engine/stack.py#L1331
>>> [2] https://github.com/openstack/heat/blob/master/heat/engine/stack.py#L1143
>>>
>>> __________________________________________________________________________
>>> OpenStack Development Mailing List (not for usage questions)
>>> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev