[openstack-dev] [tripleo] Upgrade plans for RDO Manager - Brainstorming

Jan Provaznik jan.provaznik at gmail.com
Thu Sep 17 14:54:01 UTC 2015


On 09/09/2015 05:34 PM, Zane Bitter wrote:
> On 24/08/15 15:12, Emilien Macchi wrote:
>> Hi,
>>
>> So I've been working on OpenStack deployments for 4 years now and so far
>> RDO Manager is the second installer -after SpinalStack [1]- I'm
>> working on.
>>
>> SpinalStack already had interesting features [2] that allowed us to
>> upgrade our customer platforms almost every month, with full testing
>> and automation.
>>
>> Now that we have RDO Manager, I would be happy to share my little
>> experience on the topic and help make it possible in the next cycle.
>>
>> For that, I created an etherpad [3], which is not too long and focused
>> on basic topics for now. This is technical and focused on Infrastructure
>> upgrade automation.
>>
>> Feel free to continue discussion on this thread or directly in the
>> etherpad.
>>
>> [1] http://spinalstack.enovance.com
>> [2] http://spinalstack.enovance.com/en/latest/dev/upgrade.html
>> [3] https://etherpad.openstack.org/p/rdo-manager-upgrades
>
> I added some notes on the etherpad, but I think this discussion poses a
> larger question: what is TripleO? Why are we using Heat? Because to me
> the major benefit of Heat is that it maintains a record of the current
> state of the system that can be used to manage upgrades. And if we're
> not going to make use of that - if we're going to determine the state of
> the system by introspecting nodes and update it by using Ansible scripts
> without Heat's knowledge, then we probably shouldn't be using Heat at all.
>
> I'm not saying that to close off the option - I think if Heat is not the
> best tool for the job then we should definitely consider other options.
> And right now it really is not the best tool for the job. Adopting
> Puppet (which was a necessary choice IMO) has meant that the
> responsibility for what I call "software orchestration"[1] is split
> awkwardly between Puppet and Heat. For example, the Puppet manifests are
> baked in to images on the servers, so Heat doesn't know when they've
> changed and can't retrigger Puppet to update the configuration when they
> do. We're left trying to reverse-engineer what is supposed to be a
> declarative model from the workflow that we want for things like
> updates/upgrades.
>
> That said, I think there's still some cause for optimism: in a world
> where every service is deployed in a container and every container has
> its own Heat SoftwareDeployment, the boundary between Heat's
> responsibilities and Puppet's would be much clearer. The deployment
> could conceivably fit a declarative model much better, and even offer a
> lot of flexibility in which services run on which nodes. We won't really
> know until we try, but it seems distinctly possible to aspire toward
> Heat actually making things easier rather than just not making them too
> much harder. And there is stuff on the long-term roadmap that could be
> really great if only we had time to devote to it - for example, as I
> mentioned in the etherpad, I'd love to get Heat's user hooks integrated
> with Mistral so that we could have fully-automated, highly-available (in
> a hypothetical future HA undercloud) live migration of workloads off
> compute nodes during updates.
>

TBH I don't expect that using containers will significantly simplify (or
make clearer) the upgrade process. It would work nicely if an upgrade
meant just replacing one container with another (where a container is
represented by a Heat resource). But I'm convinced that replacing a
container will actually involve a complex workflow of actions that
have to be done before and after the replacement.

> In the meantime, however, I do think that we have all the tools in Heat
> that we need to cobble together what we need to do. In Liberty, Heat
> supports batched rolling updates of ResourceGroups, so we won't need to
> use user hooks to cobble together poor-man's batched update support any
> more. We can use the user hooks for their intended purpose of notifying
> the client when to live-migrate compute workloads off a server that is

Unfortunately rolling_updates supports only a "pause time" between update
batches, so if any workflow is needed between batches (e.g. pausing
before the next batch until the user validates that the previous batch
was updated successfully), we still have to use user hooks. But I guess
adding hooks support to rolling_updates wouldn't be too difficult.
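
To make the difference concrete, here is a rough Python sketch (not Heat
code; every name in it is made up for illustration). It updates nodes in
batches and, between batches, either just sleeps for a fixed pause time
or first asks a caller-supplied hook whether to continue. That hook is
the part a plain pause time cannot express.

```python
import time
from typing import Callable, List, Optional, Sequence


def make_batches(nodes: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split nodes into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return [list(nodes[i:i + batch_size])
            for i in range(0, len(nodes), batch_size)]


def rolling_update(
    nodes: Sequence[str],
    batch_size: int,
    update: Callable[[str], None],
    pause_time: float = 0.0,
    validate_batch: Optional[Callable[[List[str]], bool]] = None,
) -> List[str]:
    """Update nodes batch by batch and return the nodes that were updated.

    validate_batch is a hypothetical hook: it is called after every batch
    except the last one, and returning False stops the rollout before the
    next batch starts. Without it, only the fixed pause applies.
    """
    updated: List[str] = []
    batches = make_batches(nodes, batch_size)
    for index, batch in enumerate(batches):
        for node in batch:
            update(node)
            updated.append(node)
        if index == len(batches) - 1:
            break  # nothing left to wait for after the last batch
        if validate_batch is not None and not validate_batch(batch):
            break  # the hook rejected this batch: do not touch the rest
        time.sleep(pause_time)
    return updated
```

With only pause_time set, the loop always runs to the end; with
validate_batch set, a failed check leaves the remaining nodes untouched.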

> about to be upgraded. The Heat templates should already tell us exactly
> which services are running on which nodes. We can trigger particular
> software deployments on a stack update with a parameter value change (as
> we already do with the yum update deployment). For operations that
> happen in isolation on a single server, we can model them as
> SoftwareDeployment resources within the individual server templates. For
> operations that are synchronised across a group of servers (e.g.
> disabling services on the controller nodes in preparation for a DB
> migration) we can model them as a SoftwareDeploymentGroup resource in
> the parent template. And for chaining multiple sequential operations
> (e.g. disable services, migrate database, enable services), we can chain
> outputs to inputs to handle both ordering and triggering. I'm sure there
> will be many subtleties, but I don't think we *need* Ansible in the mix.
>

I agree that both minor and major upgrades *can* be done with existing
Heat features. The other question is how well it works in practice. At
this point not very well (in my experience only), mainly because of
these issues:
- (missing) convergence - suppose a minor rolling upgrade on 60 nodes
  takes 60 minutes; during this upgrade I cannot do another update of
  the stack (e.g. I can't add more compute nodes). (I know convergence
  is being worked on, though.)
- it's quite easy to get a Heat stack into a state from which it's
  pretty difficult to get back to a consistent state

Your proposal of modeling actions as resources chained by input/output
(or maybe just depends_on?) sounds like a good plan. Because of my lack
of Heat knowledge I wonder how well this will work in situations where
a combination of task flow inside a node and orchestration across
nodes is required.
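
As a rough illustration of chaining by input/output, here is a small
Python sketch (again not Heat code; the Step class, the step names and
the output names are all hypothetical). Each step declares which outputs
it consumes and produces, and the execution order is derived from those
declarations rather than written down explicitly.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    name: str
    inputs: List[str]    # names of outputs this step consumes
    outputs: List[str]   # names of outputs this step produces
    run: Callable[[Dict[str, str]], Dict[str, str]]


def execution_order(steps: List[Step]) -> List[str]:
    """Order steps so that every producer runs before its consumers."""
    producer = {out: step.name for step in steps for out in step.outputs}
    graph = {step.name: {producer[i] for i in step.inputs} for step in steps}
    return list(TopologicalSorter(graph).static_order())


def run_chain(steps: List[Step]) -> Tuple[List[str], Dict[str, str]]:
    """Run the steps in dependency order, passing outputs on as inputs."""
    by_name = {step.name: step for step in steps}
    values: Dict[str, str] = {}
    order = execution_order(steps)
    for name in order:
        step = by_name[name]
        values.update(step.run({key: values[key] for key in step.inputs}))
    return order, values


# The steps are listed out of order on purpose; the chain sorts them.
steps = [
    Step("enable_services", ["db_migrated"], ["services_enabled"],
         lambda v: {"services_enabled": "yes"}),
    Step("disable_services", [], ["services_disabled"],
         lambda v: {"services_disabled": "yes"}),
    Step("migrate_database", ["services_disabled"], ["db_migrated"],
         lambda v: {"db_migrated": "yes"}),
]
order, values = run_chain(steps)
```

This covers only ordering and triggering for a single chain. It says
nothing about how steps inside one node and steps across several nodes
would be combined, which is the part I'm unsure about.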

Also I'm not sure how it will work when the first stack-update
operation (e.g. a package update) fails on some of the nodes, so the
stack is in a FAILED state, and the user then runs a different
stack-update operation (because it has higher priority than fixing the
failed package update - e.g. scaling up nodes). I wonder if the user
will be able to successfully finish the scale-up update and then get
back to the package update.

> So it's really up to the wider TripleO project team to decide which path
> to go down. I am genuinely not bothered whether we choose Heat or
> Ansible. There may even be ways they can work together without
> compromising either model. But I would be pretty uncomfortable with a
> mix where we use Heat for deployment and Ansible for doing upgrades
> behind Heat's back.
>

Based on the idea behind TripleO (deploy OpenStack by OpenStack) I think
that for the TripleO project it makes sense to stick with Heat. Your
suggestion of using resources fits well into this concept in theory.
But honestly I think it would be significantly simpler to use an
external tool ATM :).

> cheers,
> Zane.
>
>
> [1]
> http://www.zerobanana.com/archive/2014/05/08#heat-configuration-management
>

Jan


