[openstack-dev] [TripleO][Heat][Kolla][Magnum] The zen of Heat, containers, and the future of TripleO

Zane Bitter zbitter at redhat.com
Mon Mar 21 20:14:20 UTC 2016

tl;dr Containers represent a massive, and also mandatory, opportunity 
for TripleO. Let's start thinking about ways that we can take maximum 
advantage of them to achieve the goals of the project.

Now that you have the tl;dr I'm going to start from the beginning, so 
settle in and grab yourself a cup of coffee or other poison of your choice.

After working on developing Heat from the very beginning of the project 
in early 2012 and debugging a bunch of TripleO deployments in the field, 
it is my considered opinion that Heat is a poor fit for the workloads 
that TripleO is currently asking of it. To illustrate why, I need to 
explain what it is that Heat is really designed to do.

Here's a theoretical example of how I've always imagined Heat software 
deployments would make Heat users' lives better. For simplicity, I'm 
just going to model two software components, a user-facing service that 
connects to some back-end service:

   resources:
     backend_component:
       type: OS::Heat::SoftwareComponent
       properties:
         configs:
           - tool: script
             actions:
               - CREATE
               - UPDATE
             config: |
               PORT=$(get_backend_port || random_port)
               start_backend $DEPLOY_VERSION $PORT $CONFIG
               # assumption: the address is reported as host:port
               addr="$(hostname):$PORT"
               printf '%s' "$addr" >${heat_outputs_path}.host_and_port
           - tool: script
             actions:
               - DELETE
             config: |
               stop_backend  # hypothetical helper, like start_backend
         inputs:
           - name: DEPLOY_VERSION
           - name: CONFIG
         outputs:
           - name: host_and_port

     frontend_component:
       type: OS::Heat::SoftwareComponent
       properties:
         configs:
           - tool: script
             actions:
               - CREATE
               - UPDATE
             config: |
               start_frontend $DEPLOY_VERSION $BACKEND_ADDR $CONFIG
           - tool: script
             actions:
               - DELETE
             config: |
               stop_frontend  # hypothetical helper, like start_frontend
         inputs:
           - name: DEPLOY_VERSION
           - name: BACKEND_ADDR
           - name: CONFIG

     backend:
       type: OS::Heat::SoftwareDeployment
       properties:
         server: {get_resource: backend_server}
         name: {get_param: backend_version} # Forces upgrade replacement
         actions: [CREATE, UPDATE, DELETE]
         config: {get_resource: backend_component}
         input_values:
           DEPLOY_VERSION: {get_param: backend_version}
           CONFIG: {get_param: backend_config}

     frontend:
       type: OS::Heat::SoftwareDeployment
       properties:
         server: {get_resource: frontend_server}
         name: {get_param: frontend_version} # Forces upgrade replacement
         actions: [CREATE, UPDATE, DELETE]
         config: {get_resource: frontend_component}
         input_values:
           DEPLOY_VERSION: {get_param: frontend_version}
           BACKEND_ADDR: {get_attr: [backend, host_and_port]}
           CONFIG: {get_param: frontend_config}

This is actually quite a beautiful system, if I may say so:

- Whenever a version changes, Heat knows to update that component, and 
the components can be updated independently.
- If the backend in this example restarts on a different port, the 
frontend is updated to point to the new port.
- Everything is completely agnostic as to which server it is running on. 
They could be running on the same server or different servers.
- Everything is integrated with the infrastructure (not only the servers 
you're deploying on and the networks and volumes connected to them, but 
also things like load balancers), so everything is created at the right 
time, in parallel where possible, and any errors are reported all in one 
place.
- If something requires e.g. a restart after changing another component, 
we can encode that. And if it doesn't, we can encode that too.
- There's next to no downtime required: if e.g. we upgrade the backend, 
we first deploy a new one listening on a new port, then update the 
frontend to connect to the new port, then finally shut down the old 
backend. Again, we can choose when we want this and when we just want to 
update in place and reload.
- The application doesn't even need to worry about versioning the 
protocol that its two constituent parts communicate over: as long as the 
backend_version and frontend_version that we pass are always compatible, 
only compatible versions of the two services ever talk to each other.
- If anything at all fails at any point before, during or after this 
part of the template, Heat can automatically roll everything back into 
the exact same state as it was in before, without any outside 
intervention. You can insert test deployments that check everything is 
working and have them automatically roll back if it's not, all with no 
downtime for users.
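
To make the dataflow idea in the list above concrete, here is a minimal 
sketch in Python. It is not how Heat is implemented; the DEPENDS_ON table 
and the function name are assumptions made up for illustration. It is 
also more pessimistic than Heat, which only touches the frontend when the 
backend's host_and_port output actually changes.

```python
from graphlib import TopologicalSorter

# Hypothetical model of the example template: each deployment maps to the
# set of things its inputs are read from (stack parameters, or another
# deployment whose outputs it consumes).
DEPENDS_ON = {
    "backend": {"backend_version", "backend_config"},
    "frontend": {"frontend_version", "frontend_config", "backend"},
}


def deployments_to_update(changed_params):
    """Return the deployments to re-run, in dependency order."""
    # static_order() yields every predecessor before its dependants.
    order = TopologicalSorter(DEPENDS_ON).static_order()
    dirty = set(changed_params)
    plan = []
    for node in order:
        if node in DEPENDS_ON and DEPENDS_ON[node] & dirty:
            # Conservative assumption: a re-run may change the outputs,
            # so anything consuming them is treated as dirty too.
            dirty.add(node)
            plan.append(node)
    return plan
```

Changing only frontend_config yields a plan containing just the frontend, 
while changing backend_version yields the backend followed by the frontend.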

So you can use this to do something like a fancier version of blue-green 
deployment,[1] where you're actually rolling out the (virtualised) 
hardware and infrastructure in a blue-green fashion along with the 
software. Not only that, you can choose to replace your whole stack or 
only parts of it. (Note: the way I had to encode this in the example 
above, by changing the deployment name so that it forces a resource 
replacement, is a hack. We really need a feature to specify in a 
software config resource which inputs should result in a replacement on 
change.)
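
The ordering that makes this kind of replacement safe can be sketched in 
a few lines of Python. This is only an illustration of the sequence 
described above, not Heat code: the three callables are hypothetical 
stand-ins for the deployments, and the rollback path assumes that 
re-pointing the frontend at the old backend still works.

```python
def blue_green_upgrade(start_backend, point_frontend_at, stop_backend,
                       old_addr, new_version):
    """Replace the backend without a gap in service; undo on failure.

    All three callables are hypothetical stand-ins for deployments.
    """
    # Bring up the replacement while the old backend keeps serving.
    new_addr = start_backend(new_version)
    try:
        # Switch the frontend over to the new backend.
        point_frontend_at(new_addr)
    except Exception:
        # Roll back: restore the old wiring and discard the new backend.
        point_frontend_at(old_addr)
        stop_backend(new_addr)
        raise
    # Only once the frontend has moved is the old backend retired.
    stop_backend(old_addr)
    return new_addr
```

The point of the ordering is that the old backend is never stopped until 
nothing depends on it, which is what makes a rollback at any step cheap.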

It's worth noting that in practice you really, really want everything 
deployed in containers to make this process work consistently, even 
though *in theory* you could make this work (briefly) without them. In 
particular, rollback without containers is a dicey proposition. When we 
first started talking about implementing software deployments in Heat I 
half-seriously suggested that maybe we should make containers the only 
allowed type of software deployment, and I kind of wonder now if I 
shouldn't have pressed harder on that point.

In any event, unfortunately as everyone involved in TripleO knows, the 
way TripleO uses Heat looks nothing like this. It actually looks more 
like this:

   resources:
     # (resource names here are illustrative)
     install_config:
       type: OS::Heat::SoftwareConfig
       properties:
         actions: [CREATE]
         config: {get_file: install_all_the_things_on_one_server.sh}

     update_config:
       type: OS::Heat::SoftwareConfig
       properties:
         actions: [UPDATE]
         config: {get_file: update_all_the_things_on_one_server.sh}
         inputs:
           - name: update_count


(Filling in the rest is left as an exercise to the reader. You're welcome.)

Not illustrated are the multiple sources of truth that we have: puppet 
modules (packaged on the server), puppet manifests and hieradata 
(delivered via Heat), external package repositories. Heat is a dataflow 
language but much of the data it should be operating on is actually 
hidden from it. That's going about as well as you might expect.

Due to the impossibility of ever rolling back a deployment like one of 
those, we just disable rollback for the overcloud templates, so if 
there's a failure we end up stuck in whatever intermediate state we were 
in when the script died. That can leave things in a state where 
recovery is not automatic when 'earlier' deployments (like the package 
update) end up depending on state set up by 'later' deployments (like 
the post- scripts, which manipulate Pacemaker's state in Pacemaker-based 
deployments). Even worse, many of the current scripts leave the machine 
in a state that requires manual recovery should they fail part-way through.

Indeed, this has literally none of the benefits of the ideal Heat 
deployment enumerated above save one: it may be entirely the wrong tool 
in every way for the job it's being asked to do, but at least it is 
still well-integrated with the rest of the infrastructure.

Now, at the Mitaka summit we discussed the idea of a 'split stack', 
where we have one stack for the infrastructure and a separate one for 
the software deployments, so that there is no longer any tight 
integration between infrastructure and software. Although it makes me a 
bit sad in some ways, I can certainly appreciate the merits of the idea 
as well. However, from the argument above we can deduce that if this is 
the *only* thing we do then we will end up in the very worst of all 
possible worlds: the wrong tool for the job, poorly integrated. Every 
single advantage of using Heat to deploy software will have evaporated, 
leaving only disadvantages.

So what would be a good alternative? And how would we evaluate the options?

To my mind, the purpose of the TripleO project is this: to ensure that 
there is an OpenStack community collaborating around each part of the 
OpenStack installation/management story. We don't care about TripleO 
"owning" that part (all things being equal, we'd prefer not to), just 
that nobody should have to go outside the OpenStack community and/or 
roll their own thing to install OpenStack unless they want to. So I 
think the ability to sustain a community around whatever solution we 
choose ought to be a primary consideration.

The use of Ironic has been something of a success story here. There's 
only one place to add hardware support to enable both installing 
OpenStack itself on bare-metal via TripleO and the 'regular' 
bare-metal-to-tenant use case of Ironic. This is a clear win/win.

Beyond getting the bare-metal machines marshalled, the other part of the 
solution is configuration management and orchestration of the various 
software services. When TripleO started there was nowhere in OpenStack 
that was defining the relationships between services needed to 
orchestrate them. To a large extent there still isn't. I think that one 
of the reasons we adopted Puppet in TripleO was that it was supposed to 
provide this, at least within a limited scope (i.e. on one machine - the 
puppet-deploying community is largely using Ansible to orchestrate 
across boxes, and we are using Heat). However, what we've discovered in 
the past few months is that Puppet is actually not able to fulfil this 
role as long as we support Pacemaker-based deployments as an option, 
because in that case Pacemaker actually has control of starting and 
stopping all of the services. As a result we are back to defining it all 
ourselves in the Pacemaker config plus various hacky shell scripts, 
instead of relying on (and contributing to!) a larger community. Even 
ignoring that, Puppet doesn't solve the problem of orchestrating across 
multiple machines.

Clearly one option would be to encode everything in Heat along the lines 
of the first example above. I think once we have containers this could 
actually work really well for compute nodes and other types of scale-out 
nodes (e.g. Swift nodes). The scale-out model of Heat scaling groups 
works really well for this use case, and between the improvements we 
have put in place (like batched updates and user hooks) and those still 
on the agenda (like notifications + automatic Mistral workflow 
triggering on hooks) Heat could provide a really good way of capturing 
things like migrating user workloads on scale down and rolling updates 
in the templates, so that they can be managed completely automatically 
by the undercloud with no client involvement (and when the undercloud 
becomes HA, they'll get HA for free). I'd be pretty excited to see this 
tried. The potential downside is that the orchestration definitions are 
still trapped inside the TripleO templates, so they're not being shared 
outside of the TripleO community. This is probably justified, though, 
owing to their close ties to the underlying infrastructure.
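
As a rough illustration of what batched updates with a hook amount to, 
here is a small Python sketch. It is not Heat's implementation: the 
function names are invented, max_batch_size is borrowed loosely from the 
rolling update policy on scaling groups, and before_batch is a 
hypothetical stand-in for a user hook (for example, one that migrates 
workloads off the nodes about to be updated).

```python
def rolling_update_batches(nodes, max_batch_size):
    """Split nodes into consecutive batches of at most max_batch_size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [nodes[i:i + max_batch_size]
            for i in range(0, len(nodes), max_batch_size)]


def rolling_update(nodes, max_batch_size, update_node, before_batch=None):
    """Update nodes batch by batch, calling the optional hook first."""
    for batch in rolling_update_batches(nodes, max_batch_size):
        if before_batch is not None:
            # Hypothetical hook point, e.g. migrate workloads away.
            before_batch(batch)
        for node in batch:
            update_node(node)
```

With five compute nodes and a batch size of two, the hook fires three 
times, and at most two nodes are out of service at any moment.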

An alternative out of left field: as far as I can gather the "completely 
new way of orchestrating activities" used by the new Puppet Application 
Orchestration thing[2] uses substantially the same model as I described 
for Heat above. If we added Puppet Application Orchestration data to 
openstack-puppet-modules then it may be possible to write a tool to 
generate Heat templates from that data. However in talking with Emilien 
it sounds like o-p-m is quite some time away from tackling PAO. So I 
don't think this is really feasible.

In any event, it's when we get to the controller nodes that the 
downsides become more pronounced. We're no longer talking about one 
deployment per service like I sketched above; each service is actually 
multiple deployments forming an active-active cluster with virtual IPs 
and failover and all that jazz. It may be that everything would just 
work the same way, but we would be in uncharted territory and there 
would likely be unanticipated subtleties. It's particularly unclear how 
we would handle stop-the-world database migrations in this model, 
although we do have the option of hoping that stop-the-world database 
migrations will have been completely phased out by then.

To make it even more complicated, we ultimately want the services to 
be spread heterogeneously among controller nodes in a configurable way. I 
believe that Dan's work on composable roles has already gone some way 
toward this without even using containers, but it's likely to become 
increasingly difficult to model in Heat without some sort of template 
generation. (I personally think that template generation would be a Good 
Thing, but we've chosen not to go down that path so far.) Quite possibly 
even just having composable roles could make it untenable to continue 
maintaining separate Pacemaker and non-Pacemaker deployment modes. It'd 
be really nice to have the flexibility to do things like scale out 
different services at different rates. What's more, we are going to need 
some way of redistributing services when a machine in the cluster fails, 
and ultimately we would like that process to be automated, which would 
*require* a template generation service.
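
To give a feel for the kind of decision such a service would have to 
make, here is a toy Python sketch of redistributing services after a node 
failure. Everything in it is an assumption for illustration: the 
placement mapping, the node and service names, and the "least loaded 
survivor" rule. A real scheduler weighs far more than this.

```python
def redistribute(placement, failed_node):
    """Move the failed node's services onto the surviving nodes.

    placement maps node name -> list of service names (hypothetical).
    Returns a new mapping; the input is left untouched.
    """
    survivors = {n: list(s) for n, s in placement.items()
                 if n != failed_node}
    if not survivors:
        raise ValueError("no surviving nodes to host the services")
    for service in placement.get(failed_node, []):
        # Prefer nodes not already running the service, then the least
        # loaded one; sorting the names keeps the choice deterministic.
        target = min(sorted(survivors),
                     key=lambda n: (service in survivors[n],
                                    len(survivors[n])))
        if service not in survivors[target]:
            survivors[target].append(service)
    return survivors
```

The new placement would then have to be turned back into templates, which 
is where the template generation requirement comes from.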

We certainly *could* build all of that. But we definitely shouldn't 
because this is the kind of thing that services like Kubernetes and 
Apache Mesos are designed to do already. And that raises another 
possibility: Angus & friends are working on capturing the orchestration 
relationships for Mesos+Marathon within the Kolla project (specifically, 
in the kolla-mesos repository). This represents a tremendous opportunity 
for the TripleO project to further its mission of having the same 
deployment tools available to everyone as an official part of the 
OpenStack project without having to maintain them separately.

As of the Liberty release, Magnum now supports provisioning Mesos 
clusters, so TripleO wouldn't have to maintain the installer for that 
either. (The choice of Mesos is somewhat unfortunate in our case, 
because Magnum's Kubernetes support is much more mature than its Mesos 
support, and because the reasons for the decision are about to be or 
have already been overtaken by events - I've heard reports that the 
features that Kubernetes was missing to allow it to be used for 
controller nodes, and maybe even compute nodes, are now available. 
Nonetheless, I expect the level of Magnum support for Mesos is likely 
workable.) This is where the TripleO strategy of using OpenStack to 
deploy OpenStack can really pay dividends: because we use Ironic all of 
our servers are accessible through the Nova API, so in theory we can 
just run Magnum out of the box.

The chances of me personally having time to prototype this are 
slim-to-zero, but I think this is a path worth investigating.


[1] http://martinfowler.com/bliki/BlueGreenDeployment.html
[2] https://puppetlabs.com/introducing-puppet-application-orchestration
