Open Stack

Wed Jan 22 20:27:50 UTC 2014

On Jan 22, 2014, at 1:53 PM, Jay Pipes wrote:

> On Wed, 2014-01-22 at 13:15 -0500, Dan Prince wrote:
>> 
>> ----- Original Message -----
>>> From: "Clint Byrum" <clint at fewbar.com>
>>> To: "openstack-dev" <openstack-dev at lists.openstack.org>
>>> Sent: Wednesday, January 22, 2014 12:45:45 PM
>>> Subject: Re: [openstack-dev] [TripleO] our update story: can people live with it?
>>> 
>>> Excerpts from Dan Prince's message of 2014-01-22 09:17:24 -0800:
>>>> I've been thinking a bit more about how TripleO updates are developing
>>>> specifically with regard to compute nodes. This is what is commonly
>>>> called the "update story", I think.
>>>> 
>>>> As I understand it we expect people to actually have to reboot a compute
>>>> node in the cluster in order to deploy an update. This really worries me
>>>> because it seems like overkill for such a simple operation. Let's say
>>>> all I need to deploy is a simple change to Nova's libvirt driver. And
>>>> I need to deploy it to *all* my compute instances. Do we really expect
>>>> people to actually have to reboot every single compute node in their
>>>> cluster for such a thing? And then do this again and again for each
>>>> update they deploy?
>>>> 
>>> 
>>> Agreed, if we make everybody reboot to push out a patch to libvirt, we
>>> have failed. And thus far, we are failing to avoid that, but with good
>>> reason.
>>> 
>>> Right at this very moment, we are leaning on 'rebuild' in Nova, which
>>> reboots the instance. But this is so that we handle the hardest thing
>>> well first (rebooting to have a new kernel).
>>> 
>>> For small updates we need to decouple things a bit more. There is a
>>> notion of the image ID in Nova, versus the image ID that is actually
>>> running. Right now we update it with a nova rebuild command only.
>>> 
>>> But ideally we would give operators a tool to optimize and avoid the
>>> reboot when it is appropriate. The heuristic should be as simple as
>>> comparing kernels.
>> 
>> When we get to implementing such a thing I might prefer it not to be auto-magic. I can see a case where I want the new image but maybe not the new kernel. Perhaps this should be addressed when building the image (by using the older kernel)... but still. I could see a case for explicitly not wanting to reboot here as well.
> 
> ++
> 
>>> Once we have determined that a new image does not
>>> need a reboot, we can just change the ID in Metadata, and an
>>> os-refresh-config script will do something like this:
>>> 
>>> if [ "$(cat /etc/image_id)" != "$(os-apply-config --key image_id)" ] ;
>>> then
>>>    download_new_image
>>>    mount_image /tmp/new_image
>>>    mount / -o remount,rw # Assuming we've achieved ro root
>>>    rsync --one-file-system -a /tmp/new_image/ /
>>>    mount / -o remount,ro # ditto
>>> fi
>>> 
>>> No reboot required. This would run early in configure.d, so that any
>>> pre-configure.d scripts will have run to quiesce services that can't
>>> handle having their binaries removed out from under them (read:
>>> non-Unix services). Then configure.d runs as usual, configures things,
>>> restarts services, and we are now running the new image.
>> 
>> Cool. I like this a good bit better as it avoids the reboot. Still, this is a rather large amount of data to copy around if I'm only changing a single file in Nova.
> 
> Right.
> 
>>> 
>>>> I understand the whole read only images thing plays into this too... but
>>>> I'm wondering if there is a middle ground where things might work
>>>> better. Perhaps we have a mechanism where we can tar up individual venvs
>>>> from /opt/stack/ or perhaps also this is an area where real OpenStack
>>>> packages could shine. It seems like we could certainly come up with some
>>>> simple mechanisms to deploy these sorts of changes with Heat such that
>>>> compute host reboot can be avoided for each new deploy.
>>> 
>>> Given the scenario above, that would be a further optimization. I don't
>>> think it makes sense to specialize for venvs or openstack services
>>> though, so just "ensure the root filesystems match" seems like a
>>> workable, highly efficient system. Note that we've talked about having
>>> highly efficient ways to widely distribute the new images as well.
>> 
>> Yes. Optimization! In the big scheme of things I could see 3 approaches being useful:
>> 
>> 1) Deploy a full image and reboot if you have a kernel update. (entire image is copied)
>> 
>> 2) Deploy a full image if you change a bunch of things and/or you prefer to do that. (entire image is copied)
>> 
>> 3) Deploy specific application level updates via packages or tarballs. (only selected applications/packages get deployed)
> 
> ++. FWIW, #3 happens a heck of a lot more often than #1 or #2 in CD
> environments, so this level of optimization will be frequently used.
> And, as I've said before, optimizing for frequently-used scenarios is
> worth spending the time on. Optimizing for infrequently-occurring
> things... not so much. :)

I don't understand the aversion to using existing, well-known tools to handle this.

A hybrid model (blending 2 and 3, above) would work best here, I think:
TripleO lays down a baseline image and the cloud operator employs a well-known
and supported configuration management tool for any small diffs.

The operator would then be empowered to make the call on any major upgrades that
would adversely impact the infrastructure (and ultimately the users/apps). He or she
could say, "This is a major release, let's deploy the image."

Something logically like this seems reasonable:

	if (system_change > 10%) {
	  use TripleO;
	} else {
	  use Existing_Config_Management;
	}
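
As a rough sketch of that check in shell: the function name, its inputs, and
the idea of measuring "system change" as changed files over total files are
all my assumptions for illustration, not something TripleO provides. Only the
10% default comes from the pseudocode above.

```shell
#!/bin/sh
# Hypothetical helper: choose an update path from how much of the system changed.
# changed_files and total_files are assumed inputs (for example, counts taken
# from a manifest diff between the running image and the new one).
choose_update_method() {
    changed_files=$1
    total_files=$2
    threshold_percent=${3:-10}   # default threshold of 10%

    if [ "$total_files" -le 0 ]; then
        echo "error: total_files must be positive" >&2
        return 1
    fi

    # Integer-only comparison of changed/total against threshold/100.
    if [ $((changed_files * 100)) -gt $((total_files * threshold_percent)) ]; then
        echo "image"               # large change: redeploy the image via TripleO
    else
        echo "config-management"   # small diff: use the existing tooling
    fi
}
```

The operator would call it as "choose_update_method 5 100" and could still
override the result by hand, which is the point of leaving the call with them.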

It seems disruptive to force compute (or other) nodes to reboot on trivial updates.

If we are to get further enterprise adoption of OpenStack, this seems like a huge
blocker. It will be very hard to sell traditional IT folk on this approach:

	"Wait, *every* time I have to make a system change, I need to reboot my
	 entire cloud?"

Elastic cloud concepts are already trying enough for the enterprise. 

	-k
