[openstack-dev] [TripleO] our update story: can people live with it?
Clint Byrum
clint at fewbar.com
Wed Jan 22 17:45:45 UTC 2014
Excerpts from Dan Prince's message of 2014-01-22 09:17:24 -0800:
> I've been thinking a bit more about how TripleO updates are developing
> specifically with regards to compute nodes. What is commonly called the
> "update story" I think.
>
> As I understand it we expect people to actually have to reboot a compute
> node in the cluster in order to deploy an update. This really worries me
> because it seems like way overkill for such a simple operation. Let's say
> all I need to deploy is a simple change to Nova's libvirt driver. And
> I need to deploy it to *all* my compute instances. Do we really expect
> people to actually have to reboot every single compute node in their
> cluster for such a thing? And then do this again and again for each
> update they deploy?
>
Agreed: if we make everybody reboot to push out a patch to libvirt, we
have failed. And thus far we are indeed failing on that front, but with
good reason.
Right at this very moment, we are leaning on 'rebuild' in Nova, which
reboots the instance. But this is so that we handle the hardest thing
well first (rebooting to have a new kernel).
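For context, that is simply the standard Nova rebuild operation, roughly
along these lines (the server and image names here are only placeholders):

# Re-image the compute node's root disk with the new deployment image;
# Nova reboots the instance as part of the rebuild.
nova rebuild overcloud-novacompute0 overcloud-compute-image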
For small updates we need to decouple things a bit more. There is a
notion of the image ID in Nova, versus the image ID that is actually
running. Right now we update it with a nova rebuild command only.
But ideally we would give operators a tool to optimize and avoid the
reboot when it is appropriate. The heuristic should be as simple as
comparing kernels (a sketch of such a check follows below). Once we have
determined that a new image does not need a reboot, we can just change
the image ID in the metadata, and an os-refresh-config script will do
something like this:
if [ "$(cat /etc/image_id)" != "$(os-apply-config --key image_id)" ] ;
then;
download_new_image
mount_image /tmp/new_image
mount / -o remount,rw # Assuming we've achieved ro root
rsync --one-file-system -a /tmp/new_image/ /
mount / -o remount,ro # ditto
fi
No reboot required. This would run early in configure.d, so that any
pre-configure.d scripts will have run to quiesce services that can't
handle having their binaries removed out from under them (read:
non-Unix services). Then configure.d runs as usual, configures things,
restarts services, and we are now running the new image.
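As for the kernel-comparison heuristic mentioned above, the operator-side
check could be about this small. This is only a sketch under a couple of
assumptions: the images are raw and loop-mountable, each ships a single
kernel in /boot, and all names and paths are illustrative:

#!/bin/bash
# Decide whether a new image can be applied via the rsync path above, or
# whether it needs a full 'nova rebuild' (and therefore a reboot).
set -eu
old_image=$1   # image currently deployed on the node
new_image=$2   # candidate replacement image

kernel_of() {
    # Assumes a raw image we can loop-mount read-only, with one kernel in /boot.
    local mnt
    mnt=$(mktemp -d)
    sudo mount -o loop,ro "$1" "$mnt"
    basename "$(ls "$mnt"/boot/vmlinuz-* | sort | tail -n 1)"
    sudo umount "$mnt"
    rmdir "$mnt"
}

if [ "$(kernel_of "$old_image")" = "$(kernel_of "$new_image")" ]; then
    echo "kernels match: just update the image ID in metadata, no reboot needed"
else
    echo "kernels differ: a rebuild (and reboot) is required"
fi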
> I understand the whole read-only images thing plays into this too... but
> I'm wondering if there is a middle ground where things might work
> better. Perhaps we could have a mechanism to tar up individual venvs
> from /opt/stack/, or perhaps this is an area where real OpenStack
> packages could shine. It seems like we could certainly come up with some
> simple mechanisms to deploy these sorts of changes with Heat such that
> compute host reboot can be avoided for each new deploy.
Given the scenario above, that would be a further optimization. I don't
think it makes sense to specialize for venvs or OpenStack services
though; just "ensure the root filesystems match" seems like a workable,
highly efficient approach. Note that we have also talked about efficient
ways to distribute the new images widely.
I would call your e-mail a documentation/roadmap bug. This plan may
have been recorded somewhere, but for me it has just always been in my
head as the end goal (thanks to Robert Collins for drilling the hole
and pouring it in there btw ;).