[openstack-dev] [nova] Nova compute will delete all your instances if you change its hostname

Matthew Booth mbooth at redhat.com
Fri Feb 27 16:24:36 UTC 2015


Gary Kotton originally posted this bug against the VMware driver:

https://bugs.launchpad.net/nova/+bug/1419785

I posted a proposed patch to fix this here:

https://review.openstack.org/#/c/158269/1

However, Dan Smith pointed out that the bug can actually be triggered
against any driver in a manner not addressed by the above patch alone.
I have confirmed this on a libvirt setup as follows:

1. Create some instances
2. Shut down n-cpu
3. Change hostname
4. Restart n-cpu

Nova compute will delete all instances in libvirt, but continue to
report them as ACTIVE and Running.
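
To make the failure concrete, here's a minimal sketch of the logic
that goes wrong (illustrative stand-ins only, not the actual Nova
code):

    import socket

    class Instance(object):
        def __init__(self, name, host):
            self.name = name
            self.host = host  # the compute host the DB believes owns this VM

    def destroy_evacuated_instances(local_instances, current_host):
        # Any VM found locally whose DB record points at a different
        # host is assumed to have been evacuated away, so its local
        # copy is destroyed.
        for instance in local_instances:
            if instance.host != current_host:
                print('destroying %s (assumed evacuated)' % instance.name)

    # DB records were written while the host was called 'compute-1'.
    instances = [Instance('vm-%d' % i, 'compute-1') for i in range(3)]

    # After a hostname change, every local VM looks evacuated.
    destroy_evacuated_instances(instances, socket.gethostname())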

There are 2 parts to this issue:

1. _destroy_evacuated_instances() should do a better job of sanity
checking before performing such a drastic action (see the sketch after
this list).

2. The underlying issue is the definition and use of instance.host,
instance.node, compute_node.host and compute_node.hypervisor_hostname.
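
Continuing the sketch above, one possible sanity check for (1) would
be to require positive evidence of an evacuation before destroying
anything (find_evacuation_record is a hypothetical lookup):

    def destroy_evacuated_instances_checked(local_instances, current_host,
                                            find_evacuation_record):
        for instance in local_instances:
            if instance.host == current_host:
                continue
            if find_evacuation_record(instance) is None:
                # Mismatch but no evacuation record: more likely a
                # renamed host than an evacuation, so leave it alone.
                print('refusing to destroy %s' % instance.name)
                continue
            print('destroying %s (evacuation confirmed)' % instance.name)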

(1) is belt and braces. It's very important, but I want to focus on
(2) here. You'll instantly notice some inconsistent naming, so to
clarify:

* instance.host == compute_node.host == Nova compute's 'host' value.
* instance.node == compute_node.hypervisor_hostname == an identifier
which represents a hypervisor.

Architecturally, I'd argue that these mean:

* Host: A Nova communication endpoint for a hypervisor.
* Hypervisor: The physical location of a VM.

Note that in the case above, the libvirt driver changed the hypervisor
identifier even though the hypervisor itself had not changed; only its
communication endpoint had. I propose the following:

* ComputeNode describes 1 hypervisor.
* ComputeNode maps 1 hypervisor to 1 compute host.
* A ComputeNode is identified by a hypervisor_id.
* hypervisor_id represents the physical location of running VMs,
independent of a compute host.
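
As a rough sketch of the proposed model (field names other than
hypervisor_id are illustrative):

    class ComputeNode(object):
        # One record per hypervisor. hypervisor_id is the unique,
        # persistent identity of the hypervisor; host is merely the
        # communication endpoint currently serving it.
        def __init__(self, hypervisor_id, host):
            self.hypervisor_id = hypervisor_id  # unique, survives renames
            self.host = host                    # may change freely

    # Renaming the compute host updates only the endpoint; the
    # hypervisor, and therefore the VMs on it, keep their identity.
    node = ComputeNode('some-persistent-uuid', 'compute-1')
    node.host = 'compute-1-renamed'  # VMs are untouched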

This renames compute_node.hypervisor_hostname to
compute_node.hypervisor_id. That resolves some confusion, because it
asserts that the identity of the hypervisor is tied to the data
describing the VMs, not to the host which happens to be running it. In
fact, for the VMware and Ironic drivers the value has never been a
hostname.

VMware[1] and Ironic don't require any changes here. Other drivers will
need to be modified so that get_available_nodes() returns a persistent
value rather than just the hostname. A reasonable default implementation
of this would be to write a uuid to a file which lives with VM data and
return its contents. If the hypervisor has a native concept of a
globally unique identifier, that should be used instead.
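
A minimal sketch of that default (the state directory is whatever
location the driver already uses for VM data; the path is the
caller's choice):

    import os
    import uuid

    def get_persistent_hypervisor_id(state_dir):
        # Generate a uuid on first use, store it alongside the VM data,
        # and return the stored value on every subsequent call, so the
        # id survives hostname changes.
        id_file = os.path.join(state_dir, 'hypervisor_id')
        if os.path.exists(id_file):
            with open(id_file) as f:
                return f.read().strip()
        hypervisor_id = str(uuid.uuid4())
        with open(id_file, 'w') as f:
            f.write(hypervisor_id)
        return hypervisor_id

A driver's get_available_nodes() would then return
[get_persistent_hypervisor_id(state_dir)] rather than the hostname.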

ComputeNode.hypervisor_id is unique. The hypervisor is unique (there is
physically only 1 of it) so it does not make sense to have multiple
representations of it and its associated resources.

An Instance's location is its hypervisor, wherever that may be, so
Instance.host could be removed. This isn't strictly necessary, but the
field is redundant because the communication endpoint is available via
ComputeNode. Removing it would also make changing a communication
endpoint trivial, if we ever wanted to support that. Thinking blue sky,
it would also open the future possibility of multiple communication
endpoints for a single hypervisor.

There is a data migration issue associated with changing a driver's
reported hypervisor id. The bug linked below fudges it, but if we were
doing it for all drivers I believe it could be handled efficiently by
passing the instance list already collected by ComputeManager.init_host
to the driver at startup.
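
Something along these lines (a sketch only; the hook and objects are
illustrative, but the instance list is the one init_host already has):

    def migrate_hypervisor_ids(compute_node, instances, new_id):
        # One-off migration: when the driver starts reporting a
        # persistent id, rewrite the ComputeNode record and the node
        # field of the already-loaded instances in a single pass.
        old_id = compute_node.hypervisor_id
        if old_id == new_id:
            return
        compute_node.hypervisor_id = new_id
        for instance in instances:
            if instance.node == old_id:
                instance.node = new_id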

My proposed patch above fixes a potentially severe issue for users of
the VMware and Ironic drivers. In conjunction with a move to a
persistent hypervisor id for other drivers, it also fixes the related
issue described above across the board. I would like to go forward with
my proposed fix as it has an immediate benefit, and I'm happy to work on
the persistent hypervisor id for other drivers.

Matt

[1] Modulo bugs: https://review.openstack.org/#/c/159481/
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490


