[openstack-dev] [nova] theoretical race between live migration and resource audit?
Chris Friesen
chris.friesen at windriver.com
Thu Jun 9 21:41:20 UTC 2016
Hi,
I'm wondering if we might have a race between live migration and the resource
audit. I've included a few people on the recipient list who have worked
directly with this code in the past.
In _update_available_resource() we have code that looks like this:
    instances = objects.InstanceList.get_by_host_and_node()
    self._update_usage_from_instances()
    migrations = objects.MigrationList.get_in_progress_by_host_and_node()
    self._update_usage_from_migrations()
In post_live_migration_at_destination() we do this (updating the host and node
as well as the task state):
    instance.host = self.host
    instance.task_state = None
    instance.node = node_name
    instance.save(expected_task_state=task_states.MIGRATING)
And in _post_live_migration() we update the migration status to "completed":
    if migrate_data and migrate_data.get('migration'):
        migrate_data['migration'].status = 'completed'
        migrate_data['migration'].save()
Neither of the latter two routines is serialized by the
COMPUTE_RESOURCE_SEMAPHORE, so they can race with the code in
_update_available_resource().
I'm wondering if we can have a situation like this:
1) migration in progress
2) We start running _update_available_resource() on destination, and we call
instances = objects.InstanceList.get_by_host_and_node(). This will not return
the migrating instance, because it is not yet on the destination host.
3) The migration completes and we call post_live_migration_at_destination(),
which sets the host/node/task_state on the instance.
4) In _update_available_resource() on destination, we call migrations =
objects.MigrationList.get_in_progress_by_host_and_node(). This will return the
migration for the instance in question, but when we run
self._update_usage_from_migrations() the uuid will not be in "instances" and so
we will use the instance from the newly-queried migration. We will then ignore
the instance because it is not in a "migrating" state.
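
To make the interleaving in steps 1-4 concrete, here is a toy model of it. This
is only a sketch under assumptions: ToyDB, Instance, Migration, audit() and the
between_queries hook are hypothetical stand-ins, not nova code, and the real
_update_usage_from_migrations() logic is more involved than what is shown. It
models only the behaviour described above (two separate queries, with the
instance updated in between).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

MIGRATING = "migrating"


@dataclass
class Instance:
    uuid: str
    host: str
    task_state: Optional[str]
    vcpus: int


@dataclass
class Migration:
    instance_uuid: str
    dest_host: str
    status: str


class ToyDB:
    """Hypothetical in-memory stand-in for the instance/migration tables."""

    def __init__(self) -> None:
        self.instances: Dict[str, Instance] = {}
        self.migrations: List[Migration] = []

    def instances_on_host(self, host: str) -> List[Instance]:
        return [i for i in self.instances.values() if i.host == host]

    def in_progress_migrations(self, host: str) -> List[Migration]:
        return [m for m in self.migrations
                if m.dest_host == host and m.status != "completed"]


def post_live_migration_at_destination(db: ToyDB, uuid: str, host: str) -> None:
    # Step 3: the instance now belongs to the destination and is no longer
    # in the "migrating" task state.
    inst = db.instances[uuid]
    inst.host = host
    inst.task_state = None


def audit(db: ToyDB, host: str,
          between_queries: Optional[Callable[[], None]] = None) -> int:
    """Toy audit: return the vCPUs accounted for on `host`."""
    used = 0
    tracked = set()
    # Step 2: first query, instances already assigned to this host.
    for inst in db.instances_on_host(host):
        used += inst.vcpus
        tracked.add(inst.uuid)
    # Hook used to force the racing update between the two queries.
    if between_queries is not None:
        between_queries()
    # Step 4: second query, in-progress migrations targeting this host.
    for mig in db.in_progress_migrations(host):
        if mig.instance_uuid in tracked:
            continue  # already counted via the instance list
        inst = db.instances[mig.instance_uuid]  # freshly loaded instance
        if inst.task_state != MIGRATING:
            continue  # ignored: its resources are never counted
        used += inst.vcpus
    return used


def make_db() -> ToyDB:
    db = ToyDB()
    db.instances["u1"] = Instance("u1", "src", MIGRATING, vcpus=4)
    db.migrations.append(Migration("u1", "dest", "running"))
    return db


# No race: the instance is counted through its in-progress migration.
quiet_db = make_db()
usage_without_race = audit(quiet_db, "dest")

# Race: the instance is updated between the two queries.
racy_db = make_db()
usage_with_race = audit(
    racy_db, "dest",
    between_queries=lambda: post_live_migration_at_destination(
        racy_db, "u1", "dest"))
```

In this model the racy audit accounts for none of the instance's vCPUs even
though the instance now lives on the destination, which is the "lost"
resources case described below.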
Am I imagining things, or is there a race here? If so, the negative effects
would be that the resources of the migrating instance would be "lost", allowing
a newly-scheduled instance to claim the same resources (PCI devices, pinned
CPUs, etc.)
Chris