[openstack-dev] [nova] theoretical race between live migration and resource audit?
lương hữu tuấn
tuantuluong at gmail.com
Fri Jun 10 09:20:11 UTC 2016
Yes, it is actually a race, and we have already hit a negative effect of it
when using evacuation: some CPU pinning information is lost. Imagine that, in
some cases, we perform a re-scheduling action (evacuate, live-migration,
etc.) and then immediately perform the next action (delete, resize, etc.)
before the resource_tracker updates in its next period. In that case, the
follow-up action fails.
It also has a downside when writing tests based on the scenario
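
To make the interleaving from Chris's mail (quoted below) concrete, here is a
minimal sketch in plain Python. It is a toy model, not Nova code: the Instance
and Migration classes, the helper functions, and the 4 pinned CPUs are all
hypothetical stand-ins, and the skip rule in usage_from() only mirrors the
behaviour the scenario describes. It runs one audit with the migration
completing between the two queries, then a second audit with no interference.

```python
# Toy model of the interleaving; none of this is real Nova code.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    uuid: str
    host: str
    task_state: Optional[str]
    pinned_cpus: int


@dataclass
class Migration:
    instance: Instance
    dest_host: str
    status: str = "running"


def instances_on_host(all_instances, host):
    """Hypothetical stand-in for InstanceList.get_by_host_and_node()."""
    return {i.uuid: i for i in all_instances if i.host == host}


def in_progress_migrations(all_migrations, host):
    """Hypothetical stand-in for MigrationList.get_in_progress_by_host_and_node()."""
    return [m for m in all_migrations
            if m.dest_host == host and m.status != "completed"]


def usage_from(instances, migrations):
    """Count pinned CPUs the way the scenario describes the audit doing it."""
    used = sum(i.pinned_cpus for i in instances.values())
    for m in migrations:
        if m.instance.uuid in instances:
            continue  # already counted through the instance list
        if m.instance.task_state != "migrating":
            continue  # no longer looks like a migration, so it is ignored
        used += m.instance.pinned_cpus
    return used


def post_live_migration_at_destination(instance, host):
    """Mirrors the host/task_state update quoted in the thread."""
    instance.host = host
    instance.task_state = None


def audit(all_instances, all_migrations, host, between_queries=None):
    instances = instances_on_host(all_instances, host)          # step 2
    if between_queries is not None:
        between_queries()                                       # step 3
    migrations = in_progress_migrations(all_migrations, host)   # step 4
    return usage_from(instances, migrations)


inst = Instance("uuid-1", host="src", task_state="migrating", pinned_cpus=4)
mig = Migration(inst, dest_host="dest")

# First audit: the migration completes between the two queries.
lost = audit([inst], [mig], "dest",
             between_queries=lambda: post_live_migration_at_destination(inst, "dest"))
# Second audit: nothing interferes, the instance is now found on the host.
healed = audit([inst], [mig], "dest")
print(lost, healed)
```

Under these assumptions the first audit reports 0 pinned CPUs in use on the
destination and the second reports 4. That is the window in which another
instance could claim the same CPUs, and it also matches Matt's point that the
next periodic run repairs the accounting.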
On Fri, Jun 10, 2016 at 10:36 AM, Matthew Booth <mbooth at redhat.com> wrote:
> Yes, this is a race.
> However, it's my understanding that this is 'ok'. The resource tracker
> doesn't claim to be 100% accurate at all times, right? Otherwise why would
> it update itself in a periodic task in the first place? It's my understanding
> that the resource tracker is basically a best effort cache, and that
> scheduling decisions can still fail at the host. The resource tracker will
> fix itself next time it runs via its periodic task.
> Matt (not a scheduler person)
> On Thu, Jun 9, 2016 at 10:41 PM, Chris Friesen <
> chris.friesen at windriver.com> wrote:
>> I'm wondering if we might have a race between live migration and the
>> resource audit. I've included a few people on the receiver list that have
>> worked directly with this code in the past.
>> In _update_available_resource() we have code that looks like this:
>> instances = objects.InstanceList.get_by_host_and_node()
>> migrations = objects.MigrationList.get_in_progress_by_host_and_node()
>> In post_live_migration_at_destination() we do this (updating the host and
>> node as well as the task state):
>> instance.host = self.host
>> instance.task_state = None
>> instance.node = node_name
>> And in _post_live_migration() we update the migration status to
>> 'completed':
>> if migrate_data and migrate_data.get('migration'):
>>     migrate_data['migration'].status = 'completed'
>> Both of the latter routines are not serialized by the
>> COMPUTE_RESOURCE_SEMAPHORE, so they can race relative to the code in
>> I'm wondering if we can have a situation like this:
>> 1) migration in progress
>> 2) We start running _update_available_resource() on destination, and we
>> call instances = objects.InstanceList.get_by_host_and_node(). This will
>> not return the migrating instance, because it is not yet on the
>> destination host.
>> 3) The migration completes and we call
>> post_live_migration_at_destination(), which sets the host/node/task_state
>> on the instance.
>> 4) In _update_available_resource() on destination, we call migrations =
>> objects.MigrationList.get_in_progress_by_host_and_node(). This will return
>> the migration for the instance in question, but when we run
>> self._update_usage_from_migrations() the uuid will not be in "instances"
>> and so we will use the instance from the newly-queried migration. We will
>> then ignore the instance because it is not in a "migrating" state.
>> Am I imagining things, or is there a race here? If so, the negative
>> effects would be that the resources of the migrating instance would be
>> "lost", allowing a newly-scheduled instance to claim the same resources
>> (PCI devices, pinned CPUs, etc.)
>> OpenStack Development Mailing List (not for usage questions)
>> OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> Matthew Booth
> Red Hat Engineering, Virtualisation Team
> Phone: +442070094448 (UK)