[openstack-dev] [nova] Plans to fix numa_topology related issues with migration/resize/evacuate

Wensley, Barton Barton.Wensley at windriver.com
Wed Mar 4 15:17:24 UTC 2015


Hi,

I have been exercising the NUMA topology related features in Kilo (CPU
pinning, NUMA topology, huge pages) and have seen that there are issues
when an operation moves an instance between compute nodes. In summary,
the numa_topology is not recalculated for the destination node, which 
results in the instance running with the wrong topology (or even 
failing to run if the topology isn't supported on the destination). 
This impacts live migration, cold migration, resize and evacuate.

I have spent some time over the last couple of weeks and have a working
fix for these issues that I would like to push upstream. The fix for
cold migration and resize is the most straightforward, so I plan to
start there.

At a high level, here is what I have done to fix cold migrate and 
resize:
- Add source_numa_topology and dest_numa_topology fields to the
  migration object and the migrations table.
- When a resize_claim is done, store the claimed numa topology in the
  dest_numa_topology in the migration record. Also store the current 
  numa topology as the source_numa_topology in the migration record.
- Use the source_numa_topology and dest_numa_topology from the 
  migration record in the resource accounting when referencing 
  migration claims as appropriate. This is done for claims, dropped 
  claims and the resource audit.
- After the cold migration/resize finishes, set the numa_topology in
  the instance to the dest_numa_topology from the migration object.
  This is done in the finish_resize RPC on the destination compute, to
  match where the rest of the resources for the instance are updated
  (there is a call to _set_instance_info here that sets the memory,
  vcpus, disk space, etc. for the migrated instance).
- If the cold migration/resize is reverted, set the numa_topology in
  the instance back to the source_numa_topology from the migration
  object. This is done in the finish_revert_resize RPC on the source
  compute.
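
To make the flow above concrete, here is a minimal standalone sketch in
Python. It is not the nova code: the Instance and Migration classes are
simplified stand-ins, topologies are plain dicts, and the three
functions only borrow the names resize_claim, finish_resize and
finish_revert_resize to show where each topology would be stored and
applied. Resource accounting is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    # Hypothetical stand-in for the instance object.
    uuid: str
    numa_topology: Optional[dict] = None


@dataclass
class Migration:
    # Hypothetical stand-in for the migration record, with the two
    # proposed fields.
    instance_uuid: str
    source_numa_topology: Optional[dict] = None
    dest_numa_topology: Optional[dict] = None


def resize_claim(instance: Instance, claimed_topology: dict) -> Migration:
    """Record both topologies in the migration when the claim is made."""
    return Migration(
        instance_uuid=instance.uuid,
        source_numa_topology=instance.numa_topology,
        dest_numa_topology=claimed_topology,
    )


def finish_resize(instance: Instance, migration: Migration) -> None:
    """Destination compute: apply the topology claimed for the new host."""
    instance.numa_topology = migration.dest_numa_topology


def finish_revert_resize(instance: Instance, migration: Migration) -> None:
    """Source compute: restore the topology the instance had before."""
    instance.numa_topology = migration.source_numa_topology
```

The point of keeping both topologies in the migration record is that
either outcome (confirm or revert) can be applied without recalculating
anything on the other node.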

I would appreciate any comments on my approach. I plan to start
submitting the code for this against bug 1417667 - I will split it
into several chunks to make it easier to review.

Fixing live migration was significantly more effort - I'll start a
different thread on that once I have feedback on the above approach.

Thanks,

Bart Wensley, Member of Technical Staff, Wind River



