[openstack-dev] [nova] NUMA-aware live migration: easy but incomplete vs complete but hard

Jay Pipes jaypipes at gmail.com
Thu Jun 21 13:36:58 UTC 2018


On 06/18/2018 10:16 AM, Artom Lifshitz wrote:
> Hey all,
> 
> For Rocky I'm trying to get live migration to work properly for
> instances that have a NUMA topology [1].
> 
> A question that came up on one of the patches [2] is how to handle
> resources claims on the destination, or indeed whether to handle that
> at all.
> 
> The previous attempt's approach [3] (call it A) was to use the
> resource tracker. This is race-free and the "correct" way to do it,
> but the code is pretty opaque and not easily reviewable, as evidenced
> by [3] sitting in review purgatory for literally years.
> 
> A simpler approach (call it B) is to ignore resource claims entirely
> for now and wait for NUMA in placement to land in order to handle it
> that way. This is obviously race-prone and not the "correct" way of
> doing it, but the code would be relatively easy to review.
> 
> For the longest time, live migration did not keep track of resources
> (until it started updating placement allocations). The message to
> operators was essentially "we're giving you this massive hammer, don't
> break your fingers." Continuing to ignore resource claims for now is
> just maintaining the status quo. In addition, there is value in
> improving NUMA live migration *now*, even if the improvement is
> incomplete because it's missing resource claims. "Best is the enemy of
> good" and all that. Finally, making use of the resource tracker is
> just work that we know will get thrown out once we start using
> placement for NUMA resources.
> 
> For all those reasons, I would favor approach B, but I wanted to ask
> the community for their thoughts.
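
To make the race in approach B concrete, here's a minimal sketch (not 
Nova code; the names and numbers are made up): with nothing claiming 
resources on the destination, two concurrent migrations read the same 
view of the free pinned pCPUs and end up choosing the same ones.

    # Minimal sketch (not Nova code) of the race in approach B: nothing
    # claims pinned pCPUs on the destination, so two concurrent live
    # migrations see the same "free" set and pick overlapping pinnings.
    free_pcpus = {0, 1, 2, 3}  # destination host's unpinned pCPUs

    def pick_pcpus_without_claim(needed):
        # Each migration reads the same snapshot; nothing is reserved.
        return set(sorted(free_pcpus)[:needed])

    migration_a = pick_pcpus_without_claim(2)  # {0, 1}
    migration_b = pick_pcpus_without_claim(2)  # {0, 1} again -- overlap!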

Side question... does either approach touch PCI device management during 
live migration?

I ask because the only workloads I've ever seen that pin guest vCPU 
threads to specific host processors -- or make use of huge pages 
consumed from a specific host NUMA node -- have also made use of SR-IOV 
and/or PCI passthrough. [1]
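
For concreteness, such a workload typically looks something like the 
following on the Nova side (the key names are standard flavor extra 
specs, the values are only examples, and the SR-IOV VF itself is 
normally requested through a Neutron port with vnic_type "direct" 
rather than through the flavor):

    # Illustrative only: flavor extra specs for the kind of VNF workload
    # described above.  The key names are real Nova extra specs; the
    # values and the "nic-vf" PCI alias are made-up examples.
    vnf_flavor_extra_specs = {
        "hw:numa_nodes": "1",          # confine the guest to one NUMA node
        "hw:cpu_policy": "dedicated",  # pin guest vCPUs to host pCPUs
        "hw:mem_page_size": "1GB",     # back guest RAM with 1 GiB huge pages
        "pci_passthrough:alias": "nic-vf:1",  # made-up passthrough NIC alias
    }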

If workloads that use PCI passthrough or SR-IOV VFs cannot be live 
migrated (due to existing complications in the lower-level virt layers), 
I don't see much point in spending lots of developer resources trying to 
"fix" this situation, when in the real world the only workloads it would 
help are the mythical ones that use CPU pinning or huge pages but 
*don't* use PCI passthrough or SR-IOV VFs.

Best,
-jay

[1] I know I'm only one person, but every workload I've seen that 
requires pinned CPUs and/or huge pages has been a VNF: essentially an 
ASIC that a telco OEM/vendor has converted into software, and which 
requires the same guarantees that the original custom hardware gave the 
hardware-based workload it replaced. These VNFs, every single one of 
them, used either PCI passthrough or SR-IOV VFs to handle 
latency-sensitive network I/O.


