[openstack-dev] [nova] NUMA-aware live migration: easy but incomplete vs complete but hard

Sahid Orentino Ferdjaoui sferdjao at redhat.com
Thu Jun 21 14:28:44 UTC 2018


On Thu, Jun 21, 2018 at 09:36:58AM -0400, Jay Pipes wrote:
> On 06/18/2018 10:16 AM, Artom Lifshitz wrote:
> > Hey all,
> > 
> > For Rocky I'm trying to get live migration to work properly for
> > instances that have a NUMA topology [1].
> > 
> > A question that came up on one of the patches [2] is how to handle
> > resource claims on the destination, or indeed whether to handle that
> > at all.
> > 
> > The previous attempt's approach [3] (call it A) was to use the
> > resource tracker. This is race-free and the "correct" way to do it,
> > but the code is pretty opaque and not easily reviewable, as evidenced
> > by [3] sitting in review purgatory for literally years.
> > 
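To make approach A a bit more concrete, a destination-side claim through
the resource tracker would boil down to something like the sketch below.
The live_migration_claim() helper is hypothetical, named here only for
illustration; tying the claim's lifetime to the migration (confirm on
success, drop on rollback) is presumably where most of the hard-to-review
plumbing in [3] lives.

    # Sketch only: a destination-side claim for a live migration
    # (approach A). live_migration_claim() is a hypothetical helper,
    # not an existing ResourceTracker method.
    def claim_on_destination(resource_tracker, context, instance,
                             nodename, limits=None):
        """Claim NUMA/CPU resources on the destination host.

        Expected to raise ComputeResourcesUnavailable if the instance's
        NUMA topology cannot be fitted on the destination, so the
        migration is aborted before anything is moved.
        """
        # Recompute instance.numa_topology against the destination host
        # and hold the result until the migration completes or is
        # rolled back.
        return resource_tracker.live_migration_claim(
            context, instance, nodename, limits=limits)
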
> > A simpler approach (call it B) is to ignore resource claims entirely
> > for now and wait for NUMA in placement to land in order to handle it
> > that way. This is obviously race-prone and not the "correct" way of
> > doing it, but the code would be relatively easy to review.
> > 
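Approach B, with no claim at all, reduces to recomputing the fit on the
destination during the pre-migration checks and accepting that two
concurrent migrations (or boots) can race for the same host CPUs. A
minimal sketch, assuming the existing fitting helper in
nova.virt.hardware is wired into the destination check roughly like
this:

    # Sketch only: approach B recomputes the NUMA fit on the
    # destination but claims nothing, so a concurrent operation can
    # still grab the same pCPUs before this migration lands.
    from nova.virt import hardware

    def fit_on_destination(host_topology, instance_topology):
        # Returns the instance topology pinned against the destination
        # host, or None if it does not fit (the migration should then
        # be aborted).
        return hardware.numa_fit_instance_to_host(host_topology,
                                                  instance_topology)
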
> > For the longest time, live migration did not keep track of resources
> > (until it started updating placement allocations). The message to
> > operators was essentially "we're giving you this massive hammer, don't
> > break your fingers." Continuing to ignore resource claims for now is
> > just maintaining the status quo. In addition, there is value in
> > improving NUMA live migration *now*, even if the improvement is
> > incomplete because it's missing resource claims. "Best is the enemy of
> > good" and all that. Finally, making use of the resource tracker is
> > just work that we know will get thrown out once we start using
> > placement for NUMA resources.
> > 
> > For all those reasons, I would favor approach B, but I wanted to ask
> > the community for their thoughts.
> 
> Side question... does either approach touch PCI device management during
> live migration?
> 
> I ask because the only workloads I've ever seen that pin guest vCPU threads
> to specific host processors -- or make use of huge pages consumed from a
> specific host NUMA node -- have also made use of SR-IOV and/or PCI
> passthrough. [1]

Not really. There are a lot of virtual switches that we support, like
OVS-DPDK, the Contrail Virtual Router and others, that use vhost-user
interfaces, which is one such use case. (We do support live migration
of vhost-user interfaces.)

> If workloads that use PCI passthrough or SR-IOV VFs cannot be live migrated
> (due to existing complications in the lower-level virt layers) I don't see
> much of a point spending lots of developer resources trying to "fix" this
> situation when in the real world, only a mythical workload that uses CPU
> pinning or huge pages but *doesn't* use PCI passthrough or SR-IOV VFs would
> be helped by it.
> 
> Best,
> -jay
> 
> [1] I know I'm only one person, but every workload I've seen that requires
> pinned CPUs and/or huge pages is a VNF that is essentially an ASIC a
> telco OEM/vendor has converted into software, and that requires the same
> guarantees the ASIC and custom hardware gave the original hardware-based
> workload. These VNFs, every single one of them, used either PCI
> passthrough or SR-IOV VFs to handle latency-sensitive network I/O.
> 


