[openstack-dev] realtime kvm cpu affinities

Chris Friesen chris.friesen at windriver.com
Mon Jun 26 18:12:49 UTC 2017

On 06/25/2017 02:09 AM, Sahid Orentino Ferdjaoui wrote:
> On Fri, Jun 23, 2017 at 10:34:26AM -0600, Chris Friesen wrote:
>> On 06/23/2017 09:35 AM, Henning Schild wrote:
>>> On Fri, 23 Jun 2017 11:11:10 +0200,
>>> Sahid Orentino Ferdjaoui <sferdjao at redhat.com> wrote:
>>>> In Linux RT context, and as you mentioned, the non-RT vCPU can acquire
>>>> some guest kernel lock, then be pre-empted by the emulator thread while
>>>> holding this lock. This situation blocks RT vCPUs from doing their
>>>> work. So that is why we have implemented [2]. For DPDK I don't think
>>>> we have such problems because it's running in userland.
>>>> So for DPDK context I think we could have a mask like we have for RT
>>>> and basically consider vCPU0 as handling best-effort work (emulator
>>>> threads, SSH...). I think it's the current pattern used by DPDK users.
>>> DPDK is just a library and one can imagine an application that has
>>> cross-core communication/synchronisation needs where the emulator
>>> slowing down vcpu0 will also slow down vcpu1. Your DPDK application would
>>> have to know which of its cores did not get a full pCPU.
>>> I am not sure what the DPDK-example is doing in this discussion, would
>>> that not just be cpu_policy=dedicated? I guess normal behaviour of
>>> dedicated is that emulators and io happily share pCPUs with vCPUs and
>>> you are looking for a way to restrict emulators/io to a subset of pCPUs
>>> because you can live with some of them not being 100% available.
>> Yes.  A typical DPDK-using VM might look something like this:
>> vCPU0: non-realtime, housekeeping and I/O, handles all virtual interrupts
>> and "normal" linux stuff, emulator runs on same pCPU
>> vCPU1: realtime, runs in tight loop in userspace processing packets
>> vCPU2: realtime, runs in tight loop in userspace processing packets
>> vCPU3: realtime, runs in tight loop in userspace processing packets
>> In this context, vCPUs 1-3 don't really ever enter the kernel, and we've
>> offloaded as much kernel work as possible from them onto vCPU0.  This works
>> pretty well with the current system.
>>>> For RT we have to isolate the emulator threads to an additional pCPU
>>>> per guest or, as you are suggesting, to a set of pCPUs for all the
>>>> guests running.
>>>> I think we should introduce a new option:
>>>>     - hw:cpu_emulator_threads_mask=^1
>>>> If on 'nova.conf' - that mask will be applied to the set of all host
>>>> CPUs (vcpu_pin_set) to basically pack the emulator threads of all VMs
>>>> running here (useful for RT context).
>>> That would allow modelling exactly what we need.
>>> In nova.conf we are talking absolute known values, no need for a mask
>>> and a set is much easier to read. Also using the same name does not
>>> sound like a good idea.
>>> And the name vcpu_pin_set clearly suggests what kind of load runs here,
>>> if using a mask it should be called pin_set.
>> I agree with Henning.
>> In nova.conf we should just use a set, something like
>> "rt_emulator_vcpu_pin_set" which would be used for running the emulator/io
>> threads of *only* realtime instances.
> I don't agree with you. We have a set of pCPUs and we want to
> subtract some of them for the emulator threads. We need a mask. The
> only set we need is the one that selects which pCPUs Nova can use
> (vcpu_pin_set).
>> We may also want to have "rt_emulator_overcommit_ratio" to control how many
>> threads/instances we allow per pCPU.
> I'm not really sure I have understood this point. If it is to indicate
> that for a pCPU isolated we want X guest emulator threads, the same
> behavior is achieved by the mask. A host for realtime is dedicated for
> realtime, no overcommitment and the operators know the number of host
> CPUs, they can easily deduce a ratio and so the corresponding mask.

Suppose I have a host with 64 CPUs.  I reserve three for host overhead and
networking, leaving 61 for instances.  If I have instances with one non-RT vCPU
and one RT vCPU then I can run 30 instances.  If instead my instances have one
non-RT and 5 RT vCPUs then I can run 10 instances.  If I put all of my emulator
threads on the same pCPU, it might make a difference whether that pCPU carries
30 sets of emulator threads or 10 sets.
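
As a rough sketch of that arithmetic: the snippet below assumes every vCPU, RT or not, is pinned to its own dedicated pCPU, and that the count is simply the usable pCPUs divided by the vCPUs per instance, rounded down.  The function name is made up for illustration and is not anything in nova.

```python
def max_instances(host_cpus, reserved_cpus, vcpus_per_instance):
    """Instances that fit when each vCPU gets its own dedicated pCPU.

    Hypothetical helper for illustration only; not a nova function.
    """
    usable_pcpus = host_cpus - reserved_cpus
    # Partial instances cannot be scheduled, so round down.
    return usable_pcpus // vcpus_per_instance


# 64 host CPUs, 3 reserved for host overhead and networking.
small = max_instances(64, 3, vcpus_per_instance=2)  # 1 non-RT + 1 RT vCPU
large = max_instances(64, 3, vcpus_per_instance=6)  # 1 non-RT + 5 RT vCPUs
print(small, large)
```

The two results are also the number of emulator-thread sets that would land on a single shared pCPU in each case, which is why the instance shape matters here.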

The proposed "rt_emulator_overcommit_ratio" would simply say "nova is allowed to
run X instances' worth of emulator threads on each pCPU in
rt_emulator_vcpu_pin_set".  If we've hit that threshold, then no more RT
instances are allowed to schedule on this compute node (but non-RT instances
would still be allowed).
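
To make the intended check concrete, here is a minimal sketch of how such a threshold might be evaluated.  Both option names are only proposals from this thread, and the function and its parameters are hypothetical; none of this exists in nova.

```python
def can_schedule_rt_instance(rt_emulator_vcpu_pin_set,
                             rt_emulator_overcommit_ratio,
                             rt_instances_on_host):
    """Return True if one more RT instance fits under the proposed limit.

    Hypothetical sketch: the capacity is the number of pCPUs set aside
    for RT emulator threads multiplied by the per-pCPU ratio.
    """
    capacity = int(len(rt_emulator_vcpu_pin_set) * rt_emulator_overcommit_ratio)
    return rt_instances_on_host < capacity


# One pCPU reserved for RT emulator threads, at most 12 instances per pCPU.
print(can_schedule_rt_instance({2}, 12, rt_instances_on_host=11))
print(can_schedule_rt_instance({2}, 12, rt_instances_on_host=12))
```

Non-RT instances would bypass this check entirely, since their emulator threads would not be placed on the RT emulator pCPUs.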

