[openstack-dev] realtime kvm cpu affinities

Chris Friesen chris.friesen at windriver.com
Tue Jun 27 15:25:14 UTC 2017

On 06/27/2017 01:44 AM, Sahid Orentino Ferdjaoui wrote:
> On Mon, Jun 26, 2017 at 10:19:12AM +0200, Henning Schild wrote:
>> On Sun, 25 Jun 2017 10:09:10 +0200,
>> Sahid Orentino Ferdjaoui <sferdjao at redhat.com> wrote:
>>> On Fri, Jun 23, 2017 at 10:34:26AM -0600, Chris Friesen wrote:
>>>> On 06/23/2017 09:35 AM, Henning Schild wrote:
>>>>> On Fri, 23 Jun 2017 11:11:10 +0200,
>>>>> Sahid Orentino Ferdjaoui <sferdjao at redhat.com> wrote:
>>>>>> In Linux RT context, and as you mentioned, the non-RT vCPU can
>>>>>> acquire some guest kernel lock, then be pre-empted by the emulator
>>>>>> thread while holding this lock. This situation blocks RT vCPUs
>>>>>> from doing their work. So that is why we have implemented [2].
>>>>>> For DPDK I don't think we have such problems because it's
>>>>>> running in userland.
>>>>>> So for the DPDK context I think we could have a mask like we have
>>>>>> for RT, basically leaving vCPU0 to handle best-effort
>>>>>> work (emulator threads, SSH...). I think it's the current
>>>>>> pattern used by DPDK users.
>>>>> DPDK is just a library and one can imagine an application that has
>>>>> cross-core communication/synchronisation needs where the emulator
>>>>> slowing down vcpu0 will also slow down vcpu1. Your DPDK application
>>>>> would have to know which of its cores did not get a full pcpu.
>>>>> I am not sure what the DPDK-example is doing in this discussion,
>>>>> would that not just be cpu_policy=dedicated? I guess normal
>>>>> behaviour of dedicated is that emulators and io happily share
>>>>> pCPUs with vCPUs and you are looking for a way to restrict
>>>>> emulators/io to a subset of pCPUs because you can live with some
>>>>> of them being not 100%.
>>>> Yes.  A typical DPDK-using VM might look something like this:
>>>> vCPU0: non-realtime, housekeeping and I/O, handles all virtual
>>>> interrupts and "normal" linux stuff, emulator runs on same pCPU
>>>> vCPU1: realtime, runs in tight loop in userspace processing packets
>>>> vCPU2: realtime, runs in tight loop in userspace processing packets
>>>> vCPU3: realtime, runs in tight loop in userspace processing packets
>>>> In this context, vCPUs 1-3 don't really ever enter the kernel, and
>>>> we've offloaded as much kernel work as possible from them onto
>>>> vCPU0.  This works pretty well with the current system.
>>>>>> For RT we have to isolate the emulator threads to an additional
>>>>>> pCPU per guest or, as you are suggesting, to a set of pCPUs for
>>>>>> all the guests running.
>>>>>> I think we should introduce a new option:
>>>>>>     - hw:cpu_emulator_threads_mask=^1
>>>>>> If set in 'nova.conf', that mask will be applied to the set of all
>>>>>> host CPUs (vcpu_pin_set) to basically pack the emulator threads
>>>>>> of all VMs running here (useful for RT context).
>>>>> That would allow modelling exactly what we need.
>>>>> In nova.conf we are talking absolute known values, no need for a
>>>>> mask and a set is much easier to read. Also using the same name
>>>>> does not sound like a good idea.
>>>>> And the name vcpu_pin_set clearly suggests what kind of load runs
>>>>> here, if using a mask it should be called pin_set.
>>>> I agree with Henning.
>>>> In nova.conf we should just use a set, something like
>>>> "rt_emulator_vcpu_pin_set" which would be used for running the
>>>> emulator/io threads of *only* realtime instances.
>>> I don't agree with you. We have a set of pCPUs and we want to
>>> subtract some of them for the emulator threads. We need a mask. The
>>> only set we need is the one selecting which pCPUs Nova can use
>>> (vcpu_pin_set).
>> At that point it does not really matter whether it is a set or a mask.
>> They can both express the same thing, but a set is easier to
>> read/configure. With the same argument you could say that
>> vcpu_pin_set should be a mask over the host's pCPUs.
>> As I said before: vcpu_pin_set should be renamed because all sorts of
>> threads are put here (pcpu_pin_set?). But that would be a bigger change
>> and should be discussed as a separate issue.
>> So far we talked about a compute-node for realtime only doing realtime.
>> In that case vcpu_pin_set + emulator_io_mask would work. If you want to
>> run regular VMs on the same host, you can run a second nova, like we do.
>> We could also use vcpu_pin_set + rt_vcpu_pin_set(/mask). I think that
>> would allow modelling all cases in just one nova. Having all in one
>> nova, you could potentially repurpose rt cpus to best-effort and back.
>> Some day in the future ...
> That is not something we should allow, or at least not
> advertise. A compute node can't run both RT and non-RT guests,
> because the nodes should have an RT kernel. We can't guarantee RT if
> both are on the same node.

A compute node with an RT OS could run RT and non-RT guests at the same time 
just fine.  In a small cloud (think hyperconverged with maybe two nodes total) 
it's not viable to dedicate an entire node to just RT loads.

I'd personally rather see nova able to handle a mix of RT and non-RT than need 
to run multiple nova instances on the same node and figure out an up-front split 
of resources between RT nova and non-RT nova.  Better to allow nova to 
dynamically allocate resources as needed.
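
As a rough illustration of the set-versus-mask point discussed above, here is a
minimal Python sketch. It is not nova code: the helper names are made up, the
parser only mimics the range/exclusion syntax that vcpu_pin_set accepts, and the
emulator set and mask are just the proposals from this thread. It shows that
either form yields the same split of the host pCPUs into an emulator pool and a
vCPU pool.

```python
def parse_cpu_spec(spec):
    """Parse a spec such as "4-12,^8,15" into a set of CPU ids.

    Hypothetical helper: imitates the vcpu_pin_set syntax, not nova's parser.
    """
    include, exclude = set(), set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        target = include
        if token.startswith("^"):
            # A leading "^" marks CPUs to exclude from the result.
            target = exclude
            token = token[1:]
        if "-" in token:
            start, end = (int(part) for part in token.split("-", 1))
            target.update(range(start, end + 1))
        else:
            target.add(int(token))
    return include - exclude


def split_with_set(host_cpus, emulator_spec):
    """Set style: the emulator pCPUs are named explicitly."""
    emulator = parse_cpu_spec(emulator_spec)
    if not emulator <= host_cpus:
        raise ValueError("emulator set must be a subset of the host set")
    return host_cpus - emulator, emulator


def split_with_mask(host_cpus, mask):
    """Mask style: "^N" entries are subtracted from the host set."""
    excluded = parse_cpu_spec(mask.replace("^", ""))
    return host_cpus - excluded, host_cpus & excluded


host_cpus = parse_cpu_spec("4-12,^8,15")        # pCPUs nova may use
by_set = split_with_set(host_cpus, "4-5")       # explicit emulator set
by_mask = split_with_mask(host_cpus, "^4,^5")   # mask over the host set

# Both describe the same (vCPU pool, emulator pool) pair.
print(by_set == by_mask)
```

Since the two forms are interchangeable, the choice comes down to which one is
easier to read and configure in nova.conf.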

