[openstack-dev] realtime kvm cpu affinities

Chris Friesen chris.friesen at windriver.com
Tue Jun 27 15:28:34 UTC 2017


On 06/27/2017 01:45 AM, Sahid Orentino Ferdjaoui wrote:
> On Mon, Jun 26, 2017 at 12:12:49PM -0600, Chris Friesen wrote:
>> On 06/25/2017 02:09 AM, Sahid Orentino Ferdjaoui wrote:
>>> On Fri, Jun 23, 2017 at 10:34:26AM -0600, Chris Friesen wrote:
>>>> On 06/23/2017 09:35 AM, Henning Schild wrote:
>>>>> Am Fri, 23 Jun 2017 11:11:10 +0200
>>>>> schrieb Sahid Orentino Ferdjaoui <sferdjao at redhat.com>:
>>>>
>>>>>> In a Linux RT context, as you mentioned, a non-RT vCPU can acquire
>>>>>> a guest kernel lock and then be pre-empted by an emulator thread
>>>>>> while holding it. This blocks the RT vCPUs from doing their work,
>>>>>> which is why we implemented [2]. For DPDK I don't think we have
>>>>>> such problems because it's running in userland.
>>>>>>
>>>>>> So for the DPDK context I think we could have a mask like we have
>>>>>> for RT, basically leaving vCPU0 to handle best-effort work (emulator
>>>>>> threads, SSH...). I think it's the current pattern used by DPDK users.
>>>>>
>>>>> DPDK is just a library and one can imagine an application that has
>>>>> cross-core communication/synchronisation needs where the emulator
>>>>> slowing down vcpu0 will also slow down vcpu1. Your DPDK application would
>>>>> have to know which of its cores did not get a full pcpu.
>>>>>
>>>>> I am not sure what the DPDK-example is doing in this discussion, would
>>>>> that not just be cpu_policy=dedicated? I guess normal behaviour of
>>>>> dedicated is that emulators and io happily share pCPUs with vCPUs and
>>>>> you are looking for a way to restrict emulators/io to a subset of pCPUs
>>>>> because you can live with some of them not being 100% available.
>>>>
>>>> Yes.  A typical DPDK-using VM might look something like this:
>>>>
>>>> vCPU0: non-realtime, housekeeping and I/O, handles all virtual interrupts
>>>> and "normal" linux stuff, emulator runs on same pCPU
>>>> vCPU1: realtime, runs in tight loop in userspace processing packets
>>>> vCPU2: realtime, runs in tight loop in userspace processing packets
>>>> vCPU3: realtime, runs in tight loop in userspace processing packets
>>>>
>>>> In this context, vCPUs 1-3 don't really ever enter the kernel, and we've
>>>> offloaded as much kernel work as possible from them onto vCPU0.  This works
>>>> pretty well with the current system.
>>>>
>>>>>> For RT we have to isolate the emulator threads to an additional pCPU
>>>>>> per guest or, as you are suggesting, to a set of pCPUs for all the
>>>>>> running guests.
>>>>>>
>>>>>> I think we should introduce a new option:
>>>>>>
>>>>>>      - hw:cpu_emulator_threads_mask=^1
>>>>>>
>>>>>> If set in 'nova.conf', that mask will be applied to the set of all
>>>>>> host CPUs (vcpu_pin_set), basically packing the emulator threads of
>>>>>> all VMs running there (useful for RT context).
>>>>>
>>>>> That would allow modelling exactly what we need.
>>>>> In nova.conf we are talking absolute known values, no need for a mask
>>>>> and a set is much easier to read. Also using the same name does not
>>>>> sound like a good idea.
>>>>> And the name vcpu_pin_set clearly suggests what kind of load runs here,
>>>>> if using a mask it should be called pin_set.
>>>>
>>>> I agree with Henning.
>>>>
>>>> In nova.conf we should just use a set, something like
>>>> "rt_emulator_vcpu_pin_set" which would be used for running the emulator/io
>>>> threads of *only* realtime instances.
>>>
>>> I don't agree with you: we have a set of pCPUs and we want to
>>> subtract some of them for the emulator threads. We need a mask. The
>>> only set we need is the one that selects which pCPUs Nova can use
>>> (vcpu_pin_set).
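
To make the mask-versus-set disagreement concrete, here is a minimal sketch of
how a mask could be applied to a pin set. The helper names and the reading of
"^N" as "take pCPU N out of the vCPU pool and give it to the emulator threads"
are assumptions for illustration only; this is not Nova's code, and the thread
has not settled on these semantics.

```python
def expand(item):
    """Expand "5" or "4-7" into a set of CPU ids."""
    if "-" in item:
        first, last = (int(part) for part in item.split("-", 1))
        return set(range(first, last + 1))
    return {int(item)}


def parse_cpu_spec(spec):
    """Parse a spec such as "4-12,^8,15" into a set of CPU ids.

    Hypothetical helper: plain ids and ranges are included, entries
    prefixed with "^" are excluded.
    """
    included, excluded = set(), set()
    for item in (part.strip() for part in spec.split(",")):
        if not item:
            continue
        if item.startswith("^"):
            excluded |= expand(item[1:])
        else:
            included |= expand(item)
    return included - excluded


def split_by_mask(pin_set_spec, mask_spec):
    """Split host pCPUs into (vCPU pool, emulator pool) using a mask."""
    host_cpus = parse_cpu_spec(pin_set_spec)
    # Assumed semantics: each "^N" entry in the mask hands pCPU N to the
    # emulator threads; entries without "^" are ignored in this sketch.
    emulator_cpus = set()
    for item in (part.strip() for part in mask_spec.split(",")):
        if item.startswith("^"):
            emulator_cpus |= expand(item[1:])
    emulator_cpus &= host_cpus
    return host_cpus - emulator_cpus, emulator_cpus


vcpu_pool, emulator_pool = split_by_mask("0-7", "^1")
print(sorted(vcpu_pool), sorted(emulator_pool))
```

An explicit set in nova.conf would simply state the second value directly,
which is the readability argument made earlier in the thread.
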
>>>
>>>> We may also want to have "rt_emulator_overcommit_ratio" to control how many
>>>> threads/instances we allow per pCPU.
>>>
>>> I'm not sure I understand this point. If it is to indicate that we
>>> want X guests' emulator threads per isolated pCPU, the same behavior
>>> is achieved by the mask. A realtime host is dedicated to realtime,
>>> with no overcommitment, and the operators know the number of host
>>> CPUs, so they can easily deduce a ratio and the corresponding mask.
>>
>> Suppose I have a host with 64 CPUs.  I reserve three for host overhead and
>> networking, leaving 61 for instances.  If I have instances with one non-RT
>> vCPU and one RT vCPU then I can run 30 instances.  If instead my instances
>> have one non-RT and 5 RT vCPUs then I can run 10 instances.  If I put all of
>> my emulator threads on the same pCPU, it might make a difference whether I
>> put 30 sets of emulator threads or 10 sets.
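
The capacity arithmetic in this example can be checked with a short sketch. It
assumes that every vCPU, RT or non-RT, consumes one whole pCPU, and it ignores
NUMA placement and any pCPUs set aside for emulator threads. The function name
is hypothetical.

```python
def max_instances(host_cpus, reserved, vcpus_per_instance):
    """Instances that fit when each vCPU takes one dedicated pCPU."""
    usable = host_cpus - reserved
    return usable // vcpus_per_instance


# One non-RT vCPU plus one RT vCPU per instance.
small = max_instances(host_cpus=64, reserved=3, vcpus_per_instance=1 + 1)
# One non-RT vCPU plus five RT vCPUs per instance.
large = max_instances(host_cpus=64, reserved=3, vcpus_per_instance=1 + 5)

# With one shared emulator pCPU, each instance adds one set of emulator
# threads to it, so these counts are also the number of sets on that pCPU.
print(small, large)
```
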
>
> Oh, I understand your point now, but I'm not sure it is going to make
> any difference. I would say the load on the isolated cores is probably
> going to be the same. The only overhead will be the number of threads
> handled, which will be slightly higher in your first scenario.
>
>> The proposed "rt_emulator_overcommit_ratio" would simply say "nova is
>> allowed to run X instances' worth of emulator threads on each pCPU in
>> 'rt_emulator_vcpu_pin_set'".  If we've hit that threshold, then no more RT
>> instances are allowed to schedule on this compute node (but non-RT instances
>> would still be allowed).
>
> Also, I don't think we want to schedule where the emulator threads of
> the guests should be pinned on the isolated cores. We will let them
> float on the set of isolated cores. If there is a requirement to have
> them pinned, then the current implementation will probably be enough.

Once you use "isolcpus" on the host, the host scheduler won't "float" threads 
between the CPUs based on load.  To get the float behaviour you'd have to not 
isolate the pCPUs that will be used for emulator threads, but then you run the 
risk of the host running other work on those pCPUs (unless you use cpusets or 
something to isolate the host work to a subset of non-isolcpus pCPUs).
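
Since isolated pCPUs are not load-balanced by the host scheduler, something has
to pick a pCPU explicitly for each RT instance's emulator threads. A minimal
sketch of that choice, combined with the per-pCPU limit proposed above, might
look like this. The class and method names are hypothetical, and
"rt_emulator_vcpu_pin_set" and "rt_emulator_overcommit_ratio" are only
proposals from this thread, not existing Nova options.

```python
class EmulatorPlacer:
    """Track emulator-thread placement on a fixed set of pCPUs."""

    def __init__(self, emulator_pcpus, overcommit_ratio):
        # Number of instances whose emulator threads sit on each pCPU.
        self.load = {pcpu: 0 for pcpu in emulator_pcpus}
        self.ratio = overcommit_ratio

    def place(self):
        """Return the least-loaded pCPU, or None if every pCPU is full."""
        if not self.load:
            return None
        # Break ties on the lowest pCPU id so placement is deterministic.
        pcpu = min(self.load, key=lambda cpu: (self.load[cpu], cpu))
        if self.load[pcpu] >= self.ratio:
            return None  # threshold hit: no more RT instances on this host
        self.load[pcpu] += 1
        return pcpu

    def release(self, pcpu):
        """Free one slot when an RT instance goes away."""
        if self.load.get(pcpu, 0) > 0:
            self.load[pcpu] -= 1


placer = EmulatorPlacer(emulator_pcpus=[2, 3], overcommit_ratio=2)
placements = [placer.place() for _ in range(5)]
print(placements)
```

A None result corresponds to refusing further RT instances on the compute
node; non-RT instances would not go through this check at all.
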

Chris



