[openstack-dev] realtime kvm cpu affinities
Henning Schild
henning.schild at siemens.com
Tue Jun 27 15:41:39 UTC 2017
On Tue, 27 Jun 2017 09:25:14 -0600
Chris Friesen <chris.friesen at windriver.com> wrote:
> On 06/27/2017 01:44 AM, Sahid Orentino Ferdjaoui wrote:
> > On Mon, Jun 26, 2017 at 10:19:12AM +0200, Henning Schild wrote:
> >> On Sun, 25 Jun 2017 10:09:10 +0200
> >> Sahid Orentino Ferdjaoui <sferdjao at redhat.com> wrote:
> >>
> >>> On Fri, Jun 23, 2017 at 10:34:26AM -0600, Chris Friesen wrote:
> >>>> On 06/23/2017 09:35 AM, Henning Schild wrote:
> >>>>> On Fri, 23 Jun 2017 11:11:10 +0200
> >>>>> Sahid Orentino Ferdjaoui <sferdjao at redhat.com> wrote:
> >>>>
> >>>>>> In Linux RT context, and as you mentioned, the non-RT vCPU can
> >>>>>> acquire some guest kernel lock, then be pre-empted by the emulator
> >>>>>> thread while holding this lock. This situation blocks RT vCPUs
> >>>>>> from doing their work. So that is why we have implemented [2].
> >>>>>> For DPDK I don't think we have such problems because it's
> >>>>>> running in userland.
> >>>>>>
> >>>>>> So for DPDK context I think we could have a mask like we have
> >>>>>> for RT and basically consider vCPU0 to handle best-effort
> >>>>>> work (emulator threads, SSH...). I think it's the current
> >>>>>> pattern used by DPDK users.
> >>>>>
> >>>>> DPDK is just a library and one can imagine an application that
> >>>>> has cross-core communication/synchronisation needs where the
> >>>>> emulator slowing down vcpu0 will also slow down vcpu1. Your DPDK
> >>>>> application would have to know which of its cores did not get a
> >>>>> full pcpu.
> >>>>>
> >>>>> I am not sure what the DPDK-example is doing in this discussion,
> >>>>> would that not just be cpu_policy=dedicated? I guess normal
> >>>>> behaviour of dedicated is that emulators and io happily share
> >>>>> pCPUs with vCPUs and you are looking for a way to restrict
> >>>>> emulators/io to a subset of pCPUs because you can live with some
> >>>>> of them not being 100%.
> >>>>
> >>>> Yes. A typical DPDK-using VM might look something like this:
> >>>>
> >>>> vCPU0: non-realtime, housekeeping and I/O, handles all virtual
> >>>>        interrupts and "normal" linux stuff, emulator runs on same pCPU
> >>>> vCPU1: realtime, runs in tight loop in userspace processing packets
> >>>> vCPU2: realtime, runs in tight loop in userspace processing packets
> >>>> vCPU3: realtime, runs in tight loop in userspace processing packets
> >>>>
> >>>> In this context, vCPUs 1-3 don't really ever enter the kernel,
> >>>> and we've offloaded as much kernel work as possible from them
> >>>> onto vCPU0. This works pretty well with the current system.
> >>>>
> >>>>>> For RT we have to isolate the emulator threads to an additional
> >>>>>> pCPU per guest or, as you are suggesting, to a set of pCPUs for
> >>>>>> all the guests running.
> >>>>>>
> >>>>>> I think we should introduce a new option:
> >>>>>>
> >>>>>> - hw:cpu_emulator_threads_mask=^1
> >>>>>>
> >>>>>> If set in 'nova.conf', that mask will be applied to the set of
> >>>>>> all host CPUs (vcpu_pin_set) to basically pack the emulator
> >>>>>> threads of all VMs running here (useful for RT context).
> >>>>>
> >>>>> That would allow modelling exactly what we need.
> >>>>> In nova.conf we are talking absolute known values, no need for a
> >>>>> mask and a set is much easier to read. Also using the same name
> >>>>> does not sound like a good idea.
> >>>>> And the name vcpu_pin_set clearly suggests what kind of load runs
> >>>>> here, if using a mask it should be called pin_set.
> >>>>
> >>>> I agree with Henning.
> >>>>
> >>>> In nova.conf we should just use a set, something like
> >>>> "rt_emulator_vcpu_pin_set" which would be used for running the
> >>>> emulator/io threads of *only* realtime instances.
> >>>
> >>> I don't agree with you, we have a set of pCPUs and we want to
> >>> subtract some of them for the emulator threads. We need a mask.
> >>> The only set we need is the one that selects which pCPUs Nova can
> >>> use (vcpu_pin_set).
> >>
> >> At that point it does not really matter whether it is a set or a
> >> mask. They can both express the same thing, but a set is easier to
> >> read/configure. With the same argument you could say that
> >> vcpu_pin_set should be a mask over the host's pcpus.
> >>
> >> As I said before: vcpu_pin_set should be renamed because all sorts
> >> of threads are put here (pcpu_pin_set?). But that would be a
> >> bigger change and should be discussed as a separate issue.
> >>
> >> So far we talked about a compute-node for realtime only doing
> >> realtime. In that case vcpu_pin_set + emulator_io_mask would work.
> >> If you want to run regular VMs on the same host, you can run a
> >> second nova, like we do.
> >>
> >> We could also use vcpu_pin_set + rt_vcpu_pin_set(/mask). I think
> >> that would allow modelling all cases in just one nova. Having all
> >> in one nova, you could potentially repurpose rt cpus to
> >> best-effort and back. Some day in the future ...
> >
> > That is not something we should allow or at least
> > advertise. A compute node can't run both RT and non-RT guests,
> > because the nodes should have an RT kernel. We can't guarantee RT if
> > both are on the same node.
>
> A compute node with an RT OS could run RT and non-RT guests at the
> same time just fine. In a small cloud (think hyperconverged with
> maybe two nodes total) it's not viable to dedicate an entire node to
> just RT loads.
>
> I'd personally rather see nova able to handle a mix of RT and non-RT
> than need to run multiple nova instances on the same node and figure
> out an up-front split of resources between RT nova and non-RT nova.
> Better to allow nova to dynamically allocate resources as needed.
I am with you, except for the "dynamically". That is something one can
think of when the "static" case works.
Henning
> Chris
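
The set-versus-mask point debated in this thread can be illustrated with a small sketch. It is not nova code: parse_cpu_set, split_by_mask and split_by_set are hypothetical helpers, the example values "2-7", "^2" and "2" are made up, and the mask semantics (each "^N" entry takes pCPU N out of the vCPU pool and gives it to the emulator threads) are an assumption based on the proposal quoted above.

```python
def parse_cpu_set(spec):
    """Parse a spec such as "2-7" or "0-3,^1" into a set of CPU ids.

    Simplified stand-in for the vcpu_pin_set syntax (assumption, not nova's parser).
    """
    include, exclude = set(), set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        target = include
        if part.startswith("^"):
            # "^N" marks an exclusion
            target, part = exclude, part[1:]
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-", 1))
            target.update(range(lo, hi + 1))
        else:
            target.add(int(part))
    return include - exclude


def split_by_mask(host_cpus, mask):
    """Mask form: every "^N" entry moves pCPU N from the vCPU pool
    to the emulator threads (assumed semantics, single ids only)."""
    excluded = {int(p.strip()[1:]) for p in mask.split(",")
                if p.strip().startswith("^")}
    emulator = host_cpus & excluded
    return host_cpus - emulator, emulator


def split_by_set(host_cpus, emulator_spec):
    """Set form: the emulator pCPUs are listed explicitly."""
    emulator = parse_cpu_set(emulator_spec) & host_cpus
    return host_cpus - emulator, emulator


host = parse_cpu_set("2-7")  # pCPUs nova may use on this host
# Both notations yield the same (vCPU pool, emulator pool) partition.
assert split_by_mask(host, "^2") == split_by_set(host, "2")
```

Under these assumptions both forms describe the same partition of the host pCPUs, so the choice between them comes down to readability of the configuration.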