[openstack-dev] realtime kvm cpu affinities
Henning Schild
henning.schild at siemens.com
Fri Jun 23 15:35:18 UTC 2017
On Fri, 23 Jun 2017 11:11:10 +0200,
Sahid Orentino Ferdjaoui <sferdjao at redhat.com> wrote:
> On Wed, Jun 21, 2017 at 12:47:27PM +0200, Henning Schild wrote:
> > On Tue, 20 Jun 2017 10:04:30 -0400,
> > Luiz Capitulino <lcapitulino at redhat.com> wrote:
> >
> > > On Tue, 20 Jun 2017 09:48:23 +0200
> > > Henning Schild <henning.schild at siemens.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > We are using OpenStack for managing realtime guests. We modified
> > > > it and contributed to discussions on how to model the realtime
> > > > feature. More recent versions of OpenStack have support for
> > > > realtime, and there are a few proposals on how to improve that
> > > > further.
> > > >
> > > > But there is still no full answer on how to distribute threads
> > > > across host-cores. The vcpus are easy but for the emulation and
> > > > io-threads there are multiple options. I would like to collect
> > > > the constraints from a qemu/kvm perspective first, and then
> > > > possibly influence the OpenStack development.
> > > >
> > > > I will put the summary/questions first, the text below provides
> > > > more context to where the questions come from.
> > > > - How do you distribute your threads when aiming for the really
> > > > low cyclictest results in the guests? In [3] Rik talked about
> > > > problems like lock holder preemption, starvation etc. but not
> > > > about where/how to schedule emulators and io-threads.
> > >
> > > We put emulator threads and io-threads in housekeeping cores in
> > > the host. I think housekeeping cores is what you're calling
> > > best-effort cores, those are non-isolated cores that will run host
> > > load.
> >
> > As expected, any best-effort/housekeeping core will do but overlap
> > with the vcpu-cores is a bad idea.
> >
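To make this concrete, at the libvirt level such a placement would
look roughly like the following (cpu numbers made up, 4-5 being
isolated cores for the vcpus and 0-1 being housekeeping cores; the
iothreadpin line only applies if the domain defines iothreads):

  <cputune>
    <vcpupin vcpu='0' cpuset='4'/>
    <vcpupin vcpu='1' cpuset='5'/>
    <emulatorpin cpuset='0-1'/>
    <iothreadpin iothread='1' cpuset='0-1'/>
  </cputune>
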
> > > > - Is it ok to put a vcpu and emulator thread on the same core as
> > > > long as the guest knows about it? Think of any oddly behaving
> > > > guest, not just Linux.
> > >
> > > We can't do this for KVM-RT because we run all vcpu threads with
> > > FIFO priority.
> >
> > Same point as above, meaning the "hw:cpu_realtime_mask" approach is
> > wrong for realtime.
> >
> > > However, we have another project with DPDK whose goal is to
> > > achieve zero-loss networking. The configuration required by this
> > > project is very similar to the one required by KVM-RT. One
> > > difference though is that we don't use RT and hence don't use
> > > FIFO priority.
> > >
> > > In this project we've been running with the emulator thread and a
> > > vcpu sharing the same core. As long as the guest housekeeping CPUs
> > > are idle, we don't get any packet drops (most of the time, what
> > > causes packet drops in this test-case would cause spikes in
> > > cyclictest). However, we're seeing some packet drops for certain
> > > guest workloads which we are still debugging.
> >
> > Ok, but that seems to be a different scenario where
> > hw:cpu_policy=dedicated should be sufficient. However, if the io
> > and emulators have to be placed on a subset of the dedicated cpus,
> > something like hw:cpu_realtime_mask would be required.
> >
> > > > - Is it ok to make the emulators potentially slow by running
> > > > them on busy best-effort cores, or will they quickly end up on
> > > > the critical path if you do more than just cyclictest? Our
> > > > experience says we don't need them to be reactive, even with
> > > > rt-networking involved.
> > >
> > > I believe it is ok.
> >
> > Ok.
> >
> > > > Our goal is to reach a high packing density of realtime VMs. Our
> > > > pragmatic first choice was to run all non-vcpu-threads on a
> > > > shared set of pcpus where we also run best-effort VMs and host
> > > > load. Now the OpenStack guys are not too happy with that
> > > > because that is load outside the assigned resources, which
> > > > leads to quota and accounting problems.
> > > >
> > > > So the current OpenStack model is to run those threads next to
> > > > one or more vcpu-threads. [1] You will need to remember that
> > > > the vcpus in question should not be your rt-cpus in the guest.
> > > > I.e. if vcpu0 shares its pcpu with the hypervisor noise, your
> > > > preempt-rt guest would use isolcpus=1.
> > > >
> > > > Is that kind of pcpu sharing really a good idea? I could
> > > > imagine things like smp housekeeping (cache invalidation etc.)
> > > > eventually causing vcpu1 to wait for an emulator that is stuck
> > > > in IO.
> > >
> > > Agreed. IIRC, in the beginning of KVM-RT we saw a problem where
> > > running vcpu0 on a non-isolated core and without FIFO priority
> > > caused spikes in vcpu1. I guess we debugged this down to vcpu1
> > > waiting a few dozen microseconds for vcpu0 for some reason.
> > > Running vcpu0 on an isolated core with FIFO priority fixed this
> > > (again, this was years ago, I don't remember all the details).
> > >
> > > > Or maybe a busy-polling vcpu0 starving its own emulator,
> > > > causing high latency or even deadlocks.
> > >
> > > This will probably happen if you run vcpu0 with FIFO priority.
> >
> > Two more points that indicate that hw:cpu_realtime_mask (putting
> > emulators/io next to any vcpu) does not work for general rt.
> >
> > > > Even if it happens to work for Linux guests it seems like a
> > > > strong assumption that an rt-guest that has noise cores can
> > > > deal with even more noise one scheduling level below.
> > > >
> > > > More recent proposals [2] suggest a scheme where the emulator
> > > > and io threads are on a separate core. That sounds more
> > > > reasonable/conservative but dramatically increases the per-VM
> > > > cost. And the pcpus hosting the hypervisor threads will
> > > > probably be idle most of the time.
> > >
> > > I don't know how to solve this problem. Maybe dedicating only
> > > one core to all emulator threads and io-threads of a VM would
> > > mitigate this? Of course we'd have to test it to see whether
> > > it gives spikes.
> >
> > [2] suggests exactly that, but it is a waste of pcpus. Say a vcpu
> > needs 1.0 cores and all other threads of a VM need 0.05 cores. The
> > real need of a 1-core rt-vm would be 1.05, of a 2-core one 2.05.
> > With [1] we pack 2.05 onto 2 pcpus, which does not work. With [2]
> > we need 3 pcpus and waste 0.95.
> >
> > > > I guess in this context the most important question is whether
> > > > qemu is ever involved in "regular operation" if you avoid the
> > > > obvious IO problems on your critical path.
> > > >
> > > > My guess is that [1] alone has serious hidden latency problems
> > > > and [2] is taking it a step too far by wasting whole cores on
> > > > idle emulators. We would like to suggest some other way in
> > > > between that is a little easier on the core count. Our
> > > > current solution seems to work fine but has the mentioned quota
> > > > problems.
> > >
> > > What is your solution?
> >
> > We have a kilo-based prototype that introduced emulator_pin_set in
> > nova.conf. All vcpu threads will be scheduled on vcpu_pin_set and
> > emulators and IO of all VMs will share emulator_pin_set.
> > vcpu_pin_set contains isolcpus from the host and emulator_pin_set
> > contains best-effort cores from the host.
> > That basically means you put all emulators and io of all VMs onto a
> > set of cores that the host potentially also uses for other stuff.
> > Sticking with the made-up numbers from above, all the 0.05s can
> > share pcpus.
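To give a rough idea, the nova.conf of such a compute node would
contain something like this (cpu numbers made up, emulator_pin_set
only exists in our prototype):

  vcpu_pin_set = 2-7
  emulator_pin_set = 0-1

where 2-7 are the isolcpus of the host and 0-1 are the best-effort
cores shared with host load.
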
> >
> > With the current implementation in mitaka (hw:cpu_realtime_mask) you
> > cannot have a single-core rt-vm because you cannot put 1.05 into 1
> > without overcommitting. You can put 2.05 into 2, but as you confirmed
> > the overcommitted core could still slow down the truly exclusive
> > one. On a 4-core host you get a maximum of 1 rt-VM (2-3 cores).
> >
> > With [2], which is not implemented yet, the overcommitting is
> > avoided. But now you waste a lot of pcpus: 1.05 needs 2, 2.05 needs
> > 3. On a 4-core host you get a maximum of 1 rt-VM (1-2 cores).
> >
> > With our approach it might be hard to account for emulator and
> > io-threads because they share pcpus. But you neither run into
> > overcommitting nor waste pcpus.
> > On a 4-core host you get a maximum of 3 rt-VMs (1 core each) or
> > 1 rt-VM (2-3 cores).
>
> I think your solution is good.
>
> In a Linux RT context, and as you mentioned, the non-RT vCPU can
> acquire some guest kernel lock and then be pre-empted by the emulator
> thread while holding this lock. This situation blocks the RT vCPUs
> from doing their work. That is why we have implemented [2]. For DPDK
> I don't think we have such problems because it's running in userland.
>
> So for the DPDK context I think we could have a mask like we have for
> RT and basically consider vCPU0 to handle the best-effort work
> (emulator threads, SSH...). I think that is the pattern currently
> used by DPDK users.
DPDK is just a library and one can imagine an application that has
cross-core communication/synchronisation needs, where the emulator
slowing down vcpu0 will also slow down vcpu1. Your DPDK application
would have to know which of its cores did not get a full pcpu.
I am not sure what the DPDK example is doing in this discussion; would
that not just be cpu_policy=dedicated? I guess the normal behaviour of
dedicated is that emulators and io happily share pCPUs with vCPUs, and
you are looking for a way to restrict emulators/io to a subset of pCPUs
because you can live with some of them not being 100%.
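I.e. something like the following should already cover the plain DPDK
case (flavor name made up):

  nova flavor-key dpdk.small set hw:cpu_policy=dedicated

which today lets the emulator and io threads float over all of the
guest's dedicated pCPUs.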
> For RT we have to isolate the emulator threads to an additional pCPU
> per guest or, as you are suggesting, to a set of pCPUs shared by all
> the running guests.
>
> I think we should introduce a new option:
>
> - hw:cpu_emulator_threads_mask=^1
>
> If set in nova.conf, that mask will be applied to the set of all host
> CPUs (vcpu_pin_set) to basically pack the emulator threads of all VMs
> running there (useful for the RT context).
That would allow modelling exactly what we need.
In nova.conf we are talking about absolute, known values; there is no
need for a mask, and a set is much easier to read. Also, using the
same name for both does not sound like a good idea.
And the name vcpu_pin_set clearly suggests what kind of load runs
there; if a mask is used, it should be called pin_set.
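To illustrate with made-up cpu numbers, and assuming the mask carves
the emulator cpus out of vcpu_pin_set, the mask variant would read

  vcpu_pin_set = 0-7
  cpu_emulator_threads_mask = ^0,^1

while an explicit set, as in our prototype, would read

  vcpu_pin_set = 2-7
  emulator_pin_set = 0-1

(cpu_emulator_threads_mask is only a placeholder spelling of your
proposed option; neither variant exists upstream today.)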
> If set in the flavor extra-specs, it will be applied to the vCPUs
> dedicated to the guest (useful for the DPDK context).
And if both are present the flavor wins and nova.conf is ignored?
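For completeness, the flavor variant would presumably be set like any
other extra spec (flavor name made up):

  nova flavor-key dpdk.small set hw:cpu_emulator_threads_mask=^1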
Henning
> s.
>
> > Henning
> >
> > > > With this mail I am hoping to collect some constraints to
> > > > derive a suggestion from, or maybe to collect some information
> > > > that could be added to the current blueprints as
> > > > reasoning/documentation.
> > > >
> > > > Sorry if you receive this mail a second time; I was not
> > > > subscribed to openstack-dev the first time.
> > > >
> > > > best regards,
> > > > Henning
> > > >
> > > > [1]
> > > > https://specs.openstack.org/openstack/nova-specs/specs/mitaka/implemented/libvirt-real-time.html
> > > > [2]
> > > > https://specs.openstack.org/openstack/nova-specs/specs/ocata/approved/libvirt-emulator-threads-policy.html
> > > > [3]
> > > > http://events.linuxfoundation.org/sites/events/files/slides/kvmforum2015-realtimekvm.pdf
> > > >
> > >