[openstack-dev] realtime kvm cpu affinities
henning.schild at siemens.com
Wed Jun 21 10:47:27 UTC 2017
On Tue, 20 Jun 2017 10:04:30 -0400
Luiz Capitulino <lcapitulino at redhat.com> wrote:
> On Tue, 20 Jun 2017 09:48:23 +0200
> Henning Schild <henning.schild at siemens.com> wrote:
> > Hi,
> > We are using OpenStack for managing realtime guests. We modified
> > it and contributed to discussions on how to model the realtime
> > feature. More recent versions of OpenStack have support for
> > realtime, and there are a few proposals on how to improve that
> > further.
> > But there is still no full answer on how to distribute threads
> > across host-cores. The vcpus are easy but for the emulation and
> > io-threads there are multiple options. I would like to collect the
> > constraints from a qemu/kvm perspective first, and then possibly
> > influence the OpenStack development.
> > I will put the summary/questions first, the text below provides more
> > context to where the questions come from.
> > - How do you distribute your threads when reaching the really low
> > cyclictest results in the guests? In [3] Rik talked about problems
> > like lock holder preemption, starvation etc. but not where/how to
> > schedule emulators and io.
> We put emulator threads and io-threads on housekeeping cores in
> the host. I think housekeeping cores are what you're calling
> best-effort cores, those are non-isolated cores that will run host
As expected, any best-effort/housekeeping core will do but overlap with
the vcpu-cores is a bad idea.
> > - Is it ok to put a vcpu and emulator thread on the same core as
> > long as the guest knows about it? Any funny behaving guest, not
> > just Linux.
> We can't do this for KVM-RT because we run all vcpu threads with
> FIFO priority.
Same point as above, meaning the "hw:cpu_realtime_mask" approach is
wrong for realtime.
> However, we have another project with DPDK whose goal is to achieve
> zero-loss networking. The configuration required by this project is
> very similar to the one required by KVM-RT. One difference though is
> that we don't use RT and hence don't use FIFO priority.
> In this project we've been running with the emulator thread and a
> vcpu sharing the same core. As long as the guest housekeeping CPUs
> are idle, we don't get any packet drops (most of the time, what
> causes packet drops in this test-case would cause spikes in
> cyclictest). However, we're seeing some packet drops for certain
> guest workloads which we are still debugging.
Ok but that seems to be a different scenario where hw:cpu_policy
dedicated should be sufficient. However if the placement of the io and
emulators has to be on a subset of the dedicated cpus something like
hw:cpu_realtime_mask would be required.
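
To make the mask idea concrete, here is a minimal Python sketch. It
assumes the exclusion form of the mask (e.g. "^0" takes vcpu0 out of the
realtime set). The helper name split_vcpus and the simplified parser are
made up for illustration and are not nova code.

```python
def split_vcpus(num_vcpus, mask):
    """Split guest vcpus into (realtime, non_realtime) sets.

    Simplified, hypothetical parser: every vcpu counts as realtime
    unless an item such as "^0" or "^0-1" excludes it.
    """
    realtime = set(range(num_vcpus))
    for item in mask.split(","):
        item = item.strip()
        if not item.startswith("^"):
            raise ValueError("only exclusion items are handled: %r" % item)
        lo, _, hi = item[1:].partition("-")
        first = int(lo)
        last = int(hi) if hi else first
        realtime -= set(range(first, last + 1))
    non_realtime = set(range(num_vcpus)) - realtime
    # Assumption of this sketch: the emulator and io threads need at
    # least one non-realtime vcpu to sit next to.
    if not realtime or not non_realtime:
        raise ValueError("need at least one realtime and one other vcpu")
    return realtime, non_realtime
```

For a 2-vcpu guest with mask "^0" this gives vcpu1 as the realtime vcpu
and vcpu0 as the one that shares its pcpu with the emulator and io
threads.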
> > - Is it ok to make the emulators potentially slow by running them on
> > busy best-effort cores, or will they quickly be on the critical
> > path if you do more than just cyclictest? - our experience says we
> > don't need them reactive even with rt-networking involved
> I believe it is ok.
> > Our goal is to reach a high packing density of realtime VMs. Our
> > pragmatic first choice was to run all non-vcpu-threads on a shared
> > set of pcpus where we also run best-effort VMs and host load.
> > Now the OpenStack guys are not too happy with that because that is
> > load outside the assigned resources, which leads to quota and
> > accounting problems.
> > So the current OpenStack model is to run those threads next to one
> > or more vcpu-threads.  You will need to remember that the vcpus
> > in question should not be your rt-cpus in the guest. I.e. if vcpu0
> > shares its pcpu with the hypervisor noise your preemptrt-guest
> > would use isolcpus=1.
> > Is that kind of sharing a pcpu really a good idea? I could imagine
> > things like smp housekeeping (cache invalidation etc.) to eventually
> > cause vcpu1 having to wait for the emulator stuck in IO.
> Agreed. IIRC, in the beginning of KVM-RT we saw a problem where
> running vcpu0 on a non-isolated core and without FIFO priority
> caused spikes in vcpu1. I guess we debugged this down to vcpu1
> waiting a few dozen microseconds for vcpu0 for some reason. Running
> vcpu0 on an isolated core with FIFO priority fixed this (again, this
> was years ago, I won't remember all the details).
> > Or maybe a busy polling vcpu0 starving its own emulator causing high
> > latency or even deadlocks.
> This will probably happen if you run vcpu0 with FIFO priority.
Two more points that indicate that hw:cpu_realtime_mask (putting
emulators/io next to any vcpu) does not work for general rt.
> > Even if it happens to work for Linux guests it seems like a strong
> > assumption that an rt-guest that has noise cores can deal with even
> > more noise one scheduling level below.
> > More recent proposals [2] suggest a scheme where the emulator and io
> > threads are on a separate core. That sounds more reasonable /
> > conservative but dramatically increases the per VM cost. And the
> > pcpus hosting the hypervisor threads will probably be idle most of
> > the time.
> I don't know how to solve this problem. Maybe dedicating only one
> core to all emulator threads and io-threads of a VM would mitigate
> this? Of course we'd have to test it to see if this doesn't give
[2] suggests exactly that but it is a waste of pcpus. Say a vcpu needs
1.0 cores and all other threads need 0.05 cores. The real need of a 1
core rt-vm would be 1.05; for two it would be 2.05.
With [1] we pack 2.05 onto 2 pcpus, which does not work. With [2] we
need 3 and waste 0.95.
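
The arithmetic can be written down as a small Python sketch. The 1.0 and
0.05 figures are the made-up numbers from this mail, and the function
name per_vm_cost is hypothetical.

```python
EMULATOR_LOAD = 0.05  # made-up load of all non-vcpu threads of one VM


def per_vm_cost(rt_vcpus, dedicated_emulator_core):
    """Return (real need, pcpus allocated, slack) for one rt-VM.

    A negative slack means the VM is overcommitted, a positive slack
    is pcpu capacity that stays idle.
    """
    need = rt_vcpus * 1.0 + EMULATOR_LOAD
    pcpus = rt_vcpus + (1 if dedicated_emulator_core else 0)
    return need, pcpus, pcpus - need


# Emulator placed next to a vcpu: 2.05 of load on 2 pcpus.
print(per_vm_cost(2, dedicated_emulator_core=False))
# Emulator on its own core: 3 pcpus for 2.05 of load.
print(per_vm_cost(2, dedicated_emulator_core=True))
```

The first case ends up slightly overcommitted, the second leaves almost
a whole pcpu idle.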
> > I guess in this context the most important question is whether qemu
> > is ever involved in "regular operation" if you avoid the obvious IO
> > problems on your critical path.
> > My guess is that just [1] has serious hidden latency problems and
> > [2] is taking it a step too far by wasting whole cores for idle
> > emulators. We would like to suggest some other way in between, that
> > is a little easier on the core count. Our current solution seems to
> > work fine but has the mentioned quota problems.
> What is your solution?
We have a kilo-based prototype that introduced emulator_pin_set in
nova.conf. All vcpu threads will be scheduled on vcpu_pin_set and
emulators and IO of all VMs will share emulator_pin_set.
vcpu_pin_set contains isolcpus from the host and emulator_pin_set
contains best-effort cores from the host.
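
As a rough illustration of the constraints on the two sets, here is a
minimal Python sketch. It is not code from our prototype; the helper
names parse_cpu_set and check_pin_sets are hypothetical, and the CPU
lists are assumed to use the usual "1-3,6" notation.

```python
def parse_cpu_set(spec):
    """Parse a CPU list such as "1-3,6" into a set of ints."""
    cpus = set()
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        lo, _, hi = item.partition("-")
        first = int(lo)
        last = int(hi) if hi else first
        cpus.update(range(first, last + 1))
    return cpus


def check_pin_sets(vcpu_pin_set, emulator_pin_set, host_isolcpus):
    """Return a list of problems with the given pinning configuration."""
    vcpus = parse_cpu_set(vcpu_pin_set)
    emulators = parse_cpu_set(emulator_pin_set)
    isolated = parse_cpu_set(host_isolcpus)
    problems = []
    if vcpus & emulators:
        problems.append("vcpu and emulator sets overlap: %s"
                        % sorted(vcpus & emulators))
    if not vcpus <= isolated:
        problems.append("vcpu cores not isolated on the host: %s"
                        % sorted(vcpus - isolated))
    if emulators & isolated:
        problems.append("emulator cores are isolated cores: %s"
                        % sorted(emulators & isolated))
    return problems
```

On a 4-core host with isolcpus=1-3, check_pin_sets("1-3", "0", "1-3")
reports no problems, while an emulator set of "0-1" would be flagged.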
That basically means you put all emulators and io of all VMs onto a set
of cores that the host potentially also uses for other stuff. Sticking
with the made up numbers from above, all the 0.05s can share pcpus.
With the current implementation in mitaka (hw:cpu_realtime_mask) you
can not have a single-core rt-vm because you can not put 1.05 into 1
without overcommitting. You can put 2.05 into 2 but as you confirmed
the overcommitted core could still slow down the truly exclusive one.
On a 4-core host you get a maximum of 1 rt-VM (2-3 cores).
With [2], which is not implemented yet, the overcommitting is avoided.
But now you waste a lot of pcpus: 1.05 rounds up to 2, 2.05 to 3.
On a 4-core host you get a maximum of 1 rt-VM (1-2 cores).
With our approach it might be hard to account for emulator and
io-threads because they share pcpus. But you do not run into
overcommitting and don't waste pcpus at the same time.
On a 4-core host you get a maximum of 3 rt-VMs (1 core), 1 rt-VMs (2-3
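
The VM counts for the three schemes can be reproduced with a short
Python sketch. The assumptions are mine and only there to make the
numbers explicit: one pcpu stays reserved for best-effort host load, and
each scheme is reduced to the pcpu footprint of its smallest possible
rt-VM. The scheme names and the function max_rt_vms are hypothetical.

```python
# Smallest pcpu footprint of one rt-VM under each scheme (assumed):
#   realtime_mask: 1 rt vcpu + 1 vcpu shared with emulator/io threads
#   emulator_core: 1 rt vcpu + 1 extra pcpu for emulator/io threads
#   shared_pool:   1 rt vcpu, emulator/io threads on the host core(s)
MIN_PCPUS_PER_VM = {
    "realtime_mask": 2,
    "emulator_core": 2,
    "shared_pool": 1,
}


def max_rt_vms(host_pcpus, scheme, host_reserved=1):
    """Return how many minimal rt-VMs fit on a host."""
    usable = host_pcpus - host_reserved
    return usable // MIN_PCPUS_PER_VM[scheme]


for scheme in ("realtime_mask", "emulator_core", "shared_pool"):
    print(scheme, max_rt_vms(4, scheme))
```

Under these assumptions a 4-core host fits 1, 1 and 3 minimal rt-VMs
respectively.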
> > With this mail I am hoping to collect some constraints to derive a
> > suggestion from. Or maybe collect some information that could be
> > added to the current blueprints as reasoning/documentation.
> > Sorry if you receive this mail a second time, I was not subscribed
> > to openstack-dev the first time.
> > best regards,
> > Henning
> > [1]
> > https://specs.openstack.org/openstack/nova-specs/specs/mitaka/implemented/libvirt-real-time.html
> > [2]
> > https://specs.openstack.org/openstack/nova-specs/specs/ocata/approved/libvirt-emulator-threads-policy.html
> > [3]
> > http://events.linuxfoundation.org/sites/events/files/slides/kvmforum2015-realtimekvm.pdf