[openstack-dev] realtime kvm cpu affinities

Sahid Orentino Ferdjaoui sferdjao at redhat.com
Fri Jun 23 09:11:10 UTC 2017


On Wed, Jun 21, 2017 at 12:47:27PM +0200, Henning Schild wrote:
> Am Tue, 20 Jun 2017 10:04:30 -0400
> schrieb Luiz Capitulino <lcapitulino at redhat.com>:
> 
> > On Tue, 20 Jun 2017 09:48:23 +0200
> > Henning Schild <henning.schild at siemens.com> wrote:
> > 
> > > Hi,
> > > 
> > > We are using OpenStack for managing realtime guests. We modified
> > > it and contributed to discussions on how to model the realtime
> > > feature. More recent versions of OpenStack have support for
> > > realtime, and there are a few proposals on how to improve that
> > > further.
> > > 
> > > But there is still no full answer on how to distribute threads
> > > across host-cores. The vcpus are easy, but for the emulation and
> > > io-threads there are multiple options. I would like to collect the
> > > constraints from a qemu/kvm perspective first, and then possibly
> > > influence the OpenStack development.
> > > 
> > > I will put the summary/questions first, the text below provides more
> > > context to where the questions come from.
> > > - How do you distribute your threads when reaching the really low
> > >   cyclictest results in the guests? In [3] Rik talked about problems
> > >   like lock holder preemption, starvation etc. but not where/how to
> > >   schedule emulators and io
> > 
> > We put emulator threads and io-threads in housekeeping cores in
> > the host. I think housekeeping cores is what you're calling
> > best-effort cores, those are non-isolated cores that will run host
> > load.
> 
> As expected, any best-effort/housekeeping core will do, but overlapping
> with the vcpu-cores is a bad idea.
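
To make that housekeeping split concrete, here is a minimal sketch of
doing the same pinning by hand through the libvirt-python bindings (the
host size, the core sets and the guest name "rt-guest" are made up for
illustration, and pinIOThread() assumes the guest actually defines an
iothread):

  import libvirt

  HOST_CPUS = 4                  # example host size
  ISOLATED = {2, 3}              # isolcpus on the host, reserved for vcpus
  HOUSEKEEPING = {0, 1}          # non-isolated cores for emulator/io threads

  def cpumap(cpus):
      # libvirt expects a tuple of booleans, one entry per host CPU
      return tuple(i in cpus for i in range(HOST_CPUS))

  conn = libvirt.open('qemu:///system')
  dom = conn.lookupByName('rt-guest')

  # one vcpu thread per isolated core
  for vcpu, pcpu in enumerate(sorted(ISOLATED)):
      dom.pinVcpu(vcpu, cpumap({pcpu}))

  # emulator thread and io-thread share the housekeeping cores
  dom.pinEmulator(cpumap(HOUSEKEEPING), libvirt.VIR_DOMAIN_AFFECT_LIVE)
  dom.pinIOThread(1, cpumap(HOUSEKEEPING), libvirt.VIR_DOMAIN_AFFECT_LIVE)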
> 
> > > - Is it ok to put a vcpu and emulator thread on the same core as
> > > long as the guest knows about it? Any funny behaving guest, not
> > > just Linux.  
> > 
> > We can't do this for KVM-RT because we run all vcpu threads with
> > FIFO priority.
> 
> Same point as above, meaning the "hw:cpu_realtime_mask" approach is
> wrong for realtime.
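
As a side note, giving an already-running vcpu thread SCHED_FIFO
priority from Python looks roughly like this (the thread id is made up
for illustration; in practice it would be obtained from libvirt/qemu):

  import os

  vcpu_tid = 12345   # hypothetical vcpu thread id from libvirt/qemu
  os.sched_setscheduler(vcpu_tid, os.SCHED_FIFO, os.sched_param(1))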
> 
> > However, we have another project with DPDK whose goal is to achieve
> > zero-loss networking. The configuration required by this project is
> > very similar to the one required by KVM-RT. One difference though is
> > that we don't use RT and hence don't use FIFO priority.
> > 
> > In this project we've been running with the emulator thread and a
> > vcpu sharing the same core. As long as the guest housekeeping CPUs
> > are idle, we don't get any packet drops (most of the time, what
> > causes packet drops in this test-case would cause spikes in
> > cyclictest). However, we're seeing some packet drops for certain
> > guest workloads which we are still debugging.
> 
> Ok, but that seems to be a different scenario where hw:cpu_policy=dedicated
> should be sufficient. However, if the io and emulator threads have to be
> placed on a subset of the dedicated cpus, something like
> hw:cpu_realtime_mask would be required.
> 
> > > - Is it ok to make the emulators potentially slow by running them on
> > >   busy best-effort cores, or will they quickly be on the critical
> > >   path if you do more than just cyclictest? Our experience says we
> > >   don't need them to be reactive, even with rt-networking involved.
> > 
> > I believe it is ok.
> 
> Ok.
>  
> > > Our goal is to reach a high packing density of realtime VMs. Our
> > > pragmatic first choice was to run all non-vcpu-threads on a shared
> > > set of pcpus where we also run best-effort VMs and host load.
> > > Now the OpenStack guys are not too happy with that because that is
> > > load outside the assigned resources, which leads to quota and
> > > accounting problems.
> > > 
> > > So the current OpenStack model is to run those threads next to one
> > > or more vcpu-threads. [1] You will need to remember that the vcpus
> > > in question should not be your rt-cpus in the guest. I.e. if vcpu0
> > > shares its pcpu with the hypervisor noise, your preempt-rt guest
> > > would use isolcpus=1.
> > > 
> > > Is that kind of pcpu sharing really a good idea? I could imagine
> > > things like smp housekeeping (cache invalidation etc.) eventually
> > > causing vcpu1 to have to wait for an emulator stuck in IO.
> > 
> > Agreed. IIRC, in the beginning of KVM-RT we saw a problem where
> > running vcpu0 on a non-isolated core and without FIFO priority
> > caused spikes in vcpu1. I guess we debugged this down to vcpu1
> > waiting a few dozen microseconds for vcpu0 for some reason. Running
> > vcpu0 on an isolated core with FIFO priority fixed this (again, this
> > was years ago, I don't remember all the details).
> > 
> > > Or maybe a busy-polling vcpu0 starving its own emulator, causing high
> > > latency or even deadlocks.
> > 
> > This will probably happen if you run vcpu0 with FIFO priority.
> 
> Two more points that indicate that hw:cpu_realtime_mask (putting
> emulators/io next to any vcpu) does not work for general rt.
> 
> > > Even if it happens to work for Linux guests it seems like a strong
> > > assumption that an rt-guest that has noise cores can deal with even
> > > more noise one scheduling level below.
> > > 
> > > More recent proposals [2] suggest a scheme where the emulator and io
> > > threads are on a separate core. That sounds more reasonable /
> > > conservative but dramatically increases the per VM cost. And the
> > > pcpus hosting the hypervisor threads will probably be idle most of
> > > the time.  
> > 
> > I don't know how to solve this problem. Maybe dedicating only one
> > core to all emulator threads and io-threads of a VM would mitigate
> > this? Of course we'd have to test it to see whether this gives
> > spikes.
> 
> [2] suggests exactly that, but it is a waste of pcpus. Say a vcpu needs
> 1.0 cores and all other threads need 0.05 cores. The real need of a
> 1-core rt-vm would be 1.05; for two cores it would be 2.05.
> With [1] we pack 2.05 onto 2 pcpus, which does not work. With [2] we
> need 3 and waste 0.95.
> 
> > > I guess in this context the most important question is whether qemu
> > > is ever involved in "regular operation" if you avoid the obvious IO
> > > problems on your critical path.
> > > 
> > > My guess is that just [1] has serious hidden latency problems and
> > > [2] is taking it a step too far by wasting whole cores for idle
> > > emulators. We would like to suggest something in between that
> > > is a little easier on the core count. Our current solution seems to
> > > work fine but has the mentioned quota problems.  
> > 
> > What is your solution?
> 
> We have a kilo-based prototype that introduced emulator_pin_set in
> nova.conf. All vcpu threads will be scheduled on vcpu_pin_set and
> emulators and IO of all VMs will share emulator_pin_set.
> vcpu_pin_set contains isolcpus from the host and emulator_pin_set
> contains best-effort cores from the host.
> That basically means you put all emulators and io of all VMs onto a set
> of cores that the host potentially also uses for other stuff. Sticking
> with the made up numbers from above, all the 0.05s can share pcpus.
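
To illustrate the semantics of that prototype, here is a small sketch
(the option names are the ones described above, the CPU numbers are
made up, and the parser is a simplified stand-in for Nova's usual
"^"-exclusion CPU-spec syntax):

  def parse_cpu_set(spec):
      # parse a Nova-style spec such as "2-7,^5" into a set of pCPU ids
      include, exclude = set(), set()
      for part in spec.split(','):
          part = part.strip()
          target = exclude if part.startswith('^') else include
          part = part.lstrip('^')
          if '-' in part:
              lo, hi = part.split('-')
              target.update(range(int(lo), int(hi) + 1))
          else:
              target.add(int(part))
      return include - exclude

  # example nova.conf values for the prototype described above
  vcpu_pin_set = parse_cpu_set("2-7")       # isolcpus on the host, vcpus only
  emulator_pin_set = parse_cpu_set("0-1")   # best-effort/housekeeping cores

  # the emulator and io threads of *all* VMs land on emulator_pin_set,
  # i.e. all the per-VM "0.05 cores" of noise share pCPUs 0 and 1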
> 
> With the current implementation in mitaka (hw:cpu_realtime_mask) you
> cannot have a single-core rt-vm because you cannot put 1.05 into 1
> without overcommitting. You can put 2.05 into 2, but as you confirmed,
> the overcommitted core could still slow down the truly exclusive one.
> On a 4-core host you get a maximum of 1 rt-VM (2-3 cores).
> 
> With [2], which is not implemented yet, the overcommitting is avoided.
> But now you waste a lot of pcpus: 1.05 needs 2 pcpus, 2.05 needs 3.
> On a 4-core host you get a maximum of 1 rt-VM (1-2 cores).
> 
> With our approach it might be hard to account for emulator and
> io-threads because they share pcpus. But you neither run into
> overcommitting nor waste pcpus.
> On a 4-core host you get a maximum of 3 rt-VMs (1 core each) or 1 rt-VM
> (2-3 cores).

I think your solution is good.

In the Linux RT context, and as you mentioned, a non-RT vCPU can
acquire a guest kernel lock and then be preempted by the emulator
thread while holding that lock. This situation blocks the RT vCPUs
from doing their work. That is why we have implemented [2]. For DPDK I
don't think we have such problems because it runs in userland.

So for the DPDK context I think we could have a mask like we have for
RT, basically letting vCPU0 handle the best-effort work (emulator
threads, SSH...). I think that is the current pattern used by DPDK
users.

For RT we have to isolate the emulator threads to an additional pCPU
per guest or, as you are suggesting, to a set of pCPUs shared by all
running guests.

I think we should introduce a new option:

  - hw:cpu_emulator_threads_mask=^1

If set in 'nova.conf', that mask will be applied to the set of all host
CPUs (vcpu_pin_set) to basically pack the emulator threads of all the
VMs running there (useful for the RT context).

If set in the flavor extra specs, it will be applied to the vCPUs
dedicated to the guest (useful for the DPDK context).
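
A rough sketch of those two interpretations (illustration only; the
option is just the proposal above and nothing here is implemented, and
the CPU numbers are made up):

  def apply_exclusion_mask(base_cpus, mask):
      # drop the CPUs named in a "^"-style mask such as "^1" from a base
      # set; one possible reading of the proposed option, not existing
      # Nova behaviour
      excluded = {int(tok.lstrip('^'))
                  for tok in mask.split(',') if tok.startswith('^')}
      return sorted(set(base_cpus) - excluded)

  mask = "^1"   # hw:cpu_emulator_threads_mask=^1

  # set in nova.conf: applied against the host CPUs in vcpu_pin_set to
  # decide where the emulator/io threads of all VMs are packed (RT case)
  print(apply_exclusion_mask({0, 1, 2, 3}, mask))   # [0, 2, 3]

  # set in the flavor extra specs: applied against the guest's dedicated
  # vCPUs instead, e.g. keeping the emulator threads next to vCPU0 only
  # (DPDK case)
  print(apply_exclusion_mask({0, 1}, mask))         # [0]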

s.

> Henning
> 
> > > With this mail I am hoping to collect some constraints to derive a
> > > suggestion from, or maybe collect some information that could be
> > > added to the current blueprints as reasoning/documentation.
> > > 
> > > Sorry if you receive this mail a second time, I was not subscribed
> > > to openstack-dev the first time.
> > > 
> > > best regards,
> > > Henning
> > > 
> > > [1]
> > > https://specs.openstack.org/openstack/nova-specs/specs/mitaka/implemented/libvirt-real-time.html
> > > [2]
> > > https://specs.openstack.org/openstack/nova-specs/specs/ocata/approved/libvirt-emulator-threads-policy.html
> > > [3]
> > > http://events.linuxfoundation.org/sites/events/files/slides/kvmforum2015-realtimekvm.pdf
> > >   
> > 


