[openstack-dev] realtime kvm cpu affinities

Henning Schild henning.schild at siemens.com
Thu Jun 29 10:51:15 UTC 2017


On Wed, 28 Jun 2017 11:34:42 +0200, Sahid Orentino Ferdjaoui
<sferdjao at redhat.com> wrote:

> On Tue, Jun 27, 2017 at 04:00:35PM +0200, Henning Schild wrote:
> > On Tue, 27 Jun 2017 09:44:22 +0200, Sahid Orentino Ferdjaoui
> > <sferdjao at redhat.com> wrote:
> >   
> > > On Mon, Jun 26, 2017 at 10:19:12AM +0200, Henning Schild wrote:  
> > > > On Sun, 25 Jun 2017 10:09:10 +0200, Sahid Orentino Ferdjaoui
> > > > <sferdjao at redhat.com> wrote:
> > > >     
> > > > > On Fri, Jun 23, 2017 at 10:34:26AM -0600, Chris Friesen
> > > > > wrote:    
> > > > > > On 06/23/2017 09:35 AM, Henning Schild wrote:      
> > > > > > > On Fri, 23 Jun 2017 11:11:10 +0200, Sahid Orentino Ferdjaoui
> > > > > > > <sferdjao at redhat.com> wrote:
> > > > > >       
> > > > > > > > In Linux RT context, and as you mentioned, the non-RT
> > > > > > > > vCPU can acquire some guest kernel lock, then be
> > > > > > > > pre-empted by emulator thread while holding this lock.
> > > > > > > > This situation blocks RT vCPUs from doing its work. So
> > > > > > > > that is why we have implemented [2]. For DPDK I don't
> > > > > > > > think we have such problems because it's running in
> > > > > > > > userland.
> > > > > > > > 
> > > > > > > > So for DPDK context I think we could have a mask like we
> > > > > > > > have for RT and basically considering vCPU0 to handle
> > > > > > > > best effort works (emulator threads, SSH...). I think
> > > > > > > > it's the current pattern used by DPDK users.      
> > > > > > > 
> > > > > > > DPDK is just a library and one can imagine an application
> > > > > > > that has cross-core communication/synchronisation needs
> > > > > > > where the emulator slowing down vcpu0 will also slow down
> > > > > > > vcpu1. Your DPDK application would have to know which of
> > > > > > > its cores did not get a full pCPU.
> > > > > > > 
> > > > > > > I am not sure what the DPDK example is doing in this
> > > > > > > discussion; would that not just be cpu_policy=dedicated? I
> > > > > > > guess the normal behaviour of dedicated is that emulators and
> > > > > > > io happily share pCPUs with vCPUs, and you are looking for a
> > > > > > > way to restrict emulators/io to a subset of pCPUs because you
> > > > > > > can live with some of them not being 100%.      
> > > > > > 
> > > > > > Yes.  A typical DPDK-using VM might look something like
> > > > > > this:
> > > > > > 
> > > > > > vCPU0: non-realtime, housekeeping and I/O, handles all virtual
> > > > > >        interrupts and "normal" linux stuff, emulator runs on the
> > > > > >        same pCPU
> > > > > > vCPU1: realtime, runs in tight loop in userspace processing packets
> > > > > > vCPU2: realtime, runs in tight loop in userspace processing packets
> > > > > > vCPU3: realtime, runs in tight loop in userspace processing packets
> > > > > > 
> > > > > > In this context, vCPUs 1-3 don't really ever enter the
> > > > > > kernel, and we've offloaded as much kernel work as possible
> > > > > > from them onto vCPU0.  This works pretty well with the
> > > > > > current system. 
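
For reference, a guest laid out like that can roughly be expressed today
with dedicated pinning plus the existing realtime extra specs, e.g.
(illustrative values only):

    hw:cpu_policy=dedicated
    hw:cpu_realtime=yes
    hw:cpu_realtime_mask=^0

whether the packet-processing vCPUs really need RT scheduling or just
dedicated pCPUs depends on the workload.
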
> > > > > > > > For RT we have to isolate the emulator threads to an
> > > > > > > > additional pCPU per guest or, as you are suggesting, to
> > > > > > > > a set of pCPUs for all the guests running.
> > > > > > > > 
> > > > > > > > I think we should introduce a new option:
> > > > > > > > 
> > > > > > > >    - hw:cpu_emulator_threads_mask=^1
> > > > > > > > 
> > > > > > > > If set in 'nova.conf', that mask will be applied to the
> > > > > > > > set of all host CPUs (vcpu_pin_set) to basically pack
> > > > > > > > the emulator threads of all VMs running there (useful
> > > > > > > > in an RT context).      
> > > > > > > 
> > > > > > > That would allow modelling exactly what we need.
> > > > > > > In nova.conf we are talking about absolute known values; no
> > > > > > > need for a mask, and a set is much easier to read. Also,
> > > > > > > using the same name does not sound like a good idea.
> > > > > > > And the name vcpu_pin_set clearly suggests what kind of
> > > > > > > load runs here; if using a mask it should be called
> > > > > > > pin_set.      
> > > > > > 
> > > > > > I agree with Henning.
> > > > > > 
> > > > > > In nova.conf we should just use a set, something like
> > > > > > "rt_emulator_vcpu_pin_set" which would be used for running
> > > > > > the emulator/io threads of *only* realtime instances.      
> > > > > 
> > > > > I don't agree with you: we have a set of pCPUs and we want to
> > > > > subtract some of them for the emulator threads. We need a
> > > > > mask. The only set we need is the one selecting which pCPUs
> > > > > Nova can use (vcpu_pin_set).    
> > > > 
> > > > At that point it does not really matter whether it is a set or a
> > > > mask. They can both express the same thing, but a set is easier
> > > > to read/configure. With the same argument you could say that
> > > > vcpu_pin_set should be a mask over the host's pCPUs.
> > > > 
> > > > As I said before: vcpu_pin_set should be renamed because all
> > > > sorts of threads are put here (pcpu_pin_set?). But that would
> > > > be a bigger change and should be discussed as a separate issue.
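
Just to make that equivalence concrete (values made up):

    vcpu_pin_set = 1-7                   pCPUs nova may use at all
    emulator mask "^1" on that set   ->  pCPU 1 is carved out for emulator/io
                                         threads, guest vCPUs are pinned to 2-7
    the same thing written as a set:     rt_emulator_vcpu_pin_set = 1

Both spellings carry the same information, the set is just easier to read
in a config file.
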
> > > > 
> > > > So far we talked about a compute node for realtime only doing
> > > > realtime. In that case vcpu_pin_set + emulator_io_mask would
> > > > work. If you want to run regular VMs on the same host, you can
> > > > run a second nova, like we do.
> > > > 
> > > > We could also use vcpu_pin_set + rt_vcpu_pin_set(/mask). I think
> > > > that would allow modelling all cases in just one nova. Having
> > > > all in one nova, you could potentially repurpose RT pCPUs to
> > > > best-effort use and back. Some day in the future ...    
> > > 
> > > That is not something we should allow, or at least
> > > advertise. A compute node can't run both RT and non-RT guests,
> > > because those nodes need an RT kernel. We can't
> > > guarantee RT if both are on the same nodes.  
> > 
> > An RT-capable kernel can run best-effort applications just fine, so
> > you can run regular and RT VMs on such a host. At the moment we use
> > two novas on one host, but are still having trouble configuring
> > that for mitaka.  
> 
> Sure, an RT kernel can run non-RT VMs, but you also have to configure
> the host: route the device interrupts to CPUs which are not part of
> the RT set, change the priority of the rcuc kernel threads, exclude
> the isolated CPUs from the writeback workqueue... and a bunch of other
> things where Nova does not have the scheduling granularity to take
> that into account.

Exactly, that is complex, and all pCPUs that you configured like that go
into your vcpu_pin_set_rt or into the vcpu_pin_set of your rt-nova.
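
For anyone trying to reproduce such a setup, the knobs meant above are
roughly the following (illustrative, not a complete recipe):

    /proc/irq/<N>/smp_affinity_list                keep device interrupts on
                                                   the housekeeping CPUs
    /sys/bus/workqueue/devices/writeback/cpumask   exclude the isolated CPUs
                                                   from writeback work
    rcuc/<N> kernel thread priorities              e.g. adjusted via chrt

plus the usual isolcpus/nohz_full-style isolation of the RT pCPUs on the
kernel command line.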

All the other pCPUs can be given to another nova, or both sets can be
configured in one. I would like to see both in one. In that case you
could even imagine doing all the tuning on demand and being dynamic
with the sets some day.

> So even if it is possible to spawn non-RT VMs I don't think we want
> to support such scenario.

I think we need to support a scenario where one machine hosts both
kinds of VMs. When thinking of RT OpenStack deployments you should not
just think big but also small: a handful of compute nodes, or even fewer.
Realtime means my compute is physically close to a physical process I
need to control. So not your big datacentre that is far away from
everywhere, but smallish compute racks distributed all over the place.

> > As far as I remember it was not straightforward to get two novas
> > onto one host in the older release, so I am not surprised that it is
> > causing trouble with the update to mitaka. If we agree on two novas
> > plus aggregates as the recommended way, we should make sure that
> > running two novas is a supported feature, covered by test cases and
> > documented. Dedicating a whole machine to either RT or non-RT would
> > imho be no viable option.
> >    
> > > The realtime nodes should be isolated by aggregates as you seem
> > > to do.  
> > 
> > Yes, with two novas on one machine. They share one libvirt using
> > different instance-prefixes and have some other config options
> > set, so they do not collide on resources.  
> 
> It's clearly not what I was suggesting; you should have two groups of
> compute hosts: one aggregate with hosts for the non-RT VMs and
> another one with hosts for the RT VMs.

Ok, but that is what people currently need to do if they have one
machine hosting both kinds of VMs. Which - I have to stress it again -
is an important use-case.
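
For illustration, the "two novas on one host" setup mentioned above
basically means two nova.conf files that differ in a few options so the
services do not step on each other, along these lines (values made up):

    host = compute1-rt                    vs.  host = compute1
    state_path = /var/lib/nova-rt         vs.  state_path = /var/lib/nova
    instance_name_template = rt-%08x      vs.  instance_name_template = instance-%08x
    vcpu_pin_set = 2-7                    vs.  vcpu_pin_set = 0-1

i.e. disjoint pCPU sets and non-colliding names/paths.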

Henning

> > > > > > We may also want to have "rt_emulator_overcommit_ratio" to
> > > > > > control how many threads/instances we allow per pCPU.      
> > > > > 
> > > > > I'm not really sure I understand this point. If it is to
> > > > > indicate that for an isolated pCPU we want X guest emulator
> > > > > threads, the same behaviour is achieved by the mask. A host for
> > > > > realtime is dedicated to realtime, no overcommitment, and the
> > > > > operators know the number of host CPUs, so they can easily
> > > > > deduce a ratio and from that the corresponding mask.    
> > > > 
> > > > Agreed.
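
For example, with vcpu_pin_set=2-23 and a mask carving two of those 22
pCPUs out for emulator threads, the operator has implicitly chosen about
ten vCPU pCPUs per emulator pCPU; the numbers are made up, but that is
the kind of deduction meant above.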
> > > >     
> > > > > > > > If set in the flavor extra-specs, it will be applied to
> > > > > > > > the vCPUs dedicated to the guest (useful in a DPDK
> > > > > > > > context).      
> > > > > > > 
> > > > > > > And if both are present the flavor wins and nova.conf is
> > > > > > > ignored?      
> > > > > > 
> > > > > > In the flavor I'd like to see it be a full bitmask, not an
> > > > > > exclusion mask with an implicit full set.  Thus the end-user
> > > > > > could specify "hw:cpu_emulator_threads_mask=0" and get the
> > > > > > emulator threads to run alongside vCPU0.      
> > > > > 
> > > > > Same here, I don't agree: the only set is the vCPUs of the
> > > > > guest. Then we want a mask to subtract some of them.    
> > > > 
> > > > The current mask is fine, but using the same name in nova.conf
> > > > and in the flavor does not seem like a good idea.    
> > > 
> > > I do not see any problem with that; only operators are going to
> > > set this option in nova.conf or the flavor extra-specs.
> > >
> > > I think we agree on the general aspects. I'm going to update
> > > the current spec for Q and see whether we can make it.  
> > 
> > Cool. In the meantime we are working on an implementation as a patch
> > on mitaka. Let's see if we hit unexpected cases we did not yet
> > consider.
> > 
> > Henning
> >    
> > > s.
> > >   
> > > > Henning
> > > >     
> > > > > > Henning, there is no conflict, the nova.conf setting and the
> > > > > > flavor setting are used for two different things.
> > > > > > 
> > > > > > Chris
> > > > > > 