[openstack-dev] realtime kvm cpu affinities

Henning Schild henning.schild at siemens.com
Tue Jun 27 14:00:35 UTC 2017


On Tue, 27 Jun 2017 09:44:22 +0200,
Sahid Orentino Ferdjaoui <sferdjao at redhat.com> wrote:

> On Mon, Jun 26, 2017 at 10:19:12AM +0200, Henning Schild wrote:
> > On Sun, 25 Jun 2017 10:09:10 +0200,
> > Sahid Orentino Ferdjaoui <sferdjao at redhat.com> wrote:
> >   
> > > On Fri, Jun 23, 2017 at 10:34:26AM -0600, Chris Friesen wrote:  
> > > > On 06/23/2017 09:35 AM, Henning Schild wrote:    
> > > > > On Fri, 23 Jun 2017 11:11:10 +0200,
> > > > > Sahid Orentino Ferdjaoui <sferdjao at redhat.com> wrote:
> > > >     
> > > > > > In a Linux RT context, and as you mentioned, the non-RT vCPU
> > > > > > can acquire some guest kernel lock and then be pre-empted by
> > > > > > the emulator thread while holding this lock. This situation
> > > > > > blocks the RT vCPUs from doing their work. That is why we have
> > > > > > implemented [2]. For DPDK I don't think we have such
> > > > > > problems because it's running in userland.
> > > > > > 
> > > > > > So in a DPDK context I think we could have a mask like we
> > > > > > have for RT, basically considering vCPU0 to handle best-effort
> > > > > > work (emulator threads, SSH...). I think that is the
> > > > > > current pattern used by DPDK users.
> > > > > 
> > > > > DPDK is just a library and one can imagine an application
> > > > > that has cross-core communication/synchronisation needs where
> > > > > the emulator slowing down vCPU0 will also slow down vCPU1. Your
> > > > > DPDK application would have to know which of its cores did
> > > > > not get a full pCPU.
> > > > > 
> > > > > I am not sure what the DPDK example is doing in this
> > > > > discussion, would that not just be cpu_policy=dedicated? I
> > > > > guess the normal behaviour of dedicated is that emulators and IO
> > > > > happily share pCPUs with vCPUs, and you are looking for a way
> > > > > to restrict emulators/IO to a subset of pCPUs because you can
> > > > > live with some of them not being 100%.
> > > > 
> > > > Yes.  A typical DPDK-using VM might look something like this:
> > > > 
> > > > vCPU0: non-realtime, housekeeping and I/O, handles all virtual
> > > >        interrupts and "normal" linux stuff, emulator runs on same pCPU
> > > > vCPU1: realtime, runs in tight loop in userspace processing packets
> > > > vCPU2: realtime, runs in tight loop in userspace processing packets
> > > > vCPU3: realtime, runs in tight loop in userspace processing packets
> > > > 
> > > > In this context, vCPUs 1-3 don't really ever enter the kernel,
> > > > and we've offloaded as much kernel work as possible from them
> > > > onto vCPU0.  This works pretty well with the current system.
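
(Side note for anyone following along: a VM laid out like that can already be
expressed with the existing dedicated/realtime flavor properties, roughly like
the sketch below. The flavor name and sizes are made up, and
hw:cpu_realtime_mask is the option that exists today, not the new one being
discussed in this thread.)

  # purely illustrative flavor for the layout above
  openstack flavor create dpdk.small --vcpus 4 --ram 4096 --disk 10
  openstack flavor set dpdk.small \
      --property hw:cpu_policy=dedicated \
      --property hw:cpu_realtime=yes \
      --property hw:cpu_realtime_mask='^0'   # vCPU0 stays non-realtime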
> > > >     
> > > > > > For RT we have to isolate the emulator threads to an
> > > > > > additional pCPU per guest or, as you are suggesting, to a
> > > > > > set of pCPUs for all the guests running.
> > > > > > 
> > > > > > I think we should introduce a new option:
> > > > > > 
> > > > > >    - hw:cpu_emulator_threads_mask=^1
> > > > > > 
> > > > > > If set in 'nova.conf', that mask will be applied to the set of
> > > > > > all host CPUs (vcpu_pin_set) to basically pack the emulator
> > > > > > threads of all VMs running there (useful in an RT context).
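
(So that we are all picturing the same thing, here is how I read the
nova.conf variant. Nothing of this exists today; the option name, minus the
flavor-style hw: prefix, is the one proposed above, and the values are made
up.)

  # nova.conf on an RT compute node -- proposed option, one possible reading
  [DEFAULT]
  vcpu_pin_set = 4-15
  # "^4,^5" would take pCPUs 4 and 5 away from guest vCPU placement and pack
  # the emulator threads of all RT guests onto them, leaving 6-15 for vCPUs
  cpu_emulator_threads_mask = ^4,^5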
> > > > > 
> > > > > That would allow modelling exactly what we need.
> > > > > In nova.conf we are talking about absolute, known values; there
> > > > > is no need for a mask, and a set is much easier to read. Also,
> > > > > using the same name does not sound like a good idea.
> > > > > And the name vcpu_pin_set clearly suggests what kind of load
> > > > > runs there; if using a mask it should be called pin_set.
> > > > 
> > > > I agree with Henning.
> > > > 
> > > > In nova.conf we should just use a set, something like
> > > > "rt_emulator_vcpu_pin_set", which would be used for running the
> > > > emulator/IO threads of *only* realtime instances.
> > > 
> > > I don't agree with you: we have a set of pCPUs and we want to
> > > subtract some of them for the emulator threads. We need a mask.
> > > The only set we need is the one selecting which pCPUs Nova can use
> > > (vcpu_pin_set).
> > 
> > At that point it does not really matter whether it is a set or a
> > mask. They can both express the same thing, and a set is easier to
> > read/configure. With the same argument you could say that
> > vcpu_pin_set should be a mask over the host's pCPUs.
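
(Tiny example with made-up numbers, just to illustrate that the two spellings
pick out exactly the same pCPUs; the option names below are placeholders, not
proposals.)

  vcpu_pin_set = 4-15       # existing option: pCPUs Nova may use at all
  some_option_mask = ^4,^5  # exclusion mask applied to vcpu_pin_set -> 6-15
  some_option_set  = 6-15   # the very same pCPUs written as an explicit set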
> > 
> > As I said before: vcpu_pin_set should be renamed because all sorts
> > of threads are put there (pcpu_pin_set?). But that would be a bigger
> > change and should be discussed as a separate issue.
> > 
> > So far we have talked about a realtime compute-node doing only
> > realtime. In that case vcpu_pin_set + emulator_io_mask would work.
> > If you want to run regular VMs on the same host, you can run a
> > second nova, like we do.
> > 
> > We could also use vcpu_pin_set + rt_vcpu_pin_set(/mask). I think
> > that would allow modelling all cases in just one nova. Having all
> > in one nova, you could potentially repurpose RT CPUs as best-effort
> > and back. Some day in the future ...
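
(For completeness, the mixed-host variant I have in mind would look roughly
like this; the rt_* option is a hypothetical name from this thread, nothing
of this exists.)

  # nova.conf on a mixed RT / best-effort compute node -- hypothetical sketch
  [DEFAULT]
  vcpu_pin_set = 2-15     # everything Nova may use (existing option)
  rt_vcpu_pin_set = 8-15  # subset reserved for RT guest vCPUs
  # the remaining pCPUs 2-7 would carry best-effort guests plus emulator
  # and I/O threads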
> 
> That is not something we should allow, or at least not
> advertise. A compute-node can't run both RT and non-RT guests,
> because the nodes should have an RT kernel. We can't guarantee RT if
> both are on the same nodes.

An RT-capable kernel can run best-effort applications just fine, so you
can run regular and RT VMs on such a host. At the moment we use two
novas on one host, but are still having trouble configuring that for
Mitaka.
As far as I remember it was not straightforward to get two novas onto
one host in the older release, so I am not surprised that it is causing
trouble with the update to Mitaka. If we agree on two novas and
aggregates as the recommended way, we should make sure that running two
novas is a supported feature, covered in test cases and documented.
Dedicating a whole machine to either RT or non-RT would IMHO not be a
viable option.
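
(For reference, the aggregate part already works today and would look roughly
like the following; the names are made up and the exact CLI may differ per
release.)

  # host aggregate for the RT nodes, matched by the RT flavors
  openstack aggregate create rt-hosts
  openstack aggregate set --property realtime=true rt-hosts
  openstack aggregate add host rt-hosts compute-rt-01
  openstack flavor set rt.small \
      --property aggregate_instance_extra_specs:realtime=true
  # needs AggregateInstanceExtraSpecsFilter enabled in the scheduler filters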
 
> The realtime nodes should be isolated by aggregates as you seem to do.

Yes, with two novas on one machine. They share one libvirt using
different instance prefixes and have some other config options set so
that they do not collide on resources.
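
To give an idea of what that looks like, the second nova-compute mainly
differs in a handful of nova.conf options, roughly like this (values made up,
sketch only):

  # nova.conf of the second (RT) nova-compute on the same machine
  [DEFAULT]
  host = compute01-rt                        # distinct hostname towards the scheduler
  instance_name_template = rt-instance-%08x  # distinct libvirt domain prefix
  state_path = /var/lib/nova-rt              # separate state/instances directory
  vcpu_pin_set = 8-15                        # pCPUs disjoint from the first nova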

> > > > We may also want to have "rt_emulator_overcommit_ratio" to
> > > > control how many threads/instances we allow per pCPU.    
> > > 
> > > Not really sure I have understood this point. If it is to
> > > indicate that for an isolated pCPU we want X guest emulator
> > > threads, the same behaviour is achieved by the mask. A host for
> > > realtime is dedicated to realtime, with no overcommitment, and the
> > > operators know the number of host CPUs, so they can easily deduce a
> > > ratio and hence the corresponding mask.
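
(A quick example of that arithmetic, just to check we mean the same thing:
with a mask carving two pCPUs out of vcpu_pin_set for emulator threads and
eight RT guests on the host, each of those two pCPUs ends up with the
emulator threads of four guests, i.e. an implicit "ratio" of 4.)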
> > 
> > Agreed.
> >   
> > > > > > If set in the flavor extra-specs, it will be applied to the vCPUs
> > > > > > dedicated to the guest (useful in a DPDK context).
> > > > > 
> > > > > And if both are present the flavor wins and nova.conf is
> > > > > ignored?    
> > > > 
> > > > In the flavor I'd like to see it be a full bitmask, not an
> > > > exclusion mask with an implicit full set.  Thus the end-user
> > > > could specify "hw:cpu_emulator_threads_mask=0" and get the
> > > > emulator threads to run alongside vCPU0.    
> > > 
> > > Same here, I don't agree: the only set is the vCPUs of the guest.
> > > Then we want a mask to subtract some of them.
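
(To spell out the difference with a concrete example, purely illustrative and
assuming I read both positions right: for a 4-vCPU guest that should have its
emulator threads next to vCPU0, the two conventions would be written as
follows.)

  # inclusion bitmask (Chris): list the vCPUs whose pCPU may run emulator threads
  hw:cpu_emulator_threads_mask=0
  # exclusion mask over the guest's vCPUs (Sahid): subtract the vCPUs to keep clean
  hw:cpu_emulator_threads_mask=^1,^2,^3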
> > 
> > The current mask is fine, but using the same name in nova.conf and
> > in the flavor does not seem like a good idea.  
> 
> I do not see any problem with that, only operators are going to set
> this option in nova.conf or the flavor extra-specs.
>
> I think we agree on the general approach. I'm going to update the
> current spec for Q and see whether we can make it.

Cool. In the meantime we are working on an implementation as a patch on
Mitaka. Let's see if we hit unexpected cases we have not yet considered.

Henning
 
> s.
> 
> > Henning
> >   
> > > > Henning, there is no conflict, the nova.conf setting and the
> > > > flavor setting are used for two different things.
> > > > 
> > > > Chris
> > > > 



