[openstack-dev] [nova][scheduling] Can VM placement consider the VM network traffic need?

Mooney, Sean K sean.k.mooney at intel.com
Tue Sep 5 12:15:42 UTC 2017


Interesting timing.
Would love to talk about this at the PTG.
Comments inline.
Regards
sean

> -----Original Message-----
> From: Balazs Gibizer [mailto:balazs.gibizer at ericsson.com]
> Sent: Tuesday, September 5, 2017 8:23 AM
> To: OpenStack Development Mailing List (not for usage questions)
> <openstack-dev at lists.openstack.org>
> Cc: Mooney, Sean K <sean.k.mooney at intel.com>; moshele at mellanox.com
> Subject: Re: [openstack-dev] [nova][scheduling] Can VM placement
> consider the VM network traffic need?
> 
> On Mon, Sep 4, 2017 at 9:11 PM, Jay Pipes <jaypipes at gmail.com> wrote:
> > On 09/01/2017 04:42 AM, Rua, Philippe (Nokia - FI/Espoo) wrote:
> > > Will it be possible to include network bandwidth as a resource in
> > Nova scheduling, for VM placement decision?
> >
> > Yes.
> >
> > See here for a related Neutron spec that mentions Placement:
> > https://review.openstack.org/#/c/396297/7/specs/pike/strict-minimum-bandwidth-support.rst
> >
> > > Context: in telecommunication applications, the network traffic is
> > an important dimension of resource usage. For example, it is often
> > important to distribute "bandwidth-greedy" VMs to different compute
> > nodes. There were some earlier discussions on this topic, but I could
> > not find a concrete outcome. [1][2][3]
> > >
> > > After some reading, I wonder whether the Custom resource classes
> > can provide a generic mechanism? [4][5][6]
> >
> > No :) Custom resource classes are antithetical to generic/standard
> > mechanisms.
> >
> > We want to add two *standard* resource classes, one called
> > NET_INGRESS_BYTES_SEC and another called NET_EGRESS_BYTES_SEC, which
> > would represent the total bandwidth in bytes per second for the
> > corresponding traffic directions.
> 
> While I agree that the end goal is to have standard resource classes
> for bandwidth I think custom resource classes are generic enough to
> model bandwidth resource. If you want to play with the bandwidth based
> scheduling idea based on Pike then custom resource classes are
> available as a tool for a proof of concept.
[Mooney, Sean K]
From a Queens perspective, Rodolfo is currently working on a spec
to introduce a standard bandwidth resource class and resource provider.
He has opened a blueprint to track this here:
https://blueprints.launchpad.net/nova/+spec/bandwidth-resource-provider
The scope we are currently proposing is an end-to-end minimum bandwidth
guarantee for SR-IOV interfaces. In this case the bandwidth resource
provider will be a child of the PF. This could also be extended to
vSwitches, but neither Linux bridge nor OVS can support multi-tenant
minimum bandwidth guarantees at present. So from a nova perspective,
while we can make sure we do not oversubscribe bandwidth for OVS, neutron
cannot enforce the minimum bandwidth allocation on the vSwitch. Hardware-offloaded
OVS may be able to provide a minimum bandwidth guarantee in the future, as might VPP.
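
To make the "bandwidth provider as a child of the PF" idea concrete, here is a
minimal sketch in plain Python. It is not placement or nova code: the Provider
class, the provider names and the 10 Gbit/s figure are all assumptions made up
for illustration. It only shows the accounting we want, i.e. a claim is
refused when it would oversubscribe the inventory.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    """Hypothetical stand-in for a placement resource provider."""
    name: str
    total: int = 0   # bandwidth inventory in bytes per second
    used: int = 0    # bandwidth already allocated to instances
    children: list = field(default_factory=list)

    def add_child(self, child):
        """Attach a child provider and return it."""
        self.children.append(child)
        return child

    def claim(self, amount):
        """Reserve bandwidth; refuse if it would oversubscribe."""
        if self.used + amount > self.total:
            return False
        self.used += amount
        return True


# Assumed tree: compute node -> SR-IOV PF -> bandwidth provider.
compute = Provider("compute-1")
pf = compute.add_child(Provider("pf-eth0"))
# 10 Gbit/s expressed in bytes per second (illustrative value).
bw = pf.add_child(Provider("pf-eth0-egress", total=1_250_000_000))

print(bw.claim(1_000_000_000))  # fits within the inventory
print(bw.claim(500_000_000))    # would oversubscribe, so it is refused
```

Enforcing the guarantee on the wire is a separate problem; the sketch covers
only the scheduling-side bookkeeping.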
> 
> >
> >
> > What would be the resource provider, though? There are at least two
> > potential answers here:
> >
> > 1) A network interface controller on the compute host
> >
> > In this case, the NIC on the host would be a child provider of the
> > compute host resource provider. It would have an inventory record of
> > resource class NET_INGRESS_BYTES_SEC with a total value representing
> > the entire bandwidth of the host NIC. Instances would consume some
> > amount of NET_INGRESS_BYTES_SEC corresponding to *either* the Nova
> > flavor (if the resources:NET_INGRESS_BYTES_SEC extra-spec is set) *or*
> > to the sum of consumed bandwidth amounts from the port profile of any
> > ports specified when launching the instance (and thus would be part of
> > the pci device request collection attached to the build request).
> >
> > 2) A "network slice" of a network interface controller on the compute
> > host
> >
> > In this case, assume that the NIC on the compute host has had its
> > total bandwidth constrained via traffic control so that 50% of its
> > available ingress bandwidth is allocated to network A and 50% is
> > allocated to network B.
> >
> > There would be multiple resource providers, each with an inventory
> > record of resource class NET_INGRESS_BYTES_SEC with a total value of
> > 1/2 the total NIC bandwidth. Both of these resource providers would be
> > child providers of the compute host resource provider. One of these
> > child resource providers will be decorated with the trait
> > "CUSTOM_NETWORK_A" and the other with the trait "CUSTOM_NETWORK_B".
> >
> > The scheduler would be able to determine which resource provider to
> > consume the NET_INGRESS_BYTES_SEC resources from by looking for a
> > resource provider that has both the required amount of
> > NET_INGRESS_BYTES_SEC as well as the trait required by the port
> > profile.
> > If, say, the port profile specifies that the port is to go on a NIC
> > with access to network "A", then the build request would contain a
> > request to the scheduler for CUSTOM_NETWORK_A trait...
> 
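
The provider selection described above for the "network slice" case can be
sketched roughly as follows. This is plain Python, not the scheduler or the
placement API; the dictionary layout, the provider names and the NIC size are
assumptions for illustration only. The resource class and trait names are the
ones used in the text.

```python
def pick_provider(providers, resource_class, amount, required_trait):
    """Return the name of the first provider that has enough free
    capacity of resource_class and carries required_trait, else None."""
    for provider in providers:
        inventory = provider["inventory"].get(resource_class)
        if inventory is None:
            continue
        free = inventory["total"] - inventory["used"]
        if free >= amount and required_trait in provider["traits"]:
            return provider["name"]
    return None


# Assumed NIC size in bytes per second, split 50/50 between two networks.
nic_total = 1_250_000_000
providers = [
    {
        "name": "nic-slice-a",
        "traits": {"CUSTOM_NETWORK_A"},
        "inventory": {
            "NET_INGRESS_BYTES_SEC": {"total": nic_total // 2, "used": 0},
        },
    },
    {
        "name": "nic-slice-b",
        "traits": {"CUSTOM_NETWORK_B"},
        "inventory": {
            "NET_INGRESS_BYTES_SEC": {"total": nic_total // 2, "used": 0},
        },
    },
]

# A port that must land on network "B" and needs 100 MB/s of ingress.
print(pick_provider(providers, "NET_INGRESS_BYTES_SEC",
                    100_000_000, "CUSTOM_NETWORK_B"))
```

A request larger than half the NIC finds no provider here, which is the
intended effect of splitting the inventory per network.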
> The above setup can be simulated with custom resource classes and
> individual resource providers per compute node connected to the given
> compute node's resource provider via an aggregate. You most probably
> need to simulate the above network traits with individual custom
> resource classes in Pike.
> 
> I definitely don't think it is something I would do in production based
> on Pike, for two reasons:
> 1) we have bugs in Pike GA that prevent nova from handling some edge cases
> (especially in VM moving scenarios)
> 2) I agree with Jay that nested providers and neutron support will
> allow us to do something much cleaner in the future.
> 
> However I think Pike is a good base to build a PoC and gather feedback.
> For example I already foresee a need to model OVS packet processing
> limits and in the long run even include the capacity of the TOR
> switches into the picture.
[Mooney, Sean K] Yes, modeling the capacity of ToR switches will eventually become
important, but I think we can make good progress in this activity by modeling just
the bandwidth from a server perspective. On the OVS side we have discussed internally several times
how to measure the switching capacity of OVS. In general (I am greatly simplifying
here) you can assume your vSwitch's internal switching capacity will exceed your external
bandwidth, so modeling the bandwidth of the external interface on a vSwitch is a good
first step.

I added a section to the nova PTG etherpad relating to modeling software load
in placement, specifically OVS. The rationale for this is that, quantitatively, OVS with DPDK only
supports 1024 vhost-user ports. We have never hit this limit in the past because running 1024
VMs or, best case, 64 VMs with 16 ports each on the same host is, simply put, an edge case.


Having an OVS resource provider would allow us to create child resource providers for bandwidth
and also allow us to describe the qualitative aspects as traits,
for example DPDK support, hardware offload, virtio feature flags... It turns out
the last point is rather important for live migration.

Currently I am leaning toward the view that neutron should be the one responsible for creating
the OVS resource provider and its associated child bandwidth providers and traits, as I
would prefer that nova not have to care too much about networking.

I am talking to Red Hat about work they are driving in OVS to address a live migration issue
where the VM crashes if the destination cannot support the features of the source host.
We could model the virtio features as traits on the host resource provider, but since they plan
to expose them via the OVSDB, an OVS resource provider would be a better fit.
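
As a rough sketch of the live migration check I have in mind: treat the
virtio features as traits and only allow a destination that advertises every
feature of the source. The trait names below are hypothetical and the code
is plain Python, not nova or placement code.

```python
def missing_features(source_traits, dest_traits):
    """Return the source features the destination does not advertise."""
    return sorted(set(source_traits) - set(dest_traits))


def can_live_migrate(source_traits, dest_traits):
    """A destination is acceptable only if no source feature is missing."""
    return not missing_features(source_traits, dest_traits)


# Hypothetical trait names, for illustration only.
source = {"CUSTOM_VIRTIO_MRG_RXBUF", "CUSTOM_VIRTIO_INDIRECT_DESC"}
dest_ok = source | {"CUSTOM_OVS_DPDK"}
dest_bad = {"CUSTOM_VIRTIO_MRG_RXBUF"}

print(can_live_migrate(source, dest_ok))
print(missing_features(source, dest_bad))
```

Whether the traits hang off the host provider or an OVS provider does not
change the check, only where the scheduler reads them from.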
> 
> >
> >
> > If you're coming to Denver, I encourage you to get with me, Sean
> > Mooney, Moshe Levi and others who are interested in seeing this work
> > move forward.
> 
> @Jay: sign me up for this list.
> 
> Cheers,
> gibi
> 
> >
> > Best,
> > -jay
> >
> > > Here is what I have in mind:
> > > - The VM need is specified in the flavor extra-specs, e.g.
> > resources:CUSTOM_BANDWIDTH=123.
> > > - The compute node total capacity is specified in host aggregate
> > metadata, e.g. CUSTOM_BANDWIDTH=999.
> > > - Nova then takes care of the rest: scheduling where the free
> > capacity is sufficient, and performing simple resource usage
> > accounting (updating the compute node free network bandwidth capacity
> > as required).
> > >
> > > Is the outline above according to current plans?
> > > If not, what would be possible/needed in order to achieve the same
> > result, i.e. consider the VM network traffic need during VM placement?
> > >
> > > BR,
> > > Philippe
> > >
> > > [1] https://blueprints.launchpad.net/nova/+spec/bandwidth-as-scheduler-metric
> > > [2] https://wiki.openstack.org/wiki/NetworkBandwidthEntitlement
> > > [3] https://openstack.nimeyo.com/80515/openstack-scheduling-bandwidth-resources-nic_bw_kb-resource
> > > [4] https://docs.openstack.org/nova/latest/user/placement.html
> > > [5] http://specs.openstack.org/openstack/nova-specs/priorities/pike-priorities.html#placement
> > > [6] https://review.openstack.org/#/c/473627/
> > >
> > >
> >
> ______________________________________________________________________
> > ____
> > > OpenStack Development Mailing List (not for usage questions)
> > > Unsubscribe:
> > OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> > > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> > >
> >
> >


