[openstack-dev] vGPUs support for Nova - Implementation
Sahid Orentino Ferdjaoui
sferdjao at redhat.com
Mon Oct 2 09:43:55 UTC 2017
On Fri, Sep 29, 2017 at 11:16:43AM -0400, Jay Pipes wrote:
> Hi Sahid, comments inline. :)
>
> On 09/29/2017 04:53 AM, Sahid Orentino Ferdjaoui wrote:
> > On Thu, Sep 28, 2017 at 05:06:16PM -0400, Jay Pipes wrote:
> > > On 09/28/2017 11:37 AM, Sahid Orentino Ferdjaoui wrote:
> > > > Please consider the support of MDEV for the /pci framework which
> > > > provides support for vGPUs [0].
> > > >
> > > > Accordingly to the discussion [1]
> > > >
> > > > With this first implementation which could be used as a skeleton for
> > > > implementing PCI Devices in Resource Tracker
> > >
> > > I'm not entirely sure what you're referring to above as "implementing PCI
> > > devices in Resource Tracker". Could you elaborate? The resource tracker
> > > already embeds a PciManager object that manages PCI devices, as you know.
> > > Perhaps you meant "implement PCI devices as Resource Providers"?
> >
> > A PciManager? I know that we have a field PCI_DEVICE :) - I guess a
> > virt driver can return inventory with total of PCI devices. Talking
> > about manager, not sure.
>
> I'm referring to this:
>
> https://github.com/openstack/nova/blob/master/nova/pci/manager.py#L33
>
> [SNIP]
>
> It is that piece that Eric and myself have been talking about standardizing
> into a "generic device management" interface that would have an
> update_inventory() method that accepts a ProviderTree object [1]
Jay, all of that looks perfectly sane to me, even if it's not clear
what you want to make so generic. That part of the code is for the
virt layers, and you can't treat GPU or NET devices as one generic
piece: they have characteristics which are requirements for the virt
layers.
In the 'update_inventory(provider_tree)' method which you are going
to introduce for /pci/PciManager, a first step would be to convert the
objects to a dict that the whole logic can understand, right? Or do
you have another plan?
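To make the question concrete, here is a minimal sketch of the kind of
conversion I have in mind. Everything in it is hypothetical: FakeDevice,
to_inventory() and the type-to-resource-class mapping are illustration
only and are not Nova's objects or API. Note that the flattened dict
keeps only totals and drops per-device characteristics such as the NUMA
node, which is exactly my concern.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeDevice:
    """Hypothetical stand-in for a tracked device; not Nova's PciDevice."""
    address: str
    dev_type: str   # assumed labels, e.g. "type-VF" or "mdev"
    numa_node: int


# Assumed mapping from device type to a resource class name.
RESOURCE_CLASS_BY_TYPE = {
    "type-VF": "SRIOV_NET",
    "mdev": "MDEV_GPU",
}


def to_inventory(devices):
    """Flatten device objects into {resource_class: {"total": n}}."""
    totals = Counter(
        RESOURCE_CLASS_BY_TYPE[d.dev_type]
        for d in devices
        if d.dev_type in RESOURCE_CLASS_BY_TYPE
    )
    return {rc: {"total": n} for rc, n in sorted(totals.items())}


devices = [
    FakeDevice("0000:81:00.1", "type-VF", 1),
    FakeDevice("0000:81:00.2", "type-VF", 1),
    FakeDevice("mdev-0", "mdev", 0),
]
inventory = to_inventory(devices)
```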
In any case, from my POV I don't see any blocker; both efforts can
co-exist without any pain. Adding features to the current /pci
module is not going to add much work, and it is going to give us a
clear view of what is needed.
> [1]
> https://github.com/openstack/nova/blob/master/nova/compute/provider_tree.py
>
> and would add resource providers corresponding to devices that are made
> available to guests for use.
>
> > You still have to define "traits", basically for physical network
> > devices, the users want to select device according physical network,
> > to select device according the placement on host (NUMA), to select the
> > device according the bandwidth capability... For GPU it's same
> > story. *And I do not have mentioned devices which support virtual
> > functions.*
>
> Yes, the generic device manager would be responsible for associating traits
> to the resource providers it adds to the ProviderTree provided to it in the
> update_inventory() call.
>
> > So that is what you plan to do for this release :) - Reasonably I
> > don't think we are close to have something ready for production.
>
> I don't disagree with you that this is a huge amount of refactoring to
> undertake over the next couple releases. :)
Yes, and that is the point. We are going to block work on the /pci
module during a period when we see a large interest in such
support.
> > Jay, I have question, Why you don't start by exposing NUMA ?
>
> I believe you're asking here why we don't start by modeling NUMA nodes as
> child resource providers of the compute node? Instead of starting by
> modeling PCI devices as child providers of the compute node? If that's not
> what you're asking, please do clarify...
>
> We're starting with modeling PCI devices as child providers of the compute
> node because they are easier to deal with as a whole than NUMA nodes and we
> have the potential of being able to remove the PciPassthroughFilter from the
> scheduler in Queens.
>
> I don't see us being able to remove the NUMATopologyFilter from the
> scheduler in Queens because of the complexity involved in how coupled the
> NUMA topology resource handling is to CPU pinning, huge page support, and IO
> emulation thread pinning.
>
> Hope that answers that question; again, lemme know if that's not the
> question you were asking! :)
Yes, that was the question and you answered it perfectly, thanks. I
will try to be clearer in the future :)
As you have noticed, NUMA support will be quite difficult and it is
not on the TODO list right now, which makes me think that we are going
to block development on the pci module and, on top of that, provide
less support in the end (no NUMA awareness). Is that reasonable?
> > > For the record, I have zero confidence in any existing "functional" tests
> > > for NUMA, SR-IOV, CPU pinning, huge pages, and the like. Unfortunately, due
> > > to the fact that these features often require hardware that either the
> > > upstream community CI lacks or that depends on libraries, drivers and kernel
> > > versions that really aren't available to non-bleeding edge users (or users
> > > with very deep pockets).
> >
> > It's good point, if you are not confidence, don't you think it's
> > premature to move forward on implementing new thing without to have
> > well trusted functional tests?
>
> Completely agree with you. I would rather see functional integration tests
> that are proven to actually test these complex hardware devices *gating*
> Nova patches before adding any new functionality to Nova.
I plan to rewrite a bit of the work initiated by Vladik (thanks to
him), even though I think those tests exercise the complexity well.
> We're adding lots of functional tests of the placement and resource
> providers modeling. I could definitely use some assistance from folks with
> access to this specialized hardware to set up and maintain the CI systems
> that can provide they are actually exercising these code paths.
+1
> > > > * The Usage
> > > >
> > > > There are no difference between SR-IOV and MDEV, from operators point
> > > > of view who knows how to expose SR-IOV devices in Nova, they already
> > > > know how to expose MDEV devices (vGPUs).
> > > >
> > > > Operators will be able to expose MDEV devices in the same manner as
> > > > they expose SR-IOV:
> > > >
> > > > 1/ Configure whitelist devices
> > > >
> > > > ['{"vendor_id":"10de"}']
> > > >
> > > > 2/ Create aliases
> > > >
> > > > [{"vendor_id":"10de", "name":"vGPU"}]
> > > >
> > > > 3/ Configure the flavor
> > > >
> > > > openstack flavor set --property "pci_passthrough:alias"="vGPU:1"
> > > >
> > > > * Limitations
> > > >
> > > > The mdev does not provide 'product_id' but 'mdev_type' which should be
> > > > considered to exactly identify which resource users can request e.g:
> > > > nvidia-10. To provide that support we have to add a new field
> > > > 'mdev_type' so aliases could be something like:
> > > >
> > > > {"vendor_id":"10de", mdev_type="nvidia-10" "name":"alias-nvidia-10"}
> > > > {"vendor_id":"10de", mdev_type="nvidia-11" "name":"alias-nvidia-11"}
> > > >
> > > > I do have plan to add but first I need to have support from upstream
> > > > to continue that work.
> > >
> > > As mentioned in IRC and the previous ML discussion, my focus is on the
> > > nested resource providers work and reviews, along with the other two
> > > top-priority scheduler items (move operations and alternate hosts).
> > >
> > > I'll do my best to look at your patch series, but please note it's lower
> > > priority than a number of other items.
> >
> > No worries, the code is here, tested, fully functionnal and
> > production-ready, I made effort to make it available at the very
> > beginning of the release. With some good volitions we could fix any
> > bugs and have support for vGPUs in Queens.
>
> You cannot say it's tested, fully functional and production-ready until we
> see functional integration tests proving that :)
OK I accept that point :)
> > > One thing that would be very useful, Sahid, if you could get with Eric Fried
> > > (efried) on IRC and discuss with him the "generic device management" system
> > > that was discussed at the PTG. It's likely that the /pci module is going to
> > > be overhauled in Rocky and it would be good to have the mdev device
> > > management API requirements included in that discussion.
>
> Perhaps you missed the above part of my response. I'd like to repeat that it
> would be great to get your input on the generic device management ideas
> we've been throwing around.
Jay, I can help on this, sure, even if it's still not clear when you
are going to consider the characteristics of devices in your generic
management ideas :)
If I can have some explanations, I could start working on reporting
the resources through 'get_inventory()' and rewrite the PciManager to
handle the new 'update_from_inventory()'.
That said, for vGPUs I'm not sure about the spec you have approved, or
perhaps it's a long-term view. I think you have considered vGPUs as
dynamic resources, which may be the case for some hypervisors,
probably XenServer, but it's not, or at least not yet, for
libvirt/QEMU.
I think the first implementation should care only about the type of
vGPU and NUMA placement. We should have called that resource MDEV_GPU,
as we have SRIOV_NET. The operator allocates the vGPU resources based
on requirements and configures flavors based on type/name. That would
be basic support.
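As a rough illustration of that basic support, the sketch below matches
a flavor request against aliases that carry the proposed 'mdev_type'
field. It is only a sketch under assumptions: parse_alias_request(),
pick_devices() and the device dicts are hypothetical and are not the
/pci module's code, and 'mdev_type' is the field I proposed earlier in
this thread, not an existing one.

```python
import json

# Aliases in the shape proposed in this thread; "mdev_type" is the
# proposed new field (an assumption, not an existing alias key).
ALIASES = [json.loads(s) for s in (
    '{"vendor_id": "10de", "mdev_type": "nvidia-10", "name": "alias-nvidia-10"}',
    '{"vendor_id": "10de", "mdev_type": "nvidia-11", "name": "alias-nvidia-11"}',
)]


def parse_alias_request(value):
    """Parse a flavor property value such as "alias-nvidia-10:1"."""
    name, _, count = value.partition(":")
    return name, int(count or 1)


def pick_devices(request, devices, aliases=ALIASES):
    """Return devices satisfying the request, or None if too few match."""
    name, count = parse_alias_request(request)
    spec = next((a for a in aliases if a["name"] == name), None)
    if spec is None:
        raise ValueError("unknown alias: %s" % name)
    # Every key of the alias except its name must match the device.
    wanted = {k: v for k, v in spec.items() if k != "name"}
    matches = [d for d in devices
               if all(d.get(k) == v for k, v in wanted.items())]
    return matches[:count] if len(matches) >= count else None


# Hypothetical devices reported by a host.
host_devices = [
    {"vendor_id": "10de", "mdev_type": "nvidia-10", "numa_node": 0},
    {"vendor_id": "10de", "mdev_type": "nvidia-11", "numa_node": 1},
]
chosen = pick_devices("alias-nvidia-10:1", host_devices)
```

Here the type alone decides the match; NUMA placement would be a
further filter on the matched devices and is left out of this sketch.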
> All the best,
> -jay
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev