[cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)

Sean Mooney smooney at redhat.com
Tue Jun 23 13:38:09 UTC 2020


On Tue, 2020-06-23 at 12:25 +0000, Feng, Shaohe wrote:
> 
> -----Original Message-----
> From: Sean Mooney <smooney at redhat.com> 
> Sent: June 23, 2020 19:48
> To: Feng, Shaohe <shaohe.feng at intel.com>; openstack-discuss at lists.openstack.org
> Cc: yumeng_bao at yahoo.com; shhfeng at 126.com
> Subject: Re: [cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)
> 
> On Tue, 2020-06-23 at 05:50 +0000, Feng, Shaohe wrote:
> > Hi all,
> > 
> > Currently OpenStack supports vGPU as follows:
> > https://docs.openstack.org/nova/latest/admin/virtual-gpu.html
> > 
> > To support it, the admin has to plan ahead and configure the vGPU types before deployment, as follows:
> > https://docs.openstack.org/nova/latest/configuration/config.html#devices.enabled_vgpu_types
> > This is very inconvenient for the administrator, and the method has the limitation that the same PCI
> > address cannot provide two different types.
> 
> that is a matter of perspective. there are those that prefer to check all of their configuration into git and have a
> declarative deployment, and those that wish to drive everything via the api. for the former, having to configure available
> resources via cyborg's api or placement would be considered very inconvenient.
> 
> > 
> > Cyborg as an accelerator management tool is more suitable for mdev device management.
> 
> maybe, but maybe not.
> i do think that cyborg should support mdevs; i do not think we should have a dedicated vgpu mdev driver, however.
> i think we should create a stateless mdev driver that uses a similar whitelist of allowed mdev types and devices.
> 
> we did briefly discuss adding generic mdev support to nova in a future release (w) or whether we should delegate that to
> cyborg, but i would hesitate to do that if cyborg continues its current design where drivers have no configuration
> element to whitelist devices, as it makes it much harder for deployment tools to properly configure a host declaratively.
> [Feng, Shaohe] 
> We do support config: for example, in our demo for FPGA pre-programming, and we support config for our new drivers.
> And as with other accelerators, the infra may also need accelerators for its own acceleration, not only the VMs.
> For example, cinder can use QAT for compression/crypto, and VMs can also use QAT.
> We need to configure which QATs are for infra and which are for VMs.
yes, qat is a good example of where sharing between host usage (cinder service) and guest usage (vms) could be required.
the same could be true of gpus. typically many servers run headless, but not always, and sometimes you will want to reserve
a gpu for the host to use. nics are another good example: when we look at generic pci passthrough we need to select which
nics will be used by the vms and which will be used for host connectivity or for hardware-offloaded ovs.

> 
> 
> > 
> > One solution is as follows:
> > First, we need a vendor driver (this can be a plugin). It is used to
> > discover its specific devices and report them to placement for scheduling.
> > The differences from the current implementation are:
> > 1. Report the mdev_supported_types as traits on the resource provider.
> > How to discover a GPU type:
> > $ ls /sys/class/mdev_bus/*/mdev_supported_types
> > /sys/class/mdev_bus/0000:84:00.0/mdev_supported_types:
> > nvidia-35  nvidia-36  nvidia-37  nvidia-38  nvidia-39  nvidia-40
> > nvidia-41  nvidia-42  nvidia-43  nvidia-44  nvidia-45
> > So here we report nvidia-3* and nvidia-4* as traits on the resource provider.
> > 2. Report the amount of allocatable resources, instead of the number of vGPU units,
> > to the resource provider inventory.
> > Example for the NVIDIA V100 PCIe card (one GPU per board):
> > Virtual GPU Type   Frame Buffer (Gbytes)   Maximum vGPUs per GPU   Maximum vGPUs per Board
> > V100D-32Q          32                      1                       1
> > V100D-16Q          16                      2                       2
> > V100D-8Q           8                       4                       4
> > V100D-4Q           4                       8                       8
> > V100D-2Q           2                       16                      16
> > V100D-1Q           1                       32                      32
> > So here we report 32G of buffer (as an example; it may be other resources) to the
> > resource provider inventory.
> 
> in this specific example that would not be a good idea.
> the V100 does not support mixing mdev types on the same gpu, so if you allocate a V100D-16Q instance using 16G of the
> buffer you cannot then allocate 2 V100D-8Q vgpu instances to consume the remaining 16G. other mdev-based devices may not
> have this limitation, but nvidia only supports 1 active mdev type per physical gpu.
> 
> note that the ampere generation has a dynamic sriov-based multi-instance gpu technology which kind of allows
> resource-based subdivision of the device, but it does not quite work the way you are describing above.
> 
> so you can report inventories of custom resource classes for each of the mdev types, or a single inventory of VGPU
> with traits modelling the available mdevs.
> 
> with the trait approach, before a vgpu is allocated you report all traits, and the total count for the inventory would be
> the highest amount, e.g. 32 in the case above. then when a gpu is allocated you need to update the reserved value and
> remove the other traits.
> 
> if you have 1 inventory per mdev type then you set reserved = total for all inventories of the other mdev types, but
> there is no need for traits.
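
To illustrate the one-inventory-per-mdev-type bookkeeping described above, here is a minimal Python sketch. It is not cyborg or placement code: the Inventory class and the allocate_exclusive helper are hypothetical names, the CUSTOM_ resource class names are made up for the example, and the totals come from the V100 table quoted in this thread.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Inventory:
    """Hypothetical stand-in for one placement inventory record."""
    total: int
    reserved: int = 0
    used: int = 0

    @property
    def free(self) -> int:
        return self.total - self.reserved - self.used


def allocate_exclusive(inventories: Dict[str, Inventory], rc: str) -> None:
    """Consume one unit of `rc` on a GPU that allows one active mdev type."""
    inv = inventories[rc]
    if inv.free < 1:
        raise ValueError(f"no capacity left for {rc}")
    inv.used += 1
    # Block every other mdev type on this physical GPU: reserved = total.
    for name, other in inventories.items():
        if name != rc:
            other.reserved = other.total


# Assumed resource class names, one inventory per mdev type on one GPU.
inventories = {
    "CUSTOM_V100D_16Q": Inventory(total=2),
    "CUSTOM_V100D_8Q": Inventory(total=4),
    "CUSTOM_V100D_1Q": Inventory(total=32),
}
allocate_exclusive(inventories, "CUSTOM_V100D_16Q")
```

After the first V100D-16Q allocation, one more 16Q unit is still free, while the 8Q and 1Q inventories are fully reserved and can no longer be scheduled on this GPU.
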
> [Feng, Shaohe] 
> Oh, really sorry, I should have chosen a better example.
the sample mdpy kernel module, which creates a basic virtual graphics device, supports multiple mdev types for different
resolutions https://github.com/torvalds/linux/blob/f97c81dc6ca5996560b3944064f63fc87eb18d00/samples/vfio-mdev/mdpy.c
i believe it also supports consuming each mdev type independently.
so if you don't want to use real hardware as an example, there are at least sample devices that support having multiple
active mdevs. i would also suggest we use this device for testing in the upstream gate.

i started creating a job to test nova's vgpu support with this sample device a few months back, but we need to make
a few small changes to make it work and i was not sure it was appropriate to modify the nova code just to get ci working
with a fake device. currently we make 1 assumption, that the parent of the mdev is a pci device,
which is not true in the kernel module case.

but from a cyborg perspective you can avoid that mistake, since mdevs can be created for devices on any bus, like usb or
upi, as well as pcie.
> 
> > 3. The driver should also support a function to create a certain mdev type, such as V100D-1Q or V100D-2Q.
> > Secondly, we need an mdev extended ARQ (it can be a plugin):
> 
> no, it should not be a plugin; it should just be another attachment type.
> > Here is an example for the FPGA ext ARQ:
> > https://review.opendev.org/#/c/681005/26/cyborg/objects/extarq/fpga_ext_arq.py@206
> > The difference is that we replace _do_programming with _do_create_mdev.
> > _do_programming is used to create a new FPGA function.
> > _do_create_mdev is used to create an mdev of a new type; it will call the implementation function in the vendor
> > driver.
> 
> well, not really. it will echo a uuid into a file in sysfs, which will trigger the vendor driver to create the mdev, but
> cyborg is not actually linking to the vendor driver and invoking a c function directly.
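
For reference, the sysfs write mentioned above can be sketched in a few lines of Python. This is only an illustration, not cyborg code: create_mdev is a hypothetical helper, and the sysfs_root parameter exists only so the function can be pointed at a scratch directory. The `create` file under mdev_supported_types/<type>/ is the kernel's mediated device interface; writing to the real one requires root and a parent device that supports the type.

```python
import uuid
from pathlib import Path

MDEV_BUS = Path("/sys/class/mdev_bus")


def create_mdev(parent: str, mdev_type: str, sysfs_root: Path = MDEV_BUS) -> str:
    """Ask the kernel to create an mdev of `mdev_type` on `parent`.

    Returns the UUID that identifies the new mediated device.
    """
    create = sysfs_root / parent / "mdev_supported_types" / mdev_type / "create"
    mdev_uuid = str(uuid.uuid4())
    # Equivalent to: echo $UUID > .../mdev_supported_types/<type>/create
    # The vendor driver reacts to this write; nothing is linked or called directly.
    create.write_text(mdev_uuid)
    return mdev_uuid
```

A call would look like create_mdev("0000:84:00.0", "nvidia-35"), using the parent address and type name from the listing earlier in the thread.
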
> 
> > 
> > Finally, we need to support an mdev handler for XML generation in nova;
> > we can refer to the cyborg PCI handler in nova.
> > 
> 
> yes, i brought this up at the nova ptg as something we will likely need to add support for. as i was suggesting above,
> this can be done by adding a new attachment type, mdev, that just contains the uuid instead of the pci address.
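
A rough Python sketch of what such a handler has to produce, under stated assumptions: the "MDEV" attach type, the shape of the info dicts, and the function name are hypothetical and not taken from nova or cyborg. Only the libvirt <hostdev> element layout for type='pci' and type='mdev' is the real format.

```python
import xml.etree.ElementTree as ET


def attach_handle_to_hostdev(attach_type: str, info: dict) -> str:
    """Render a libvirt <hostdev> element for a PCI or (assumed) MDEV handle."""
    if attach_type == "PCI":
        hostdev = ET.Element("hostdev", mode="subsystem", type="pci", managed="yes")
        source = ET.SubElement(hostdev, "source")
        # Assumed info shape: hex strings without the 0x prefix.
        ET.SubElement(
            source, "address",
            domain="0x" + info["domain"], bus="0x" + info["bus"],
            slot="0x" + info["device"], function="0x" + info["function"],
        )
    elif attach_type == "MDEV":
        hostdev = ET.Element("hostdev", mode="subsystem", type="mdev", model="vfio-pci")
        source = ET.SubElement(hostdev, "source")
        # The mdev handle only needs the uuid, not a pci address.
        ET.SubElement(source, "address", uuid=info["uuid"])
    else:
        raise ValueError(f"unsupported attach type: {attach_type}")
    return ET.tostring(hostdev, encoding="unicode")


mdev_xml = attach_handle_to_hostdev(
    "MDEV", {"uuid": "4b20d080-1b54-4048-85b3-a6a62d165c01"})
pci_xml = attach_handle_to_hostdev(
    "PCI", {"domain": "0000", "bus": "84", "device": "00", "function": "0"})
```

The two branches differ only in the attributes of the <address> element, which is why a new attachment type is a small change compared to a plugin mechanism.
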
> 
> > So after the above changes:
> > The admin can create device profiles for different SLAs, such as:
> > {"name": "Gold_vGPU",
> >  "groups": [
> >      {"resources:vGPU_BUFFERS": "16",
> >       "trait:V100D-16Q": "required"}]
> > }
> > and
> > {"name": "Iron_vGPU",
> >  "groups": [
> >      {"resources:vGPU_BUFFERS": "1",
> >       "trait:V100D-1Q": "required"}]
> > }
> > Then a tenant can use Gold_vGPU to create a VM with a V100D-16Q vGPU,
> > and another tenant can use Iron_vGPU to create a VM with a
> > V100D-1Q vGPU.
> 
> it cannot do this on the same physical gpu, but yes, that could work, or you could do
> {"name": "Gold_vGPU", "groups": [{"resources:CUSTOM_V100D-16Q": "1"}]}
> 
> currently for nova we just do resources:VGPU and you can optionally do trait:CUSTOM_
> 
> by the way, we cannot use standard resource classes or traits for the mdev types, as these are arbitrary strings chosen by
> vendors that can potentially change based on kernel or driver version, so we should not add them to os-traits or
> os-resource-classes and instead should use CUSTOM_ resource classes or traits for them.
> [Feng, Shaohe] 
> Yes, we will use CUSTOM_ for them.
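
One practical detail when deriving CUSTOM_ names from vendor mdev type strings: placement only accepts uppercase letters, digits and underscores in custom resource class and trait names, so a type such as V100D-16Q or nvidia-35 has to be normalised first. The helper below is a sketch with hypothetical function names, not an existing cyborg or nova function.

```python
import re

MAX_NAME_LEN = 255  # placement limit for resource class and trait names


def to_custom_name(mdev_type: str) -> str:
    """Map a vendor mdev type string to a valid CUSTOM_ name."""
    # Uppercase, then replace anything outside [A-Z0-9_] with an underscore.
    body = re.sub(r"[^A-Z0-9_]", "_", mdev_type.upper())
    return ("CUSTOM_" + body)[:MAX_NAME_LEN]


def device_profile(name: str, mdev_type: str, count: int = 1) -> dict:
    """Build a device profile that requests `count` units of one mdev type."""
    return {
        "name": name,
        "groups": [{"resources:" + to_custom_name(mdev_type): str(count)}],
    }


gold = device_profile("Gold_vGPU", "V100D-16Q")
```

With this mapping the Gold_vGPU profile requests resources:CUSTOM_V100D_16Q, and the sysfs type nvidia-35 becomes CUSTOM_NVIDIA_35.
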
> > 
> > When the ARQ is bound during VM creation, Cyborg will call the vendor driver to create the expected mdev vGPU.
> > And these 2 mdev vGPUs can be on the same physical GPU card.
> > 
> > The mdev extended ARQ and vendor driver can be plugins; they are loosely coupled with the upstream code.
> 
> it should not be a plugin. it is a generic virtualisation attachment model that can be used by any device. we should
> standardise the attachment handle in core cyborg and add support for that attachment model in nova.
> we already have support for generating the mdev xml, so we would only need to wire up the handling for the attachment
> type in the code that currently handles the pci attachment type.
> [Feng, Shaohe] 
> Yes, we will support a standard generic ARQ.
> We only extend the ARQ for some special accelerators; FPGA is an example.
i am not sure we need to extend the ARQ for FPGA, but perhaps at some point.
nova does not support plugins in general, so as long as the info we need to request or receive does not
vary based on cyborg plugins i guess it could be ok, but outside of the driver layer i would personally avoid
introducing plugins to cyborg.
> > So downstream can take the upstream code and customize its own mdev extended ARQ and vendor driver.
> > Here vGPU is just an example; it can be other mdev devices.
> 
> yes, and because this can be used for other devices, i would counter-propose that we should create a stateless
> mdev driver to cover devices that do not require programming or state management, and have a config-driven interface to
> map mdev types to custom resource classes and/or traits. we also need to declare per device whether it supports
> independent pools of each mdev type or whether they consume the same resource, i.e. does it work like nvidia's v100s where
> you can only have 1 mdev type per physical device, or does it work like the sample device where you can have 1 device and
> multiple mdev types that can be consumed in parallel.
> 
> 
> both approaches are valid, although i personally prefer when they are independent pools since that is easier to reason
> about.
> 
> you could also support a non-config-driven approach where we use attributes on the deployable to describe the mapping of
> the mdev type to resource class and whether it is independently consumable too, i guess, but that seems much more
> cumbersome to manage.
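
To make the config-driven idea concrete, here is one possible shape for such a whitelist, parsed with Python's configparser as a stand-in for oslo.config. Everything about the format is an assumption made for illustration: the [mdev_device:<parent>] section naming, the enabled_types and shared_pool options, and the mdpy type names are hypothetical and do not exist in cyborg today.

```python
import configparser
from dataclasses import dataclass
from typing import List

SAMPLE_CONF = """
[mdev_device:0000:84:00.0]
enabled_types = nvidia-35, nvidia-36
shared_pool = true

[mdev_device:mdpy]
enabled_types = mdpy-vga, mdpy-xga
shared_pool = false
"""


@dataclass
class MdevDeviceConfig:
    parent: str               # parent device name under /sys/class/mdev_bus
    enabled_types: List[str]  # whitelisted mdev types for this parent
    shared_pool: bool         # True: types consume the same resource (V100-like)


def load_mdev_config(text: str) -> List[MdevDeviceConfig]:
    """Parse the hypothetical per-device mdev whitelist sections."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    devices = []
    for section in parser.sections():
        if not section.startswith("mdev_device:"):
            continue
        parent = section.split(":", 1)[1]
        raw_types = parser.get(section, "enabled_types")
        types = [t.strip() for t in raw_types.split(",") if t.strip()]
        shared = parser.getboolean(section, "shared_pool", fallback=True)
        devices.append(MdevDeviceConfig(parent, types, shared))
    return devices


devices = load_mdev_config(SAMPLE_CONF)
```

A file like this can be checked into git and templated by an installer, and the shared_pool flag carries the per-device declaration of whether the mdev types are independent pools or compete for the same resource.
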
> [Feng, Shaohe] 
> Both the stateless mdev driver and the non-config-driven approach can be supported.
> If these cannot satisfy users, they can add their own special mdev driver themselves.
well, if i put my downstream hat on, in terms of productization we are currently debating whether we want to include cyborg
in a future release of Red Hat OpenStack Platform. at the moment it is not planned for our next major release, which is
osp 17, and it is under consideration for osp 18. one of the concerns we have with adding cyborg to a future release is the
lack of a config-driven approach. it is not a blocker, but having an api-only approach, while it has some advantages, also
has several drawbacks, not least of which are supportability and day 1 and day 2 operational complexity.

for example, we have a tool that customers use when reporting bugs called sosreport, which automates the collection of logs,
config and other system information. it can also invoke some commands like virsh list etc., but adding a new config and log
to collect is significantly less work than adding a new command that needs to discover the node uuid and then query the
placement and cyborg apis to determine what accelerators are available and how they are configured. so api-only configured
services are harder for distros to support from a day 2 perspective when things go wrong. from a day 1 perspective it is
also simpler for installer tools to template out a config file than it is to invoke commands imperatively against an api.

so i understand that from an operator's perspective invoking an api to do all config management remotely might be quite
appealing, but it comes with trade-offs. upstream should not be governed by downstream distro concerns, but we should be
mindful not to make it overly hard to deploy and manage the service, as that raises the barrier to integration, and if that
is too high then it can result in not being supported in the long run. that said, having a service that you just deploy
and have no configuration to do would be nice too, but if we tell our users that after they deploy cyborg they must then
iterate over every deployable and decide if they should enable it and what attributes to add to it to schedule correctly,
i think the complexity might be too high for many.
>  
> 
> 
> > 
> > BR
> > Shaohe Feng
> 
> 



