[cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)

Feng, Shaohe shaohe.feng at intel.com
Tue Jun 23 12:25:57 UTC 2020

-----Original Message-----
From: Sean Mooney <smooney at redhat.com> 
Sent: June 23, 2020 19:48
To: Feng, Shaohe <shaohe.feng at intel.com>; openstack-discuss at lists.openstack.org
Cc: yumeng_bao at yahoo.com; shhfeng at 126.com
Subject: Re: [cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)

On Tue, 2020-06-23 at 05:50 +0000, Feng, Shaohe wrote:
> Hi all,
> Currently OpenStack supports vGPU as follows:
> https://docs.openstack.org/nova/latest/admin/virtual-gpu.html
> To support it, the admin must plan ahead and configure the vGPU types before deployment, as follows:
> https://docs.openstack.org/nova/latest/configuration/config.html#devices.enabled_vgpu_types
> This is very inconvenient for the administrator. This method also has
> the limitation that the same PCI address cannot provide two different
> types.

that is a matter of perspective. there are those that prefer to check all of their configuration into git and have a declarative deployment, and those that wish to drive everything via the api. for the former, having to configure available resources via cyborg's api or placement would be considered very inconvenient.

> Cyborg as an accelerator management tool is more suitable for mdev device management.
maybe, but maybe not.
i do think that cyborg should support mdevs. i do not think we should have a dedicated vgpu mdev driver, however.
i think we should create a stateless mdev driver that uses a similar whitelist of allowed mdev types and devices.

we did briefly discuss adding generic mdev support to nova in a future release (W), or whether we should delegate that to cyborg, but i would hesitate to do that if cyborg continues its current design where drivers have no configuration element to whitelist devices, as it makes it much harder for deployment tools to properly configure a host declaratively.
[Feng, Shaohe] 
We do support config. For example, our demo for FPGA pre-programming uses it, and we support config for our new drivers.
As with other accelerators, the infrastructure may also need accelerators, not only the VMs.
For example, Cinder can use QAT for compression/crypto, and a VM can also use QAT.
We need to configure which QATs are for infra and which are for VMs.

> One solution is as follows:
> Firstly, we need a vendor driver (this can be a plugin). It is used to
> discover its special devices and report them to placement for scheduling.
> The differences from the current implementation are:
> 1. Report the mdev_supported_types as traits to the resource provider.
> How to discover a GPU type:
> $ ls /sys/class/mdev_bus/*/mdev_supported_types
> /sys/class/mdev_bus/0000:84:00.0/mdev_supported_types:
> nvidia-35  nvidia-36  nvidia-37  nvidia-38  nvidia-39  nvidia-40
> nvidia-41  nvidia-42  nvidia-43  nvidia-44  nvidia-45
> So here we report nvidia-3* and nvidia-4* as traits to the resource provider.
> 2. Report the number of allocable resources, instead of vGPU unit
> numbers, to the resource provider inventory.
> Example for the NVIDIA V100 PCIe card (one GPU per board):
> Virtual GPU Type   Frame Buffer (GB)   Maximum vGPUs per GPU   Maximum vGPUs per Board
> V100D-32Q          32                  1                       1
> V100D-16Q          16                  2                       2
> V100D-8Q           8                   4                       4
> V100D-4Q           4                   8                       8
> V100D-2Q           2                   16                      16
> V100D-1Q           1                   32                      32
> So here we report 32G of buffers (an example, maybe other resources) to
> the resource provider inventory.

in this specific example that would not be a good idea.
the V100 does not support mixing mdev types on the same gpu, so if you allocate a V100D-16Q instance using 16G of the buffer you cannot then allocate 2 V100D-8Q vgpu instances to consume the remaining 16G. other mdev-based devices may not have this limitation, but nvidia only supports 1 active mdev type per physical gpu.

note that the ampere generation has a dynamic sriov-based multi-instance gpu technology which kind of allows resource-based subdivision of the device, but it does not quite work the way you are describing above.

so you can report inventories of custom resource classes for each of the mdev types, or a single inventory of VGPU with traits modelling the available mdevs.

with the trait approach, before a vgpu is allocated you report all traits, and the total count for the inventory would be the highest amount, e.g. 32 in the case above. then when a gpu is allocated you need to update the reserved value and remove the other traits.

if you have 1 inventory per mdev type then you set reserved = total for all inventories for the other mdev types but no need for traits.
[Feng, Shaohe] 
Oh, really sorry, I should have chosen a better example.
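
The two inventory models described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the dict layout borrows the `total` and `reserved` field names from Placement inventories but calls no real Placement API, the capacities come from the V100 table quoted earlier, and the helper names (`custom_name`, `per_type_inventories`, `single_vgpu_inventory`) are hypothetical. In particular, computing `reserved` as the total minus the capacity of the active type is one reading of "update the reserved value", not something the thread spells out.

```python
# Capacities per mdev type, taken from the V100 table quoted in the thread.
MDEV_CAPACITY = {"V100D-32Q": 1, "V100D-16Q": 2, "V100D-8Q": 4,
                 "V100D-4Q": 8, "V100D-2Q": 16, "V100D-1Q": 32}


def custom_name(mdev_type):
    """Map an mdev type such as 'V100D-16Q' to 'CUSTOM_V100D_16Q'."""
    return "CUSTOM_" + mdev_type.upper().replace("-", "_")


def per_type_inventories(capacity, active_type=None):
    """One inventory per mdev type, no traits needed.

    Once a type is active on the GPU, every other type is blocked by
    setting reserved = total.
    """
    inventories = {}
    for mdev_type, total in capacity.items():
        blocked = active_type is not None and mdev_type != active_type
        inventories[custom_name(mdev_type)] = {
            "total": total,
            "reserved": total if blocked else 0,
        }
    return inventories


def single_vgpu_inventory(capacity, active_type=None):
    """A single VGPU inventory plus traits naming the usable mdev types."""
    total = max(capacity.values())  # highest count, 32 in the V100 example
    if active_type is None:
        # Nothing allocated yet: advertise every type.
        return {"total": total, "reserved": 0,
                "traits": sorted(custom_name(t) for t in capacity)}
    # Assumption: reserve everything the active type cannot provide.
    usable = capacity[active_type]
    return {"total": total, "reserved": total - usable,
            "traits": [custom_name(active_type)]}
```

For example, `per_type_inventories(MDEV_CAPACITY, "V100D-16Q")` leaves only the 16Q inventory consumable, which matches the "1 active mdev type per physical gpu" restriction.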

> 3. The driver should also support a function to create a certain mdev type,
> such as V100D-1Q or V100D-2Q.
> Secondly, we need an mdev extended ARQ (it can be a plugin):
no, it should not be a plugin, it should just be another attachment type.
> Here is an example for the FPGA ext ARQ:
> https://review.opendev.org/#/c/681005/26/cyborg/objects/extarq/fpga_ext_arq.py@206
> The difference is that we replace _do_programming with _do_create_mdev.
> _do_programming is used to create a new FPGA function.
> _do_create_mdev is used to create an mdev of a new type; it will call the implementation function in the vendor driver.
well, not really. it will echo a uuid into a file in sysfs, and that will trigger the vendor driver to create the mdev, but cyborg is not actually linking to the vendor driver and invoking a c function directly.
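
The sysfs write described above can be sketched as below. It relies on the kernel's mediated-device layout, where each entry under `mdev_supported_types` exposes a `create` file and an `available_instances` file. The function names and the `sysfs_root` parameter are hypothetical; the parameter exists only so the sketch can be pointed at a scratch directory. On a real host the write needs root and a loaded vendor driver.

```python
import os
import uuid


def available_instances(parent, mdev_type, sysfs_root="/sys/class/mdev_bus"):
    """Return how many more mdevs of this type the parent can create."""
    path = os.path.join(sysfs_root, parent, "mdev_supported_types",
                        mdev_type, "available_instances")
    with open(path) as f:
        return int(f.read().strip())


def create_mdev(parent, mdev_type, sysfs_root="/sys/class/mdev_bus"):
    """Ask the kernel to create one mdev of ``mdev_type`` on ``parent``.

    ``parent`` is the parent device address, e.g. '0000:84:00.0'.
    Returns the UUID that now identifies the new mediated device.
    """
    mdev_uuid = str(uuid.uuid4())
    create_path = os.path.join(sysfs_root, parent, "mdev_supported_types",
                               mdev_type, "create")
    # Equivalent to: echo $UUID > .../mdev_supported_types/<type>/create
    # The vendor driver reacts to this write; no vendor code is called here.
    with open(create_path, "w") as f:
        f.write(mdev_uuid)
    return mdev_uuid
```

The returned UUID is the only thing the caller needs to keep, which is why the attachment discussed next can carry a uuid rather than a PCI address.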

> Lastly, we need to support an mdev handler for XML generation in nova;
> we can refer to the cyborg PCI handler in nova.
yes, i brought this up at the nova ptg as something we will likely need to add support for. as i was suggesting above, this can be done by adding a new attachment type, mdev, that just contains the uuid instead of the pci address.
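
As a rough illustration of an mdev attachment that carries only a uuid, the sketch below builds the libvirt `<hostdev>` element for a mediated device using the Python standard library. The `attach_handle` dict and its field names are hypothetical, not Cyborg's actual schema, and this is not nova's XML generation code; only the element layout follows libvirt's documented mdev hostdev format.

```python
import xml.etree.ElementTree as ET

# Hypothetical attach handle: field names are illustrative only.
attach_handle = {
    "attach_type": "MDEV",
    "attach_info": {"uuid": "c60cc781-3a4e-4c2b-9d0a-6a7f2b1e5d11"},
}


def mdev_hostdev_xml(mdev_uuid):
    """Build a libvirt <hostdev> element for an existing mdev UUID."""
    hostdev = ET.Element("hostdev", {"mode": "subsystem",
                                     "type": "mdev",
                                     "model": "vfio-pci"})
    source = ET.SubElement(hostdev, "source")
    # The mdev is addressed by UUID, not by a PCI domain/bus/slot/function.
    ET.SubElement(source, "address", {"uuid": mdev_uuid})
    return ET.tostring(hostdev, encoding="unicode")


def handle_to_xml(handle):
    """Dispatch on the attachment type; only MDEV is sketched here."""
    if handle["attach_type"] != "MDEV":
        raise ValueError("unsupported attach_type: %s" % handle["attach_type"])
    return mdev_hostdev_xml(handle["attach_info"]["uuid"])
```

Calling `handle_to_xml(attach_handle)` yields a single `<hostdev>` element that could be placed in a domain's `<devices>` section.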

> So after the above changes:
> The admin can create device profiles for different SLAs, such as:
> {"name": "Gold_vGPU",
>  "groups": [
>      {"resources:vGPU_BUFFERS": "16",
>       "traits:V100D-16Q": "required"}
>  ]
> }
> And
> {"name": "Iron_vGPU",
>  "groups": [
>      {"resources:vGPU_BUFFERS": "1",
>       "traits:V100D-1Q": "required"}
>  ]
> }
> Then a tenant can use Gold_vGPU to create a VM with a V100D-16Q
> vGPU, and another tenant can use Iron_vGPU to create a VM with a
> V100D-1Q vGPU.
it cannot do this on the same physical gpus, but yes, that could work, or you could do
{"name": "Gold_vGPU", "groups": [{"resources:CUSTOM_V100D-16Q": "1"}]}

currently for nova we just do resources:VGPU and you can optionally do trait:CUSTOM_

by the way, we cannot use standard resource classes or traits for the mdev types, as these are arbitrary strings chosen by the vendor that can potentially change based on kernel or driver version. so we should not add them to os-traits or os-resource-classes and instead should use CUSTOM_ resource classes or traits for them.
[Feng, Shaohe] 
Yes, we will use CUSTOM_ for them.
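
Since mdev type names such as nvidia-35 contain characters that custom Placement names do not allow (they must match `^CUSTOM_[A-Z0-9_]+$`), a driver would need to normalise them before reporting. The following is a minimal sketch of that step together with the sysfs discovery shown earlier in the thread. Both function names and the `sysfs_root` parameter are hypothetical; the parameter is there so the code can be run against a scratch directory.

```python
import glob
import os
import re


def to_custom_name(mdev_type):
    """Turn a vendor mdev type into a CUSTOM_ trait or resource class name.

    Anything outside A-Z, 0-9 and '_' is replaced with '_'.
    """
    return "CUSTOM_" + re.sub(r"[^A-Z0-9_]", "_", mdev_type.upper())


def discover_mdev_types(sysfs_root="/sys/class/mdev_bus"):
    """Return {parent_address: [mdev type names]} found under sysfs."""
    found = {}
    pattern = os.path.join(sysfs_root, "*", "mdev_supported_types")
    for types_dir in sorted(glob.glob(pattern)):
        # The parent directory name is the device address, e.g. 0000:84:00.0
        parent = os.path.basename(os.path.dirname(types_dir))
        found[parent] = sorted(os.listdir(types_dir))
    return found


def traits_for_host(sysfs_root="/sys/class/mdev_bus"):
    """Map each parent device to the CUSTOM_ names it could report."""
    return {parent: [to_custom_name(t) for t in types]
            for parent, types in discover_mdev_types(sysfs_root).items()}
```

Because the vendor strings may change with the kernel or driver version, the mapping is computed at discovery time rather than hard-coded.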
> When the ARQ is bound during VM creation, Cyborg will call the vendor driver to create the expected mdev vGPU.
> And these 2 mdev vGPUs can be on the same physical GPU card.
> The mdev extended ARQ and the vendor driver can be plugins; they are loosely coupled with the upstream code.
it should not be a plugin. it is a generic virtualisation attachment mode that can be used by any device. we should standardise the attachment handle in core cyborg and add support for that attachment model in nova.
we already have support for generating the mdev xml, so we would only need to wire up the handling for the attachment type in the code that currently handles the pci attachment type.
[Feng, Shaohe] 
Yes, we will support a standard generic ARQ.
We only extend the ARQ for some special accelerators; FPGA is an example.
> So downstream can take the upstream code and customize its own mdev extended ARQ and vendor driver.
> Here vGPU is just an example; it can be other mdev devices.

yes, so because this can be used for other devices, i would counter-propose that we should create a stateless mdev driver to cover devices that do not require programming or state management, and have a config-driven interface to map mdev types to custom resource classes and/or traits. we also need to declare per device whether it supports independent pools of each mdev type or whether they consume the same resource, i.e. does it work like nvidia's V100s, where you can only have 1 mdev type per physical device, or does it work like a device where you can have 1 device and multiple mdev types that can be consumed in parallel.

both approaches are valid, although i personally prefer when they are independent pools since that is easier to reason about.

you could also support a non-config-driven approach where we use attributes on the deployable to describe the mapping of the mdev type to resource class, and whether it is independently consumable too, i guess, but that seems much more cumbersome to manage.
[Feng, Shaohe] 
Both the stateless mdev driver and the non-config-driven approach can be supported.
If these cannot satisfy users, they can add their own special mdev driver.
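
To make the config-driven option concrete, here is a small sketch of how a stateless mdev driver might read a whitelist that maps mdev types to devices and custom resource classes, and declares whether each type has an independent pool. Every section and option name below (`[mdev]`, `enabled_mdev_types`, `device_addresses`, `resource_class`, `independent_pool`) is an assumption made for illustration, loosely modelled on the nova vGPU whitelist linked at the top of the thread; none of them is existing Cyborg configuration.

```python
import configparser

# Hypothetical configuration; option names are illustrative only.
SAMPLE_CONF = """
[mdev]
enabled_mdev_types = nvidia-35, nvidia-36

[mdev_nvidia-35]
device_addresses = 0000:84:00.0
resource_class = CUSTOM_NVIDIA_35
independent_pool = false

[mdev_nvidia-36]
device_addresses = 0000:85:00.0, 0000:86:00.0
resource_class = CUSTOM_NVIDIA_36
independent_pool = false
"""


def split_list(value):
    """Split a comma-separated option into a list of stripped items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def load_mdev_whitelist(text):
    """Parse the whitelist into {mdev_type: settings}."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    whitelist = {}
    for mdev_type in split_list(parser.get("mdev", "enabled_mdev_types")):
        section = parser["mdev_" + mdev_type]
        whitelist[mdev_type] = {
            "device_addresses": split_list(section["device_addresses"]),
            "resource_class": section["resource_class"],
            # False means the type shares capacity with other types on the
            # same physical device, as on the V100 discussed above.
            "independent_pool": section.getboolean("independent_pool"),
        }
    return whitelist
```

A file like this can be checked into git and templated by deployment tools, which is the declarative property argued for earlier in the thread.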

> BR
> Shaohe Feng
