[cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)

Sean Mooney smooney at redhat.com
Tue Jun 23 11:47:39 UTC 2020

On Tue, 2020-06-23 at 05:50 +0000, Feng, Shaohe wrote:
> Hi all,
> Currently OpenStack supports vGPU as follows:
> https://docs.openstack.org/nova/latest/admin/virtual-gpu.html
> In order to support it, the admin must plan ahead and configure the vGPU types before deployment, as follows:
> https://docs.openstack.org/nova/latest/configuration/config.html#devices.enabled_vgpu_types
> This is very inconvenient for the administrator. This method also has the limitation that the same PCI address cannot
> provide two different types.

that is a matter of perspective. there are those that prefer to check all of their configuration into git and have a
declarative deployment, and those that wish to drive everything via the api. for the former, having to configure available
resources via cyborg's api or placement would be considered very inconvenient.

> Cyborg as an accelerator management tool is more suitable for mdev device management.
maybe but maybe not.
i do think that cyborg should support mdevs
i do not think we should have a dedicated vgpu mdev driver however.
i think we should create a stateless mdev driver that uses a similar whitelist of allowed
mdev types and devices.
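
as a rough sketch of what a whitelist-driven, stateless discovery step could look like: the function names and the
whitelist shape below are hypothetical and are not cyborg's driver api; only the sysfs layout under
/sys/class/mdev_bus is taken from the kernel.

```python
import os


def discover_mdev_types(sysfs_root="/sys/class/mdev_bus"):
    """Map each mdev-capable parent device to the mdev types it exposes."""
    found = {}
    if not os.path.isdir(sysfs_root):
        return found
    for addr in sorted(os.listdir(sysfs_root)):
        types_dir = os.path.join(sysfs_root, addr, "mdev_supported_types")
        if os.path.isdir(types_dir):
            found[addr] = sorted(os.listdir(types_dir))
    return found


def apply_whitelist(discovered, whitelist):
    """Keep only the devices and mdev types the operator has allowed.

    `whitelist` (hypothetical config shape) maps a parent device address
    to a set of allowed mdev types. devices missing from the whitelist
    are dropped entirely.
    """
    allowed = {}
    for addr, types in discovered.items():
        keep = [t for t in types if t in whitelist.get(addr, set())]
        if keep:
            allowed[addr] = keep
    return allowed
```

nothing is stored between calls, so a deployment tool only has to render the whitelist into the config file.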

we did briefly discuss adding generic mdev support to nova in a future release (w) or if we should delegate
that to cyborg, but i would hesitate to do that if cyborg continues its current design where drivers have
no configuration element to whitelist devices, as it makes it much harder for deployment tools to properly
configure a host declaratively.
> One solution as follow:
> Firstly, we need a vendor driver (this can be a plugin). It is used to discover its special devices and report them
> to placement for scheduling.
> The difference from the current implementation is that:
> 1. report the mdev_supported_types as traits to resource provider.
> How to discover a GPU type:
> $ ls /sys/class/mdev_bus/*/mdev_supported_types
> /sys/class/mdev_bus/0000:84:00.0/mdev_supported_types:
> nvidia-35  nvidia-36  nvidia-37  nvidia-38  nvidia-39  nvidia-40  nvidia-41  nvidia-42  nvidia-43  nvidia-44
> nvidia-45
> so here we report nvidia-3*, nvidia-4* as traits to resource provider.
> 2. Report the number of allocable resources instead of vGPU unit numbers to resource provider inventory
> Example for the NVidia V100 PCIe card (one GPU per board) :
> Virtual GPU Type    Frame Buffer (Gbytes)    Maximum vGPUs per GPU    Maximum vGPUs per Board
> V100D-32Q           32                       1                        1
> V100D-16Q           16                       2                        2
> V100D-8Q            8                        4                        4
> V100D-4Q            4                        8                        8
> V100D-2Q            2                        16                       16
> V100D-1Q            1                        32                       32
> so here we report 32G Buffers(an example, maybe other resources) to resource provider inventory

in this specific example that would not be a good idea.
the V100 does not support mixing mdev types on the same gpu
so if you allocate a V100D-16Q instance using 16G of the buffer you cannot then allocate 2 V100D-8Q
vgpu instances to consume the remaining 16G.
other mdev based devices may not have this limitation but nvidia only supports 1 active mdev type per physical gpu.

note that the ampere generation has a dynamic sriov based multi instance gpu technology which kind of allows resource
based subdivision of the device, but it does not quite work the way you are describing above.

so you can report inventories of custom resource classes for each of the mdev types, or a single inventory of VGPU
with traits modelling the available mdevs.

with the trait approach, before a vgpu is allocated you report all traits and the total count for the inventory would be
the highest amount, e.g. 32 in the case above. then when a gpu is allocated you need to update the reserved value and
remove the other traits.
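
a toy model of that bookkeeping, using the V100 numbers from the table. this is a sketch under my reading of the
approach (reserved = total minus the per-gpu maximum of the chosen type); the dict layout and function names are
hypothetical and this does not call placement.

```python
MAX_PER_GPU = {"V100D_32Q": 1, "V100D_16Q": 2, "V100D_8Q": 4,
               "V100D_4Q": 8, "V100D_2Q": 16, "V100D_1Q": 32}


def new_provider(max_per_gpu):
    """One VGPU inventory sized for the smallest type, plus a trait per type."""
    return {
        "total": max(max_per_gpu.values()),
        "reserved": 0,
        "used": 0,
        "traits": {"CUSTOM_" + t for t in max_per_gpu},
    }


def allocate_vgpu(provider, mdev_type, max_per_gpu):
    """Claim one VGPU of mdev_type; the first claim pins the gpu to that type."""
    trait = "CUSTOM_" + mdev_type
    if trait not in provider["traits"]:
        raise ValueError("gpu does not offer (or no longer offers) " + mdev_type)
    if provider["used"] == 0:
        # first allocation: drop the other traits and reserve the units
        # that this mdev type can never provide
        provider["traits"] = {trait}
        provider["reserved"] = provider["total"] - max_per_gpu[mdev_type]
    if provider["used"] >= provider["total"] - provider["reserved"]:
        raise ValueError("no capacity left for " + mdev_type)
    provider["used"] += 1
```

after one V100D_16Q allocation the provider has reserved = 30 and only the CUSTOM_V100D_16Q trait, so a request
for V100D_8Q can no longer match this gpu.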

if you have 1 inventory per mdev type then, once a vgpu is allocated, you set reserved = total on the inventories
for the other mdev types, and there is no need for traits.
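
the per-type variant, sketched the same way. again the structure and names are made up for illustration and only
model the accounting, not the placement api.

```python
def new_inventories(max_per_gpu):
    """One custom resource class inventory per mdev type."""
    return {
        "CUSTOM_" + mdev_type: {"total": count, "reserved": 0, "used": 0}
        for mdev_type, count in max_per_gpu.items()
    }


def claim(inventories, resource_class):
    """Claim one unit and close the inventories of every other mdev type."""
    inv = inventories[resource_class]
    if inv["used"] >= inv["total"] - inv["reserved"]:
        raise ValueError("no capacity left in " + resource_class)
    inv["used"] += 1
    for name, other in inventories.items():
        if name != resource_class:
            # the gpu is now pinned to one type, so nothing else is schedulable
            other["reserved"] = other["total"]
```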

> 3. driver should also support a function to create certain mdev type, such as (V100D-1Q,  V100D-2Q,)
> Secondly, we need a mdev extend ARQ(it can be a plugin):
no it should not be a plugin, it should just be another attachment type.
> Here is an example for fpga ext arq:
> https://review.opendev.org/#/c/681005/26/cyborg/objects/extarq/fpga_ext_arq.py@206
> The difference is that we replace _do_programming with _do_create_mdev.
> _do_programming is used to create a new FPGA function.
> _do_create_mdev is used to create an mdev of a new type; it will call the implementation function in the vendor driver.
well not really, it will echo a uuid into a file in sysfs. that will trigger the vendor driver to create the mdev, but
cyborg is not actually linking to the vendor driver and invoking a c function directly.
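
for reference, the sysfs interaction is small. a minimal sketch, assuming the standard kernel mdev layout where
each type directory has a "create" file; the function name and the sysfs_root parameter are made up for
illustration and this is not cyborg or nova code.

```python
import os
import uuid


def create_mdev(parent_addr, mdev_type,
                sysfs_root="/sys/class/mdev_bus", mdev_uuid=None):
    """Ask the kernel to create an mdev by writing a uuid to the create file."""
    mdev_uuid = mdev_uuid or str(uuid.uuid4())
    create_path = os.path.join(
        sysfs_root, parent_addr, "mdev_supported_types", mdev_type, "create")
    # the write is the whole "call": the vendor driver reacts to it in kernel
    with open(create_path, "w") as f:
        f.write(mdev_uuid)
    return mdev_uuid
```

on a real host this needs root, e.g. create_mdev("0000:84:00.0", "nvidia-35").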

> At last we need to support a mdev handler for xml generation in nova, we can refer to the cyborg PCI handler in nova
yes i brought this up in the nova ptg as something we will likely need to add support for. as i was suggesting above
this can be done by adding a new attachment type which is mdev that just contains the uuid instead of the pci address.
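
roughly what the nova side could do with such a handle. the "MDEV" attach type and the "uuid" key inside
attach_info are assumptions for this sketch, not an existing cyborg schema; the xml shape follows libvirt's
hostdev element for mediated devices.

```python
import json
import xml.etree.ElementTree as ET


def mdev_hostdev_xml(attach_handle):
    """Render a libvirt <hostdev> element for a hypothetical MDEV handle."""
    if attach_handle["attach_type"] != "MDEV":
        raise ValueError("not an mdev attach handle")
    info = json.loads(attach_handle["attach_info"])
    hostdev = ET.Element("hostdev", mode="subsystem", type="mdev",
                         model="vfio-pci")
    source = ET.SubElement(hostdev, "source")
    # an mdev is addressed by uuid, not by pci domain/bus/device/function
    ET.SubElement(source, "address", uuid=info["uuid"])
    return ET.tostring(hostdev, encoding="unicode")
```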

> So after the above changes:
> Admin can create different SLA devices profiles such as:
> {"name": "Gold_vGPU",
>  "groups": [
>      {"resources:vGPU_BUFFERS": "16",
>       "traits: V100D-16Q,": "required",
>      }]
> }
> And
> {"name": "Iron_vGPU",
>  "groups": [
>      {"resources:vGPU_BUFFERS": "1",
>       "traits: V100D-1Q,": "required",
>      }]
> }
> Then a tenant can use Gold_vGPU to create a VM with a V100D-16Q vGPU
> And another tenant can use Iron_vGPU to create a VM with a V100D-1Q vGPU
it cannot do this on the same physical gpus but yes that could work
or you could do
{"name": "Gold_vGPU", "groups": [{"resources:CUSTOM_V100D_16Q": "1"}]}

currently for nova we just do resources:VGPU and you can optionally do trait:CUSTOM_

by the way we cannot use standard resource classes or traits for the mdev types
as these are arbitrary strings chosen by the vendor that can potentially change based
on kernel or driver version, so we should not add them to os-traits or os-resource-classes
and instead should use CUSTOM_ resource classes or traits for them.
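
a small sketch of deriving such a name from a vendor string. the helper name is made up; the rule it applies
(upper case, anything outside A-Z, 0-9 and _ replaced with _, then a CUSTOM_ prefix) is my understanding of what
placement accepts for custom resource classes and traits.

```python
import re


def custom_name(mdev_type):
    """Turn a vendor mdev type string into a CUSTOM_ style name."""
    normalized = re.sub(r"[^A-Z0-9_]", "_", mdev_type.upper())
    return "CUSTOM_" + normalized
```

so custom_name("nvidia-35") gives CUSTOM_NVIDIA_35 and custom_name("V100D-16Q") gives CUSTOM_V100D_16Q.
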
> During ARQ binding at VM creation, Cyborg will call the vendor driver to create the expected mdev vGPU.
> And these 2 mdev vGPUs can be on the same physical GPU card.
> The mdev extend ARQ and vendor driver can be plugins; they are loosely coupled with the upstream code.
it should not be a plugin, it's a generic virtualisation attachment mode that can be used by any device.
we should standardise the attachment handle in core cyborg and add support for that attachment model in nova.
we already have support for generating the mdev xml so we would only need to wire up the handling for the attachment
type in the code that currently handles the pci attachment type.
> So the downstream can take the upstream code and customize its own mdev extend ARQ and vendor driver.
> Here vGPU is just an example, it can be other mdev devices.

yes so because this can be used for other devices that is why i would counter propose that we should
create a stateless mdev driver to cover devices that do not require programming or state management and have
a config driven interface to map mdev types to custom resource classes and/or traits. we also need to declare per
device if it supports independent pools of each mdev type or if they consume the same resource, i.e. does it work
like nvidia's v100s where you can only have 1 mdev type per physical device, or does it work like devices
where you can have 1 device and multiple mdev types that can be consumed in parallel.
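
to make the per-device declaration concrete, here is one possible shape for it. every field and function name
below is hypothetical; it only illustrates how a single flag could switch between the two consumption models.

```python
from dataclasses import dataclass


@dataclass
class MdevDeviceConfig:
    address: str              # parent device address from the whitelist
    resource_classes: dict    # mdev type -> CUSTOM_ resource class
    capacity: dict            # mdev type -> max instances of that type
    independent_pools: bool   # False: only one mdev type active at a time


def remaining(config, in_use):
    """Return {resource_class: free count} given {mdev_type: count in use}."""
    active = {t for t, n in in_use.items() if n > 0}
    free = {}
    for mdev_type, rc in config.resource_classes.items():
        left = config.capacity[mdev_type] - in_use.get(mdev_type, 0)
        if not config.independent_pools and active and mdev_type not in active:
            # v100 style: once one type is active the others offer nothing
            left = 0
        free[rc] = left
    return free
```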

both approaches are valid although i personally prefer when they are independent pools since that is easier to reason
about.

you could also support a non config driven approach where we use attributes on the deployable to describe the mapping of
the mdev type to resource class and if it's independently consumable too i guess, but that seems much more cumbersome to
> BR
> Shaohe Feng

More information about the openstack-discuss mailing list