[cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)
Hi all,

Currently OpenStack supports vGPU as follows: https://docs.openstack.org/nova/latest/admin/virtual-gpu.html

To use it, the admin has to plan ahead and configure the vGPU types before deployment, as follows: https://docs.openstack.org/nova/latest/configuration/config.html#devices.ena...

This is very inconvenient for the administrator, and the method has a limitation: a single PCI address cannot provide two different types.

Cyborg, as an accelerator management tool, is more suitable for mdev device management.

One solution is as follows.

Firstly, we need a vendor driver (this can be a plugin). It is used to discover its special devices and report them to placement for scheduling. The differences from the current implementation are:

1. Report the mdev_supported_types as traits to the resource provider. How to discover a GPU type:

   $ ls /sys/class/mdev_bus/*/mdev_supported_types
   /sys/class/mdev_bus/0000:84:00.0/mdev_supported_types:
   nvidia-35 nvidia-36 nvidia-37 nvidia-38 nvidia-39 nvidia-40 nvidia-41 nvidia-42 nvidia-43 nvidia-44 nvidia-45

   So here we report nvidia-3*, nvidia-4* as traits to the resource provider.

2. Report the number of allocable resources, instead of vGPU unit numbers, to the resource provider inventory. Example for the NVIDIA V100 PCIe card (one GPU per board):

   Virtual GPU Type   Frame Buffer (GB)   Max vGPUs per GPU   Max vGPUs per Board
   V100D-32Q          32                  1                   1
   V100D-16Q          16                  2                   2
   V100D-8Q           8                   4                   4
   V100D-4Q           4                   8                   8
   V100D-2Q           2                   16                  16
   V100D-1Q           1                   32                  32

   So here we report 32G of buffers (an example, it may be other resources) to the resource provider inventory.

3. The driver should also support a function to create a certain mdev type, such as V100D-1Q or V100D-2Q.

Secondly, we need an mdev extend ARQ (it can be a plugin). Here is an example for the FPGA ext ARQ: https://review.opendev.org/#/c/681005/26/cyborg/objects/extarq/fpga_ext_arq.... The difference is that we replace _do_programming with _do_create_mdev. _do_programming is used to create a new FPGA function. _do_create_mdev is used to create an mdev of a new type; it will call the implementation function in the vendor driver.

Lastly, we need to support an mdev handler for XML generation in nova; we can refer to the Cyborg PCI handler in nova.

After the above changes, the admin can create device profiles for different SLAs, such as:

   {"name": "Gold_vGPU", "groups": [{"resources:vGPU_BUFFERS": "16", "traits: V100D-16Q,": "required"}]}

and

   {"name": "Iron_vGPU", "groups": [{"resources:vGPU_BUFFERS": "1", "traits: V100D-1Q,": "required"}]}

Then a tenant can use Gold_vGPU to create a VM with a V100D-16Q vGPU, and another tenant can use Iron_vGPU to create a VM with a V100D-1Q vGPU.

During ARQ binding at VM creation, Cyborg will call the vendor driver to create the expected mdev vGPU. These two mdev vGPUs can be on the same physical GPU card.

The mdev extend ARQ and the vendor driver can be plugins; they are loosely coupled with the upstream code, so downstream can take the upstream code and customize its own mdev extend ARQ and vendor driver.

Here vGPU is just an example; it can be other mdev devices.

BR
Shaohe Feng
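The discovery step in point 1 can be sketched in Python roughly as follows. This is only an illustration, not Cyborg driver code: the function name discover_mdev_types is hypothetical, and it assumes the usual sysfs layout in which each type directory under mdev_supported_types carries an available_instances file.

```python
from pathlib import Path


def discover_mdev_types(mdev_bus="/sys/class/mdev_bus"):
    """Map each parent device to {mdev type: available_instances}.

    Hypothetical helper; the sysfs root is a parameter so the function
    can be pointed at a fake tree for testing.
    """
    found = {}
    root = Path(mdev_bus)
    if not root.is_dir():
        return found
    for parent in sorted(root.iterdir()):
        types_dir = parent / "mdev_supported_types"
        if not types_dir.is_dir():
            continue
        types = {}
        for type_dir in sorted(types_dir.iterdir()):
            avail = type_dir / "available_instances"
            # Number of further mdevs of this type that can still be created;
            # treated as 0 when the file is absent.
            types[type_dir.name] = (
                int(avail.read_text().strip()) if avail.is_file() else 0
            )
        found[parent.name] = types
    return found


if __name__ == "__main__":
    for device, types in discover_mdev_types().items():
        print(device, types)
```

A driver could turn the keys of the inner dictionaries into traits and the counts into inventory, whichever modelling is chosen.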
On Tue, 2020-06-23 at 05:50 +0000, Feng, Shaohe wrote:
Hi all,
Currently openstack support vGPU as follow: https://docs.openstack.org/nova/latest/admin/virtual-gpu.html
In order to support it, admin should plan ahead and configure the vGPU before deployable as follow: https://docs.openstack.org/nova/latest/configuration/config.html#devices.ena... This is very inconvenient for the administrator, this method has a limitation that a same PCI address does not provide two different types.
That is a matter of perspective. There are those who prefer to check all of their configuration into git and have a declarative deployment, and those who wish to drive everything via the API. For those who prefer a declarative deployment, having to configure the available resources via Cyborg's API or placement would be considered very inconvenient.
Cyborg as an accelerator management tool is more suitable for mdev device management.
Maybe, but maybe not. I do think that Cyborg should support mdevs. I do not think we should have a dedicated vGPU mdev driver, however. I think we should create a stateless mdev driver that uses a similar whitelist of allowed mdev types and devices. We did briefly discuss whether to add generic mdev support to nova in a future release (W) or to delegate that to Cyborg, but I would hesitate to do that if Cyborg continues its current design, where drivers have no configuration element to whitelist devices, as it makes it much harder for deployment tools to properly configure a host declaratively.
One solution is as follows. Firstly, we need a vendor driver (this can be a plugin). It is used to discover its special devices and report them to placement for scheduling. The differences from the current implementation are:

1. Report the mdev_supported_types as traits to the resource provider. How to discover a GPU type:

   $ ls /sys/class/mdev_bus/*/mdev_supported_types
   /sys/class/mdev_bus/0000:84:00.0/mdev_supported_types:
   nvidia-35 nvidia-36 nvidia-37 nvidia-38 nvidia-39 nvidia-40 nvidia-41 nvidia-42 nvidia-43 nvidia-44 nvidia-45

   So here we report nvidia-3*, nvidia-4* as traits to the resource provider.

2. Report the number of allocable resources, instead of vGPU unit numbers, to the resource provider inventory. Example for the NVIDIA V100 PCIe card (one GPU per board):

   Virtual GPU Type   Frame Buffer (GB)   Max vGPUs per GPU   Max vGPUs per Board
   V100D-32Q          32                  1                   1
   V100D-16Q          16                  2                   2
   V100D-8Q           8                   4                   4
   V100D-4Q           4                   8                   8
   V100D-2Q           2                   16                  16
   V100D-1Q           1                   32                  32

   So here we report 32G of buffers (an example, it may be other resources) to the resource provider inventory.
In this specific example that would not be a good idea. The V100 does not support mixing mdev types on the same GPU, so if you allocate a V100D-16Q instance using 16G of the buffer, you cannot then allocate two V100D-8Q vGPU instances to consume the remaining 16G. Other mdev-based devices may not have this limitation, but NVIDIA only supports one active mdev type per physical GPU. Note that the Ampere generation has a dynamic SR-IOV based multi-instance GPU technology which kind of allows resource-based subdivision of the device, but it does not quite work the way you are describing above.

So you can report inventories of custom resource classes for each of the mdev types, or a single inventory of VGPU with traits modelling the available mdevs. With the trait approach, before a vGPU is allocated you report all traits, and the total count for the inventory would be the highest amount, e.g. 32 in the case above. Then, when a GPU is allocated, you need to update the reserved value and remove the other traits. If you have one inventory per mdev type, then you set reserved = total for all inventories of the other mdev types, but there is no need for traits.
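The two modelling options can be sketched as plain data, to make the bookkeeping concrete. This is a minimal illustration in Python, not placement or Cyborg code: the function names are hypothetical, and the choice of reserved = total - (capacity of the active type) in the trait variant is one reading of "update the reserved value", stated here as an assumption.

```python
# Maximum vGPUs per physical GPU for each V100 mdev type (from the table above).
V100_TYPES = {
    "V100D-32Q": 1, "V100D-16Q": 2, "V100D-8Q": 4,
    "V100D-4Q": 8, "V100D-2Q": 16, "V100D-1Q": 32,
}


def trait_inventory(max_per_type, active_type=None):
    """Single VGPU inventory, with traits describing the usable mdev types."""
    total = max(max_per_type.values())
    if active_type is None:
        # Nothing allocated yet: every type is still possible.
        return {"total": total, "reserved": 0, "traits": sorted(max_per_type)}
    # Assumption: once a type is active, reserve everything beyond its
    # capacity and drop the traits of the other types.
    usable = max_per_type[active_type]
    return {"total": total, "reserved": total - usable, "traits": [active_type]}


def per_type_inventories(max_per_type, active_type=None):
    """One inventory per mdev type; other types are fully reserved once one is active."""
    inventories = {}
    for mdev_type, total in max_per_type.items():
        blocked = active_type is not None and mdev_type != active_type
        inventories[mdev_type] = {"total": total, "reserved": total if blocked else 0}
    return inventories


if __name__ == "__main__":
    print(trait_inventory(V100_TYPES, active_type="V100D-16Q"))
    print(per_type_inventories(V100_TYPES, active_type="V100D-16Q"))
```

In a real deployment the type names would presumably be reported under CUSTOM_ traits or resource classes rather than as raw strings.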
3. driver should also support a function to create certain mdev type, such as (V100D-1Q, V100D-2Q,) Secondly, we need a mdev extend ARQ(it can be a plugin):

No, it should not be a plugin; it should just be another attachment type.

Here is an example for fpga ext arq: https://review.opendev.org/#/c/681005/26/cyborg/objects/extarq/fpga_ext_arq.... The difference is that, we replace the _do_programming to _do_create_mdev For _do_programming, it is used to create a new FPGA function. For _do_create_mdev, it is used to create a new type mdev, it will call the implementation function in vendor driver.

Well, not really. It will echo a UUID into a file in sysfs, and that will trigger the vendor driver to create the mdev, but Cyborg is not actually linking to the vendor driver and invoking a C function directly.
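The sysfs write described above can be sketched as follows. This is an illustration only: create_mdev is a hypothetical helper, not the proposed _do_create_mdev, and it assumes the standard mdev layout where each type directory exposes a "create" node. On a real host it needs root privileges.

```python
import uuid
from pathlib import Path


def create_mdev(parent, mdev_type, mdev_bus="/sys/class/mdev_bus"):
    """Create one mdev of `mdev_type` on `parent` and return its UUID.

    Hypothetical helper; `mdev_bus` is a parameter so that a fake
    directory tree can stand in for sysfs in tests.
    """
    type_dir = Path(mdev_bus) / parent / "mdev_supported_types" / mdev_type
    mdev_uuid = str(uuid.uuid4())
    # Writing the UUID to "create" makes the kernel call into the vendor
    # driver; no vendor code is linked or invoked directly from here.
    (type_dir / "create").write_text(mdev_uuid)
    return mdev_uuid
```

The returned UUID is what an mdev attachment would carry to nova in place of a PCI address.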
At last we need to support a mdev handler for xml generation in nova, we can refer to the cyborg PCI handler in nova
Yes, I brought this up at the nova PTG as something we will likely need to add support for. As I was suggesting above, this can be done by adding a new attachment type, mdev, that just contains the UUID instead of the PCI address.
So after the above changes: Admin can create different SLA devices profiles such as: {"name": "Gold_vGPU", "groups": [{"resources:vGPU_BUFFERS": "16", "traits: V100D-16Q,": "required"}]} And {"name": "Iron_vGPU", "groups": [{"resources:vGPU_BUFFERS": "1", "traits: V100D-1Q,": "required"}]} Then a tenant can use Gold_vGPU to create with a VM with V100D-16Q vGPU And another tenant can use Iron_vGPU to create with a VM with V100D-1Q vGPU

It cannot do this on the same physical GPUs, but yes, that could work. Or you could do {"name": "Gold_vGPU", "groups": [{"resources:CUSTOM_V100D-16Q": "1"}]}

Currently for nova we just do resources:VGPU, and you can optionally do trait:CUSTOM_. By the way, we cannot use standard resource classes or traits for the mdev types, as these are arbitrary strings chosen by the vendor that can potentially change based on kernel or driver version. So we should not add them to os-traits or os-resource-classes, and instead should use CUSTOM_ resource classes or traits for them.
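A small Python sketch of how a vendor mdev type name could be turned into a CUSTOM_ name and used in a device profile. The helper names are hypothetical. Placement only accepts upper-case letters, digits and underscores after the CUSTOM_ prefix, so the hyphen in a name like V100D-16Q has to be mapped; mapping it to an underscore is an assumption of this sketch.

```python
import re


def custom_name(mdev_type):
    """Normalise a vendor mdev type string into a CUSTOM_ name."""
    # Upper-case, then replace anything outside A-Z, 0-9 and _ with _.
    normalized = re.sub(r"[^A-Z0-9_]", "_", mdev_type.upper())
    return "CUSTOM_" + normalized


def device_profile(name, mdev_type, count=1):
    """Build a device profile that requests `count` units of one mdev type."""
    return {
        "name": name,
        "groups": [{"resources:" + custom_name(mdev_type): str(count)}],
    }


if __name__ == "__main__":
    print(device_profile("Gold_vGPU", "V100D-16Q"))
```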
When ARQ binding during the VM creating, the Cyborg will call the vendor driver to create expected mdev vGPU. And these 2 mdev vGPU can be on same physical GPU card.
The mdev extend ARQ and vendor driver can be plugin, they are loose couple with the upstream code.
It should not be a plugin. It is a generic virtualisation attachment mode that can be used by any device. We should standardise the attachment handle in core Cyborg and add support for that attachment model in nova. We already have support for generating the mdev XML, so we would only need to wire up the handling for the attachment type in the code that currently handles the PCI attachment type.
So the downstream can get the upstream code to customize the own mdev extend ARQ and vendor driver. Here vGPU is just an example, it can be other mdev devices.
Yes. Because this can be used for other devices, I would counter-propose that we create a stateless mdev driver to cover devices that do not require programming or state management, with a config-driven interface to map mdev types to custom resource classes and/or traits. We also need to declare, per device, whether it supports independent pools of each mdev type or whether they consume the same resource, i.e. does it work like NVIDIA's V100s, where you can only have one mdev type per physical device, or does it allow one device to expose multiple mdev types that can be consumed in parallel. Both approaches are valid, although I personally prefer independent pools since that is easier to reason about. You could also support a non-config-driven approach, where we use attributes on the deployable to describe the mapping of the mdev type to resource class and whether it is independently consumable too, I guess, but that seems much more cumbersome to manage.
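As a rough sketch of what such a config-driven whitelist could look like once parsed, here is a small Python example. Everything in it is hypothetical (the MdevDeviceConfig fields, the whitelisted function, the shape of the discovered data); no such option exists in Cyborg today.

```python
from dataclasses import dataclass, field


@dataclass
class MdevDeviceConfig:
    """One hypothetical whitelist entry for a parent device."""
    address: str
    # Enabled mdev types and the custom resource class each maps to.
    type_to_rc: dict = field(default_factory=dict)
    # True: each mdev type is its own pool. False: the types share one
    # resource (as on the V100, one active type per physical GPU).
    independent_pools: bool = True


def whitelisted(discovered, configs):
    """Keep only the devices and mdev types the operator has enabled.

    `discovered` maps device address -> {mdev type: available count};
    the result maps device address -> {resource class: available count}.
    """
    by_address = {cfg.address: cfg for cfg in configs}
    result = {}
    for address, types in discovered.items():
        cfg = by_address.get(address)
        if cfg is None:
            continue  # device not whitelisted, e.g. reserved for the host
        result[address] = {
            cfg.type_to_rc[t]: count
            for t, count in types.items()
            if t in cfg.type_to_rc
        }
    return result
```

The independent_pools flag is only carried here; a driver would use it to decide whether allocating one type must reserve the inventories of the others.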
BR Shaohe Feng
-----Original Message-----
From: Sean Mooney <smooney@redhat.com>
Sent: June 23, 2020 19:48
To: Feng, Shaohe <shaohe.feng@intel.com>; openstack-discuss@lists.openstack.org
Cc: yumeng_bao@yahoo.com; shhfeng@126.com
Subject: Re: [cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)

On Tue, 2020-06-23 at 05:50 +0000, Feng, Shaohe wrote:
In order to support it, admin should plan ahead and configure the vGPU before deployable as follow: https://docs.openstack.org/nova/latest/configuration/config.html#devices.enabled_vgpu_types This is very inconvenient for the administrator, this method has a limitation that a same PCI address does not provide two different types.
We did briefly discuss whether to add generic mdev support to nova in a future release (W) or to delegate that to Cyborg, but I would hesitate to do that if Cyborg continues its current design, where drivers have no configuration element to whitelist devices, as it makes it much harder for deployment tools to properly configure a host declaratively.

[Feng, Shaohe] We do support config; for example, in our demo for FPGA pre-programming we support config for our new drivers. Also, for other accelerators, the infrastructure itself may need accelerators, not only the VMs. For example, cinder can use QAT for compression/crypto, and a VM can also use QAT. We need to configure which QATs are for infra and which are for VMs.
In this specific example that would not be a good idea. The V100 does not support mixing mdev types on the same GPU.

[Feng, Shaohe] Oh, really sorry, I should have chosen a better example.
Here is an example for fpga ext arq: https://review.opendev.org/#/c/681005/26/cyborg/objects/extarq/fpga_ext_arq.py@206
So we should not add them to os-traits or os-resource-classes, and instead should use CUSTOM_ resource classes or traits for them.

[Feng, Shaohe] Yes, use CUSTOM_ for them.
It should not be a plugin. It is a generic virtualisation attachment mode that can be used by any device. We should standardise the attachment handle in core Cyborg and add support for that attachment model in nova.

[Feng, Shaohe] Yes, we will support a standard generic ARQ. We only extend the ARQ for some special accelerators; FPGA is an example.
You could also support a non-config-driven approach, where we use attributes on the deployable to describe the mapping of the mdev type to resource class and whether it is independently consumable too, I guess, but that seems much more cumbersome to manage.

[Feng, Shaohe] Both the stateless mdev driver and the non-config-driven approach can be supported. If these cannot satisfy users, they can add their own special mdev driver.
BR Shaohe Feng
On Tue, 2020-06-23 at 12:25 +0000, Feng, Shaohe wrote:
[Feng, Shaohe] We do support config; for example, in our demo for FPGA pre-programming we support config for our new drivers. Also, for other accelerators, the infrastructure itself may need accelerators, not only the VMs. For example, cinder can use QAT for compression/crypto, and a VM can also use QAT. We need to configure which QATs are for infra and which are for VMs.
Yes, QAT is a good example of where sharing between host usage (the cinder service) and guest usage (VMs) could be required. The same could be true of GPUs: typically many servers run headless, but not always, and sometimes you will want to reserve a GPU for the host to use. NICs are another good example. When we look at generic PCI passthrough, we need to select which NICs will be used by the VMs and which will be used for host connectivity or for hardware-offloaded OVS.
[Feng, Shaohe] Oh, really sorry, I should have chosen a better example.
The sample mdpy kernel module, which creates a basic virtual graphics device, supports multiple mdev types for different resolutions: https://github.com/torvalds/linux/blob/f97c81dc6ca5996560b3944064f63fc87eb18... I believe it also supports consuming each mdev type independently. So if you don't want to use real hardware as an example, there are at least sample devices that support having multiple active mdevs. I would also suggest we use this device for testing in the upstream gate.

I started creating a job to test nova's vGPU support with this sample device a few months back, but we need to make a few small changes to make it work, and I was not sure it was appropriate to modify the nova code just to get CI working with a fake device. Currently we make one assumption, that the parent of the mdev is a PCI device, which is not true in the kernel module case. But from a Cyborg perspective you can avoid that mistake, since mdevs can be created for devices on any bus, like USB or UPI, as well as PCIe.
[Feng, Shaohe] Yes, we will support a standard generic ARQ. We only extend the ARQ for some special accelerators; FPGA is an example.
I am not sure we need to extend the ARQ for FPGA, but perhaps at some point. Nova does not support plugins in general, so as long as the info we need to request or receive does not vary based on Cyborg plugins I guess it could be OK, but outside of the driver layer I would personally avoid introducing plugins to Cyborg.
[Feng, Shaohe] Both the stateless mdev driver and the non-config-driven approach can be supported. If these cannot satisfy users, they can add their own special mdev driver.

Well, if I put my downstream hat on: in terms of productization, we are currently debating whether we want to include Cyborg in a future release of Red Hat OpenStack Platform. At the moment it is not planned for our next major release, which is OSP 17, and it is in consideration for OSP 18. One of the concerns we have with adding Cyborg to a future release is the lack of a config-driven approach. It is not a blocker, but an API-only approach, while it has some advantages, also has several drawbacks, not least of which are supportability and day 1 and day 2 operational complexity.

For example, we have a tool that customers use when reporting bugs, called sosreport, which automates the collection of logs, config and other system information. It can also invoke some commands like virsh list etc., but adding a new config and log to collect is significantly less work than adding a new command that needs to discover the node UUID and then query the placement and Cyborg APIs to determine what accelerators are available and how they are configured. So API-only configured services are harder for distros to support from a day 2 perspective when things go wrong. From a day 1 perspective, it is also simpler for installer tools to template out a config file than it is to invoke commands imperatively against an API.

So I understand that, from an operator's perspective, invoking an API to do all config management remotely might be quite appealing, but it comes with trade-offs. Upstream should not be governed by downstream distro concerns, but we should be mindful not to make it overly hard to deploy and manage the service, as that raises the barrier to integration, and if that is too high it can result in not being supported in the long run. That said, having a service that you just deploy, with no configuration to do, would be nice too. But if we tell our users that after they deploy Cyborg they must then iterate over every deployable and decide whether they should enable it and what attributes to add to it to schedule correctly, I think the complexity might be too high for many.
BR Shaohe Feng
-----Original Message-----
From: Sean Mooney <smooney@redhat.com>
Sent: June 23, 2020 21:38
To: Feng, Shaohe <shaohe.feng@intel.com>; openstack-discuss@lists.openstack.org
Cc: yumeng_bao@yahoo.com; shhfeng@126.com
Subject: Re: [cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)

On Tue, 2020-06-23 at 12:25 +0000, Feng, Shaohe wrote:

-----Original Message-----
From: Sean Mooney <smooney@redhat.com>
Sent: June 23, 2020 19:48
To: Feng, Shaohe <shaohe.feng@intel.com>; openstack-discuss@lists.openstack.org
Cc: yumeng_bao@yahoo.com; shhfeng@126.com
Subject: Re: [cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)

On Tue, 2020-06-23 at 05:50 +0000, Feng, Shaohe wrote:
Hi all,
Currently OpenStack supports vGPU as follows: https://docs.openstack.org/nova/latest/admin/virtual-gpu.html
To support it, the admin has to plan ahead and configure the vGPU types before deployment, as follows: https://docs.openstack.org/nova/latest/configuration/config.html#devices.enabled_vgpu_types This is very inconvenient for the administrator. The method also has a limitation: the same PCI address cannot provide two different types.
That is a matter of perspective. There are those who prefer to check all of their configuration into git and have a declarative deployment, and those who wish to drive everything via the API. For the former, having to configure available resources via Cyborg's API or Placement would be considered very inconvenient.
Cyborg as an accelerator management tool is more suitable for mdev device management.
Maybe, but maybe not. I do think that Cyborg should support mdevs; I do not think we should have a dedicated vGPU mdev driver, however. I think we should create a stateless mdev driver that uses a similar whitelist of allowed mdev types and devices.
We did briefly discuss adding generic mdev support to Nova in a future release (W), or whether we should delegate that to Cyborg, but I would hesitate to do that if Cyborg continues its current design where drivers have no configuration element to whitelist devices, as that makes it much harder for deployment tools to properly configure a host declaratively.
[Feng, Shaohe] We do support config, for example in our demo for FPGA pre-programming, and we support config for our new drivers. As for other accelerators, the infra may also need accelerators for its own use, not only the VMs. For example, Cinder can use QAT for compression/crypto, and VMs can also use QAT. We need to configure which QATs are for infra and which are for VMs.
Yes, QAT is a good example of where sharing between host usage (the Cinder service) and guest usage (VMs) could be required. The same could be true of GPUs. Typically many servers run headless, but not always, and sometimes you will want to reserve a GPU for the host to use. NICs are another good example: when we look at generic PCI passthrough we need to select which NICs will be used by the VMs and which will be used for host connectivity or for hardware-offloaded OVS.
One solution is as follows:
First, we need a vendor driver (this can be a plugin) that discovers its special devices and reports them to Placement for scheduling. The differences from the current implementation are:

1. Report the mdev_supported_types as traits on the resource provider.
   How to discover a GPU type:
   $ ls /sys/class/mdev_bus/*/mdev_supported_types
   /sys/class/mdev_bus/0000:84:00.0/mdev_supported_types:
   nvidia-35 nvidia-36 nvidia-37 nvidia-38 nvidia-39 nvidia-40 nvidia-41 nvidia-42 nvidia-43 nvidia-44 nvidia-45
   So here we report nvidia-3*, nvidia-4* as traits on the resource provider.

2. Report the number of allocatable resources, instead of vGPU unit numbers, to the resource provider inventory.
   Example for the NVIDIA V100 PCIe card (one GPU per board):

   Virtual GPU Type   Frame Buffer (GB)   Maximum vGPUs per GPU   Maximum vGPUs per Board
   V100D-32Q          32                  1                       1
   V100D-16Q          16                  2                       2
   V100D-8Q           8                   4                       4
   V100D-4Q           4                   8                       8
   V100D-2Q           2                   16                      16
   V100D-1Q           1                   32                      32

   So here we report 32G of buffer (just an example; it could be other resources) to the resource provider inventory.
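The discovery step in point 1 can be sketched in Python rather than with ls. This is only an illustration: the function names are made up for this example, the sysfs root is a parameter so the code can be exercised against a fake tree, and it relies on each type directory exposing an available_instances attribute, as the kernel mdev sysfs interface does.

```python
import os


def discover_mdev_types(mdev_bus_root="/sys/class/mdev_bus"):
    """Return {parent_device: sorted list of supported mdev type names}."""
    result = {}
    if not os.path.isdir(mdev_bus_root):
        # No mdev-capable parent devices on this host.
        return result
    for parent in sorted(os.listdir(mdev_bus_root)):
        types_dir = os.path.join(mdev_bus_root, parent, "mdev_supported_types")
        if os.path.isdir(types_dir):
            result[parent] = sorted(os.listdir(types_dir))
    return result


def available_instances(mdev_bus_root, parent, mdev_type):
    """Read how many more mdevs of this type the parent can still create."""
    path = os.path.join(mdev_bus_root, parent, "mdev_supported_types",
                        mdev_type, "available_instances")
    with open(path) as f:
        return int(f.read().strip())
```

A driver could call discover_mdev_types() once per periodic report and turn each type name into a trait or a resource class; how the names are mapped is a separate decision.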
In this specific example that would not be a good idea. The V100 does not support mixing mdev types on the same GPU, so if you allocate a V100D-16Q instance using 16G of the buffer, you cannot then allocate two V100D-8Q vGPU instances to consume the remaining 16G. Other mdev-based devices may not have this limitation, but NVIDIA only supports one active mdev type per physical GPU.
Note that the Ampere generation has a dynamic SR-IOV-based multi-instance GPU technology, which kind of allows resource-based subdivision of the device, but it does not quite work the way you are describing above.
So you can report inventories of custom resource classes for each of the mdev types, or a single inventory of VGPU with traits modelling the available mdevs.
With the trait approach, before a vGPU is allocated you report all traits, and the total count for the inventory would be the highest amount, e.g. 32 in the case above. Then when a GPU is allocated you need to update the reserved value and remove the other traits.
[Feng, Shaohe] Since the V100 does not support mixing mdev types, the other traits need to be removed. So is there any suggestion on how a generic driver can support both mixed-type and single-type mdev devices?
If you have one inventory per mdev type, then you set reserved = total on all the inventories for the other mdev types, with no need for traits.
[Feng, Shaohe] Oh, really sorry, I should have chosen a better example.
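The reserved = total idea can be written down as a small sketch. The helper names and the CUSTOM_ resource class names below are hypothetical; only the total and reserved fields correspond to real Placement inventory fields, and a real driver would push the result through the Placement API rather than return a dict.

```python
def build_inventories(max_instances):
    """Create one inventory per mdev type from {resource_class: max count}."""
    return {rc: {"total": total, "reserved": 0}
            for rc, total in max_instances.items()}


def reserve_other_types(inventories, active_rc):
    """For a single-type device, block every type except the active one.

    Once an mdev of active_rc exists on the physical device, all other
    inventories get reserved == total so the scheduler stops offering them.
    """
    updated = {}
    for rc, inv in inventories.items():
        reserved = 0 if rc == active_rc else inv["total"]
        updated[rc] = {"total": inv["total"], "reserved": reserved}
    return updated
```

For a device with independent pools the second step is simply skipped, which is one reason that case is easier to reason about.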
The sample mdpy kernel module, which creates a basic virtual graphics device, supports multiple mdev types for different resolutions: https://github.com/torvalds/linux/blob/f97c81dc6ca5996560b3944064f63fc87eb18... I believe it also supports consuming each mdev type independently. So if you don't want to use real hardware as an example, there are at least sample devices that support having multiple active mdevs. I would also suggest we use this device for testing in the upstream gate. I started creating a job to test Nova's vGPU support with this sample device a few months back, but we need to make a few small changes to make it work, and I was not sure it was appropriate to modify the Nova code just to get CI working with a fake device. Currently we make one assumption, that the parent of the mdev is a PCI device, which is not true in the kernel module case. But from a Cyborg perspective you can avoid that mistake, since mdevs can be created for devices on any bus, like USB or UPI, as well as PCIe.
[Feng, Shaohe] Good suggestion.
3. The driver should also support a function to create a certain mdev type, such as V100D-1Q or V100D-2Q.

Second, we need an mdev extended ARQ (it can be a plugin):
No, it should not be a plugin; it should just be another attachment type.
Here is an example for the FPGA ext ARQ: https://review.opendev.org/#/c/681005/26/cyborg/objects/extarq/fpga_ext_arq.py@206 The difference is that we replace _do_programming with _do_create_mdev. _do_programming is used to create a new FPGA function. _do_create_mdev is used to create an mdev of a new type; it will call the implementation function in the vendor driver.
Well, not really. It will echo a UUID into a file in sysfs, which will trigger the vendor driver to create the mdev, but Cyborg is not actually linking to the vendor driver and invoking a C function directly.
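To make the "echo a UUID into sysfs" step concrete, here is a minimal sketch. The function name and the mdev_bus_root parameter are assumptions for illustration; the create file under each supported type directory is the kernel's mdev interface. On a real host the write needs root privileges and the parent device and type shown in the comments are only examples.

```python
import os
import uuid


def create_mdev(parent, mdev_type, mdev_uuid=None,
                mdev_bus_root="/sys/class/mdev_bus"):
    """Ask the kernel to create an mdev by writing a UUID to the create file.

    The vendor kernel driver reacts to the write; nothing here links to or
    calls into the vendor driver directly.
    """
    mdev_uuid = mdev_uuid or str(uuid.uuid4())
    create_path = os.path.join(mdev_bus_root, parent, "mdev_supported_types",
                               mdev_type, "create")
    with open(create_path, "w") as f:
        f.write(mdev_uuid)
    return mdev_uuid
```

Usage would look like create_mdev("0000:84:00.0", "nvidia-35"), with the returned UUID being what the attachment later refers to.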
Lastly, we need to support an mdev handler for XML generation in Nova; we can refer to the Cyborg PCI handler in Nova.
Yes, I brought this up at the Nova PTG as something we will likely need to add support for. As I was suggesting above, this can be done by adding a new attachment type, mdev, that just contains the UUID instead of the PCI address.
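As a rough sketch of what an mdev attachment boils down to on the libvirt side: the only device-specific datum is the UUID. The helper name below is hypothetical and this is not Nova's actual code; the XML shape (a hostdev of type mdev with model vfio-pci and a source address carrying the UUID) follows the libvirt domain XML format.

```python
import uuid
import xml.etree.ElementTree as ET


def mdev_hostdev_xml(mdev_uuid):
    """Build the libvirt <hostdev> element for an existing mdev UUID."""
    uuid.UUID(mdev_uuid)  # raises ValueError if the string is not a UUID
    hostdev = ET.Element(
        "hostdev", {"mode": "subsystem", "type": "mdev", "model": "vfio-pci"})
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(source, "address", {"uuid": mdev_uuid})
    return ET.tostring(hostdev, encoding="unicode")
```

Compared with a PCI attachment, where the handle carries domain, bus, device and function, an mdev handle would only need to carry this one UUID.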
So after the above changes, the admin can create different SLA device profiles such as:

    {"name": "Gold_vGPU",
     "groups": [
         {"resources:vGPU_BUFFERS": "16",
          "traits:V100D-16Q": "required"}]
    }

and

    {"name": "Iron_vGPU",
     "groups": [
         {"resources:vGPU_BUFFERS": "1",
          "traits:V100D-1Q": "required"}]
    }

Then a tenant can use Gold_vGPU to create a VM with a V100D-16Q vGPU, and another tenant can use Iron_vGPU to create a VM with a V100D-1Q vGPU.
It cannot do this on the same physical GPUs, but yes, that could work. Or you could do:

    {"name": "Gold_vGPU", "groups": [{"resources:CUSTOM_V100D-16Q": "1"}]}

Currently for Nova we just do resources:VGPU, and you can optionally do trait:CUSTOM_
By the way, we cannot use standard resource classes or traits for the mdev types, as these are arbitrary strings chosen by the vendor that can potentially change based on the kernel or driver version. So we should not add them to os-traits or os-resource-classes, and should instead use CUSTOM_ resource classes or traits for them.
[Feng, Shaohe] Yes, we will use CUSTOM_ for them.
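One practical detail when mapping vendor strings to CUSTOM_ names: Placement only accepts custom names made of uppercase letters, digits and underscores after the CUSTOM_ prefix, so a string like V100D-16Q has to be normalized first. A small sketch, with hypothetical helper names and a device profile shaped like the examples in this thread:

```python
import re


def to_custom_name(vendor_string):
    """Turn an arbitrary vendor type string into a CUSTOM_ name."""
    normalized = re.sub(r"[^A-Z0-9_]", "_", vendor_string.upper())
    return "CUSTOM_" + normalized


def device_profile(name, mdev_type, count=1):
    """Build a device profile requesting `count` units of one mdev type."""
    resource_key = "resources:" + to_custom_name(mdev_type)
    return {"name": name, "groups": [{resource_key: str(count)}]}
```

With this, device_profile("Gold_vGPU", "V100D-16Q") requests one unit of CUSTOM_V100D_16Q, and the same normalization works for sysfs names such as nvidia-35.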
During ARQ binding at VM creation, Cyborg will call the vendor driver to create the expected mdev vGPU. And these two mdev vGPUs can be on the same physical GPU card.
The mdev extended ARQ and the vendor driver can be plugins; they are loosely coupled with the upstream code.
It should not be a plugin; it is a generic virtualisation attachment model that can be used by any device. We should standardise the attachment handle in core Cyborg and add support for that attachment model in Nova. We already have support for generating the mdev XML, so we would only need to wire up the handling for the attachment type in the code that currently handles the PCI attachment type.
[Feng, Shaohe] Yes, we will support a standard generic ARQ, and only extend the ARQ for some special accelerators; FPGA is an example.
I am not sure we need to extend the ARQ for FPGA, but perhaps at some point. Nova does not support plugins in general, so as long as the info we need to request or receive does not vary based on Cyborg plugins, I guess it could be OK, but outside of the driver layer I would personally avoid introducing plugins to Cyborg.
So downstream can take the upstream code and customize its own mdev extended ARQ and vendor driver. Here vGPU is just an example; it can be other mdev devices.
Yes, and because this can be used for other devices, I would counter-propose that we create a stateless mdev driver to cover devices that do not require programming or state management, with a config-driven interface to map mdev types to custom resource classes and/or traits. We also need to declare per device whether it supports independent pools of each mdev type or whether they consume the same resource, i.e. does it work like NVIDIA's V100s, where you can only have one mdev type per physical device, or does it work like the sample device, where you can have one device and multiple mdev types that can be consumed in parallel.
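To illustrate what such a config-driven interface might look like, here is a purely hypothetical sketch. The section naming, the option names independent_pools and types, the device names and the type-to-class pairs are all invented for this example; Cyborg has no such options today.

```python
import configparser

# Hypothetical config: one section per parent device, a flag saying whether
# mdev types draw from independent pools, and a type -> resource class map.
SAMPLE_CONFIG = """
[device 0000:84:00.0]
independent_pools = false
types = nvidia-35=CUSTOM_NVIDIA_35, nvidia-36=CUSTOM_NVIDIA_36

[device mdpy]
independent_pools = true
types = mdpy-vga=CUSTOM_MDPY_VGA, mdpy-xga=CUSTOM_MDPY_XGA
"""


def load_mdev_mapping(text):
    """Parse the whitelist into {device: {independent_pools, types}}."""
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.read_string(text)
    mapping = {}
    for section in parser.sections():
        if not section.startswith("device "):
            continue
        device = section[len("device "):]
        types = {}
        for item in parser.get(section, "types").split(","):
            mdev_type, resource_class = item.strip().split("=", 1)
            types[mdev_type.strip()] = resource_class.strip()
        mapping[device] = {
            "independent_pools": parser.getboolean(section,
                                                   "independent_pools"),
            "types": types,
        }
    return mapping
```

A file like this can be templated by an installer and collected by support tooling, which is the day 1 and day 2 argument for keeping a config option alongside any API.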
Both approaches are valid, although I personally prefer when they are independent pools, since that is easier to reason about.
Hi Yumeng and Xin-ran,
Not sure whether you noticed that Sean Mooney brought up Nova support for an mdev attachment type at the Nova PTG, as follows:
Yes, I brought this up at the Nova PTG as something we will likely need to add support for. As I was suggesting above, this can be done by adding a new attachment type, mdev, that just contains the UUID instead of the PCI address.
IMHO, Cyborg can support an mdev fake driver (similar to the current FPGA fake driver) for the mdev attachment type support in Nova. Or maybe we can extend the current fake driver to support both mdev and PCI devices.

BR
Shaohe
On Thu, 2020-06-25 at 00:31 +0000, Feng, Shaohe wrote:
Hi Yumeng and Xin-ran,
Not sure whether you noticed that Sean Mooney brought up Nova support for an mdev attachment type at the Nova PTG, as follows:
Yes, I brought this up at the Nova PTG as something we will likely need to add support for. As I was suggesting above, this can be done by adding a new attachment type, mdev, that just contains the UUID instead of the PCI address.
IMHO, Cyborg can support an mdev fake driver (similar to the current FPGA fake driver) for the mdev attachment type support in Nova. Or maybe we can extend the current fake driver to support both mdev and PCI devices.

Cyborg could support a fake mdev driver, yes. However, I do think adding support to deploy with the mdpy driver also makes sense. VMs booted with a GPU backed by an mdpy mdev actually get a functional frame buffer, and you can view it in the default VNC console. I have manually verified this.

I think this is out of scope for Cyborg, however. What I was planning to do, if I ever get the time, is to create an mdpy devstack plugin that would compile and deploy the kernel module. With that we can just add the plugin to a Zuul job and then whitelist the mdev types either in Nova or Cyborg to do testing with it and the proposed stateless mdev driver. In parallel, if the fake driver can be extended to support mdev attachments, we can also use that as another way to validate the interaction. The difference between the two approaches is that using the mdpy or mtty kernel modules would allow us to actually use the mdev attachment and add the mdev to a VM, whereas with the fake driver approach we would not add the fake mdev to the libvirt XML or to the VM. The mdpy and mtty sample kernel modules are intended for testing the mdev framework without any hardware requirement, so I think that is perfect for our CI system. It is more work than just extending the fake driver, but it has several benefits.
BR Shaohe
-----Original Message----- From: Sean Mooney <smooney@redhat.com> Sent: 2020年6月23日 21:38 To: Feng, Shaohe <shaohe.feng@intel.com>; openstack-discuss@lists.openstack.org Cc: yumeng_bao@yahoo.com; shhfeng@126.com Subject: Re: [cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)
On Tue, 2020-06-23 at 12:25 +0000, Feng, Shaohe wrote:
-----Original Message----- From: Sean Mooney <smooney@redhat.com> Sent: 2020年6月23日 19:48 To: Feng, Shaohe <shaohe.feng@intel.com>; openstack-discuss@lists.openstack.org Cc: yumeng_bao@yahoo.com; shhfeng@126.com Subject: Re: [cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)
On Tue, 2020-06-23 at 05:50 +0000, Feng, Shaohe wrote:
Hi all,
Currently openstack support vGPU as follow: https://docs.openstack.org/nova/latest/admin/virtual-gpu.html
In order to support it, admin should plan ahead and configure the vGPU before deployable as follow: https://docs.openstack.org/nova/latest/configuration/config.html#dev ic es.enabled_vgpu_types This is very inconvenient for the administrator, this method has a limitation that a same PCI address does not provide two different types.
that is a matter of perspective there are those that prefer to check all fo there configuration into git and have a declaritive deployment and those that wish to drive everything via the api for the latter having to confiugre availbe resouce via cybogs api or placemtn would be consider very inconvenient.
Cyborg as an accelerator management tool is more suitable for mdev device management.
maybe but maybe not. i do think that cyborg should support mdevs i do not think we should have a dedicated vgpu mdev driver however. i think we should crate a stateless mdev driver that uses a similar whitelist of allowed mdevtypes and devices.
we did breifly discuss adding generic mdev support to nova in a future releae (w) or if we shoudl delegate that to cyborg but i would hesitate to do that if cyborg continutes its current design where driver have no configuraton element to whitelist device as it makes it much harder for deployment tools to properly configure a host declaritivly. [Feng, Shaohe] We did support config, such as our demo for fpga pre-program, we support config for our new drivers. And such as other accelerators, maybe the infra also need accelerators for acceleration not only VM needs. For example, cinder can use QAT for compress/crypto, and VM also can QAT. We need to configure which QATs are for infra and which for VMs.
yes qat is a good examlple of where shareign between host useage(cinder service) and guest usage(vms) could be required the sam could be true of gpus. typically many servers run headleas but not always and sometime you will want to resrve a gpu for the host to use. nics are another good example wehen we look at generic cpi passthough we need to select whic nics will be used by the vms and which will be used for host connectivity or fo hardware offloed ovs.
One solution as follow: Firstly, we need a vender driver(this can be a plugin), it is used to discovery its special devices and report them to placement for schedule. The difference from the current implementation is that: 1. report the mdev_supported_types as traits to resource provider. How to discover a GPU type: $ ls /sys/class/mdev_bus/*/mdev_supported_types /sys/class/mdev_bus/0000:84:00.0/mdev_supported_types: nvidia-35 nvidia-36 nvidia-37 nvidia-38 nvidia-39 nvidia-40 nvidia-41 nvidia-42 nvidia-43 nvidia-44 nvidia- 45 so here we report nvidia-3*, nvidia-4* as traits to resource provider. 2. Report the number of allocable resources instead of vGPU unit numbers to resource provider inventory Example for the NVidia V100 PCIe card (one GPU per board) : Virtual GPU Type Frame Buffer (Gbytes) Maximum vGPUs per GPU Maximum vGPUs per Board V100D-32Q 32 1 1 V100D-16Q 16 2 2 V100D-8Q 8 4 4 V100D-4Q 4 8 8 V100D-2Q 2 16 16 V100D-1Q 1 32 32 so here we report 32G Buffers(an example, maybe other resources) to resource provider inventory
In this specific example that would not be a good idea. The V100 does not support mixing mdev types on the same GPU, so if you allocate a V100D-16Q instance using 16G of the buffer, you cannot then allocate 2 V100D-8Q vGPU instances to consume the remaining 16G. Other mdev-based devices may not have this limitation, but NVIDIA only supports 1 active mdev type per physical GPU.
Note that the Ampere generation has a dynamic SR-IOV based multi-instance GPU technology which kind of allows resource-based subdivision of the device, but it does not quite work the way you are describing above.
So you can report inventories of custom resource classes for each of the mdev types, or a single inventory of VGPU with traits modelling the available mdevs.
With the trait approach, before a vGPU is allocated you report all traits, and the total count for the inventory would be the highest amount, e.g. 32 in the case above. Then, when a GPU is allocated, you need to update the reserved value and remove the other traits.
[Feng, Shaohe] Since the V100 does not support mixing mdev types, we need to remove the other traits. So is there any suggestion about how a generic driver can support both mixed-type and single-type mdev devices?
If you have 1 inventory per mdev type then you set reserved = total for all inventories of the other mdev types, but there is no need for traits. [Feng, Shaohe] Oh, really sorry, I should have chosen a better example.
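To make the "1 inventory per mdev type, reserved = total for the others" idea concrete, here is a minimal sketch. The function name `build_inventories` is hypothetical, and the dict keys are plain mdev type names rather than real placement resource class names; only the `total` and `reserved` fields of a placement inventory are modelled.

```python
def build_inventories(max_instances, active_type=None):
    """Build per-type inventories for a device with one active mdev type.

    max_instances: {mdev_type: maximum instances of that type per GPU}
    active_type:   the type already allocated on this GPU, or None.

    Once a type is active, every other type is blocked by setting
    reserved equal to total, so nothing can be scheduled against it.
    """
    inventories = {}
    for mdev_type, total in max_instances.items():
        blocked = active_type is not None and mdev_type != active_type
        inventories[mdev_type] = {
            "total": total,
            "reserved": total if blocked else 0,
        }
    return inventories


# Usage with three of the V100 types from the table above.
v100 = {"V100D-32Q": 1, "V100D-16Q": 2, "V100D-8Q": 4}
after_first_16q = build_inventories(v100, active_type="V100D-16Q")
```

For a device whose types are independent pools, the same function can simply always be called with `active_type=None`.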
The sample mdpy kernel module, which creates a basic virtual graphics device, supports multiple mdev types for different resolutions: https://github.com/torvalds/linux/blob/f97c81dc6ca5996560b3944064f63fc87eb18... I believe it also supports consuming each mdev type independently. So if you don't want to use real hardware as an example, there are at least sample devices that support having multiple active mdevs. I would also suggest we use this device for testing in the upstream gate.
I started creating a job to test nova's vGPU support with this sample device a few months back, but we need to make a few small changes to make it work, and I was not sure it was appropriate to modify the nova code just to get CI working with a fake device. Currently we make 1 assumption, that the parent of the mdev is a PCI device, which is not true in the kernel module case.
But from a cyborg perspective you can avoid that mistake, since mdevs can be created for devices on any bus, such as USB or UPI as well as PCIe. [Feng, Shaohe] Good suggestion.
3. The driver should also support a function to create a certain mdev type, such as (V100D-1Q, V100D-2Q). Secondly, we need an mdev extend ARQ (it can be a plugin):
No, it should not be a plugin; it should just be another attachment type.
Here is an example for the FPGA ext ARQ: https://review.opendev.org/#/c/681005/26/cyborg/objects/extarq/fpga_ext_arq.py@206 The difference is that we replace _do_programming with _do_create_mdev. _do_programming is used to create a new FPGA function. _do_create_mdev is used to create a new mdev of a given type; it will call the implementation function in the vendor driver.
Well, not really. It will echo a UUID into a file in sysfs, which will trigger the vendor driver to create the mdev, but cyborg is not actually linking to the vendor driver and invoking a C function directly.
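For illustration, "echo a UUID into a file in sysfs" amounts to something like the sketch below. The `create` file under `mdev_supported_types/<type>/` is the standard kernel mdev interface; the helper name `create_mdev` and its `sysfs_root` parameter are assumptions for this example, and on a real host the write needs root privileges.

```python
import os
import uuid


def create_mdev(parent, mdev_type, mdev_uuid=None,
                sysfs_root="/sys/class/mdev_bus"):
    """Ask the kernel to create an mdev by writing a UUID to sysfs.

    Hypothetical helper. The vendor driver reacts to the write; no
    vendor code is called directly from Python.
    """
    mdev_uuid = mdev_uuid or str(uuid.uuid4())
    create_path = os.path.join(
        sysfs_root, parent, "mdev_supported_types", mdev_type, "create")
    with open(create_path, "w") as f:
        f.write(mdev_uuid)
    return mdev_uuid
```

The caller gets the UUID back, which is the only thing an mdev attachment type would need to carry.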
Finally, we need to support an mdev handler for XML generation in nova; we can refer to the cyborg PCI handler in nova.
Yes, I brought this up at the nova PTG as something we will likely need to add support for. As I was suggesting above, this can be done by adding a new attachment type, mdev, that just contains the UUID instead of the PCI address.
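As a rough idea of what the XML side needs, libvirt describes an mdev as a `hostdev` element of type `mdev` whose source address is just the UUID. The sketch below builds that element with the standard library only; it is not nova's actual config-object code, and the function name `mdev_hostdev_xml` is invented for the example.

```python
import xml.etree.ElementTree as ET


def mdev_hostdev_xml(mdev_uuid):
    """Return a libvirt <hostdev> element (as a string) for an mdev.

    Only the UUID is needed, in contrast to a PCI attachment, which
    needs domain, bus, slot and function.
    """
    hostdev = ET.Element(
        "hostdev", mode="subsystem", type="mdev", model="vfio-pci")
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(source, "address", uuid=mdev_uuid)
    return ET.tostring(hostdev, encoding="unicode")


xml_snippet = mdev_hostdev_xml("83b8f4f2-509f-382f-3c1e-e6bfe0fa1001")
```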
So after the above changes, the admin can create device profiles for different SLAs, such as:
{"name": "Gold_vGPU", "groups": [{"resources:vGPU_BUFFERS": "16", "traits: V100D-16Q,": "required"}]}
and
{"name": "Iron_vGPU", "groups": [{"resources:vGPU_BUFFERS": "1", "traits: V100D-1Q,": "required"}]}
Then a tenant can use Gold_vGPU to create a VM with a V100D-16Q vGPU, and another tenant can use Iron_vGPU to create a VM with a V100D-1Q vGPU.
It cannot do this on the same physical GPU, but yes, that could work. Or you could do {"name": "Gold_vGPU", "groups": [{"resources:CUSTOM_V100D-16Q": "1"}]}
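A small sketch of how such profiles could be generated, following the "one custom resource class per mdev type" variant. The helper `device_profile` is hypothetical, and the resource class name `CUSTOM_V100D_16Q` is written with an underscore on the assumption that placement does not accept a hyphen in custom names.

```python
import json


def device_profile(name, resource_class, count=1, required_traits=()):
    """Build a device-profile body with a single request group.

    Hypothetical helper; values are strings, mirroring the examples
    in this thread.
    """
    group = {"resources:" + resource_class: str(count)}
    for trait in required_traits:
        group["trait:" + trait] = "required"
    return {"name": name, "groups": [group]}


gold = device_profile("Gold_vGPU", "CUSTOM_V100D_16Q")
gold_json = json.dumps(gold)
```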
Currently for nova we just do resources:VGPU and you can optionally do trait:CUSTOM_
By the way, we cannot use standard resource classes or traits for the mdev types, as these are arbitrary strings chosen by the vendor that can potentially change based on kernel or driver version. So we should not add them to os-traits or os-resource-classes, and instead should use CUSTOM_ resource classes or traits for them. [Feng, Shaohe] Yes, use CUSTOM_ for them.
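Since the vendor strings are arbitrary, something has to map them onto valid CUSTOM_ names. A minimal sketch, assuming the placement rule that custom names consist of `CUSTOM_` followed by upper-case letters, digits and underscores; the function name `custom_name` is made up here.

```python
import re


def custom_name(mdev_type):
    """Map a vendor mdev type string to a CUSTOM_ name.

    Upper-cases the string and replaces every character outside
    A-Z, 0-9 and underscore with an underscore.
    """
    return "CUSTOM_" + re.sub(r"[^A-Z0-9_]", "_", mdev_type.upper())
```

With this mapping `nvidia-35` becomes `CUSTOM_NVIDIA_35` and `V100D-16Q` becomes `CUSTOM_V100D_16Q`. Note that two different vendor strings could collapse to the same name, so a real driver would probably want an explicit mapping in config instead.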
During ARQ binding at VM creation, Cyborg will call the vendor driver to create the expected mdev vGPU. And these 2 mdev vGPUs can be on the same physical GPU card.
The mdev extend ARQ and vendor driver can be plugins; they are loosely coupled with the upstream code.
It should not be a plugin. It is a generic virtualisation attachment model that can be used by any device. We should standardise the attachment handle in core cyborg and add support for that attachment model in nova. We already have support for generating the mdev XML, so we would only need to wire up the handling for the attachment type in the code that currently handles the PCI attachment type. [Feng, Shaohe] Yes, we will support a standard generic ARQ. We would only extend ARQ for some special accelerators; FPGA is an example.
I am not sure we need to extend ARQ for FPGA, but perhaps at some point. Nova does not support plugins in general, so as long as the info we need to request or receive does not vary based on cyborg plugins I guess it could be OK, but outside of the driver layer I would personally avoid introducing plugins to cyborg.
So downstream can take the upstream code and customize its own mdev extend ARQ and vendor driver. Here vGPU is just an example; it can be other mdev devices.
Yes, and because this can be used for other devices, I would counter-propose that we create a stateless mdev driver to cover devices that do not require programming or state management, with a config-driven interface to map mdev types to custom resource classes and/or traits. We also need to declare per device whether it supports independent pools of each mdev type or whether they consume the same resource, i.e. does it work like NVIDIA's V100s, where you can only have 1 mdev type per physical device, or does it work like the sample device, where you can have 1 device and multiple mdev types that can be consumed in parallel.
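To sketch what "config driven" could mean for such a stateless driver: a per-device record holding the whitelist (mdev type to custom resource class) plus a flag saying whether the types are independent pools. Everything here (`MdevDeviceConfig`, `reportable_inventory`, `still_allocatable`, the field names) is hypothetical and only illustrates the shape of the data, not an agreed cyborg.conf format.

```python
from dataclasses import dataclass


@dataclass
class MdevDeviceConfig:
    address: str                     # parent device, e.g. a PCI address
    type_to_resource_class: dict     # whitelist: mdev type -> CUSTOM_ class
    independent_pools: bool = False  # False: one active type per device


def reportable_inventory(cfg, discovered):
    """Keep only whitelisted types; key the counts by resource class."""
    return {
        cfg.type_to_resource_class[t]: count
        for t, count in discovered.items()
        if t in cfg.type_to_resource_class
    }


def still_allocatable(cfg, active_type):
    """Return the enabled types that can still be allocated."""
    enabled = sorted(cfg.type_to_resource_class)
    if cfg.independent_pools or active_type is None:
        return enabled
    return [t for t in enabled if t == active_type]


gpu = MdevDeviceConfig(
    "0000:84:00.0",
    {"nvidia-35": "CUSTOM_NVIDIA_35", "nvidia-36": "CUSTOM_NVIDIA_36"})
sample = MdevDeviceConfig(
    "mdpy",
    {"mdpy-vga": "CUSTOM_MDPY_VGA", "mdpy-xga": "CUSTOM_MDPY_XGA"},
    independent_pools=True)
```

The two example records correspond to the two behaviours described above: a single-active-type GPU and a sample device whose types can be consumed in parallel.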
Both approaches are valid, although I personally prefer independent pools since that is easier to reason about.
You could also support a non-config-driven approach where we use attributes on the deployable to describe the mapping of the mdev type to resource class, and whether it is independently consumable too I guess, but that seems much more cumbersome to manage. [Feng, Shaohe] Both the stateless mdev driver and the non-config-driven approach can be supported. If these cannot satisfy users, they can add their own special mdev driver.
Well, if I put my downstream hat on, in terms of productization we are currently debating whether we want to include cyborg in a future release of Red Hat OpenStack Platform. At the moment it is not planned for our next major release, which is OSP 17, and it is under consideration for OSP 18. One of the concerns we have with adding cyborg to a future release is the lack of a config-driven approach. It is not a blocker, but having an API-only approach, while it has some advantages, also has several drawbacks, not least of which are supportability and day 1 and day 2 operational complexity.
For example, we have a tool that customers use when reporting bugs, called sosreport, which automates the collection of logs, config and other system information. It can also invoke some commands like virsh list etc., but adding a new config and log to collect is significantly less work than adding a new command that needs to discover the node UUID and then query the placement and cyborg APIs to determine what accelerators are available and how they are configured. So API-only configured services are harder for distros to support from a day 2 perspective when things go wrong. From a day 1 perspective it is also simpler for installer tools to template out a config file than it is to invoke commands imperatively against an API.
So I understand that from an operator's perspective invoking an API to do all config management remotely might be quite appealing, but it comes with trade-offs. Upstream should not be governed by downstream distro concerns, but we should be mindful not to make it overly hard to deploy and manage the service, as that raises the barrier to integration, and if that is too high it can result in the service not being supported in the long run. That said, having a service that you just deploy with no configuration to do would be nice too, but if we tell our users that after they deploy cyborg they must then iterate over every deployable and decide whether they should enable it and what attributes to add to it so that it schedules correctly, I think the complexity might be too high for many.
BR Shaohe Feng
Hi Shaohe and Sean,
Thanks for bringing up this discussion.
1. About the mdev whitelist: I agree with and support the idea that cyborg should create a generic mdev driver and support whitelist configuration of allowed mdev types and devices.
2. Report the number of allocable resources to the resource provider inventory: I kind of prefer that we report 1 inventory per VGPU. That is to say, the admin configures the supported mdev type for the device in the cyborg.conf file, then cyborg reports the available_instances of the single selected mdev type to the resource provider inventory. For the alternative, if we report 1 inventory per mdev type, that means: 1) we need to report all types, but only one inventory is available, and we set reserved = total for all the remaining mdev types; 2) when the admin reconfigures the mdev type, we still need to update the newly selected type to available while the others remain reserved. This sounds like we would report quite a lot of redundant data to placement, when actually we care about the inventory of the selected type more than the other types.
3.1 The driver should also support a function to create a certain mdev type, such as (V100D-1Q, V100D-2Q)
Well, not really. It will echo a UUID into a file in sysfs, which will trigger the vendor driver to create the mdev, but cyborg is not actually linking to the vendor driver and invoking a C function directly.
Yes, in the cyborg management cycle, when reporting an mdev device, it just generates UUID(s) for this mdev but does not really create the mdev. I think we can create mdevs in two possible ways:
- solution1: A certain mdev type will not actually be created until nova-compute starts to write mdev info to the XML and the virt driver spawns the VM. We can extend the current accel_info attachment type to handle both add_accel_pci_device and add_accel_mdev_device in nova/virt/libvirt/driver.py, where before add_accel_mdev_device we can create the mdev by just calling nova.privsep.libvirt.create_mdev.
- solution2: During the ARQ binding process, cyborg creates the mdev in a way similar to nova.privsep.libvirt.create_mdev.
3.2 One more thing needs to be mentioned: we should avoid conflict with the existing mdev management logic in nova, so we may need to introduce an acc_mdev_flag here to check whether the mdev request is from cyborg or not.
3.3 xml generation
Finally, we need to support an mdev handler for XML generation in nova; we can refer to the cyborg PCI handler in nova.
Yes, I brought this up at the nova PTG as something we will likely need to add support for. As I was suggesting above, this can be done by adding a new attachment type, mdev, that just contains the UUID instead of the PCI address.
+1, agree. Supporting the new attachment type makes sense.
4. mdev fake driver support
IMHO, cyborg can support an mdev fake driver (similar to the current FPGA fake driver) for the mdev attachment type support in nova. Or maybe we can extend the current fake driver to support both mdev and PCI devices.
In parallel, if the fake driver can be extended to support mdev attachments, we can also use that as another way to validate the interaction. The difference between the two approaches is that using the mdpy or mtty kernel modules would allow us to actually use the mdev attachment and add the mdev to a VM, whereas with the fake driver approach we would not add the fake mdev to the libvirt XML or to the VM. The mdpy and mtty sample kernel modules are intended for use in testing the mdev framework without any hardware requirement, so I think that is perfect for our CI system. It is more work than just extending the fake driver, but it has several benefits.
@Sean: "the mdpy and mtty sample kernel modules are intended for use in testing the mdev framework without any hardware requirement, so I think that is perfect for our CI system." Does this mean this can also be an easier way to do third-party CI for mdev devices?
best regards, Yumeng
On Thursday, June 25, 2020, 08:43:11 PM GMT+8, Sean Mooney <smooney@redhat.com> wrote: On Thu, 2020-06-25 at 00:31 +0000, Feng, Shaohe wrote:
Hi Yumeng and Xin-ran: Not sure whether you noticed that Sean Mooney brought up nova supporting an mdev attachment type at the nova PTG, as follows:
Yes, I brought this up at the nova PTG as something we will likely need to add support for. As I was suggesting above, this can be done by adding a new attachment type, mdev, that just contains the UUID instead of the PCI address.
IMHO, cyborg can support an mdev fake driver (similar to the current FPGA fake driver) for the mdev attachment type support in nova. Or maybe we can extend the current fake driver to support both mdev and PCI devices.
Cyborg could support a fake mdev driver, yes; however, I do think adding support to deploy with the mdpy driver also makes sense. A VM booted with a GPU backed by an mdpy mdev actually gets a functional frame buffer and you can view it in the default VNC console. I have manually verified this.
I think this is out of scope for cyborg, however. What I was planning to do, if I ever get the time, is to create an mdpy devstack plugin that would compile and deploy the kernel module. With that we can just add the plugin to a zuul job and then whitelist the mdev types either in nova or cyborg to do testing with it and the proposed stateless mdev driver. In parallel, if the fake driver can be extended to support mdev attachments, we can also use that as another way to validate the interaction. The difference between the two approaches is that using the mdpy or mtty kernel modules would allow us to actually use the mdev attachment and add the mdev to a VM, whereas with the fake driver approach we would not add the fake mdev to the libvirt XML or to the VM. The mdpy and mtty sample kernel modules are intended for use in testing the mdev framework without any hardware requirement, so I think that is perfect for our CI system. It is more work than just extending the fake driver, but it has several benefits.
BR Shaohe
-----Original Message----- From: Sean Mooney <smooney@redhat.com> Sent: 2020年6月23日 21:38 To: Feng, Shaohe <shaohe.feng@intel.com>; openstack-discuss@lists.openstack.org Cc: yumeng_bao@yahoo.com; shhfeng@126.com Subject: Re: [cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)
On Tue, 2020-06-23 at 12:25 +0000, Feng, Shaohe wrote:
-----Original Message----- From: Sean Mooney <smooney@redhat.com> Sent: 2020年6月23日 19:48 To: Feng, Shaohe <shaohe.feng@intel.com>; openstack-discuss@lists.openstack.org Cc: yumeng_bao@yahoo.com; shhfeng@126.com Subject: Re: [cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)
On Tue, 2020-06-23 at 05:50 +0000, Feng, Shaohe wrote:
Hi all,
Currently OpenStack supports vGPU as follows: https://docs.openstack.org/nova/latest/admin/virtual-gpu.html
In order to support it, the admin must plan ahead and configure the vGPU before deployment, as follows: https://docs.openstack.org/nova/latest/configuration/config.html#devices.enabled_vgpu_types This is very inconvenient for the administrator, and this method has the limitation that the same PCI address cannot provide two different types.
That is a matter of perspective. There are those who prefer to check all of their configuration into git and have a declarative deployment, and those who wish to drive everything via the API. For the former, having to configure available resources via cyborg's API or placement would be considered very inconvenient.
Cyborg as an accelerator management tool is more suitable for mdev device management.
Maybe, but maybe not. I do think that cyborg should support mdevs; I do not think we should have a dedicated vGPU mdev driver, however. I think we should create a stateless mdev driver that uses a similar whitelist of allowed mdev types and devices.
On Fri, 2020-06-26 at 05:39 +0000, yumeng bao wrote:
Hi Shaohe and Sean,
Thanks for bringing up this discussion.
1. About the mdev whitelist: I agree with and support the idea that cyborg should create a generic mdev driver and support whitelist configuration of allowed mdev types and devices.
2. Report the number of allocable resources to the resource provider inventory: I kind of prefer that we report 1 inventory per VGPU. That is to say, the admin configures the supported mdev type for the device in the cyborg.conf file, then cyborg reports the available_instances of the single selected mdev type to the resource provider inventory.
For the alternative, if we report 1 inventory per mdev type, that means: 1) we need to report all types, but only one inventory is available, and we set reserved = total for all the remaining mdev types; 2) when the admin reconfigures the mdev type, we still need to update the newly selected type to available while the others remain reserved. This sounds like we would report quite a lot of redundant data to placement, when actually we care about the inventory of the selected type more than the other types.
3.1 The driver should also support a function to create a certain mdev type, such as (V100D-1Q, V100D-2Q)
Well, not really. It will echo a UUID into a file in sysfs, which will trigger the vendor driver to create the mdev, but cyborg is not actually linking to the vendor driver and invoking a C function directly.
Yes, in the cyborg management cycle, when reporting an mdev device, it just generates UUID(s) for this mdev but does not really create the mdev. I think we can create mdevs in two possible ways:
Yes, I think both will work.
- solution1: A certain mdev type will not actually be created until nova-compute starts to write mdev info to the XML and the virt driver spawns the VM. We can extend the current accel_info attachment type to handle both add_accel_pci_device and add_accel_mdev_device in nova/virt/libvirt/driver.py, where before add_accel_mdev_device we can create the mdev by just calling nova.privsep.libvirt.create_mdev.
In this case we would probably want the mdev attachment type to contain the UUID of the mdev to create.
- solution2: During the ARQ binding process, cyborg creates the mdev in a way similar to nova.privsep.libvirt.create_mdev.
And in this case the attachment type would contain the mdev path or UUID. This is my preference, as it allows cyborg to create mdevs that are stable across reboots if it wants to. The mdev UUID can change in nova's current code. At the moment that does not matter too much, but it might be nice to, for example, use the deployable object UUID as the mdev UUID; that way it would be easy to correlate between the two.
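A minimal sketch of what an mdev attach handle could carry under solution2, reusing the deployable UUID as the mdev UUID. The field names `attach_type` and `attach_info`, the `"MDEV"` value and the helper `mdev_attach_handle` are assumptions for illustration; only the `/sys/bus/mdev/devices/<uuid>` path is the standard kernel location of a created mdev.

```python
import json
import uuid


def mdev_attach_handle(deployable_uuid):
    """Build a hypothetical attach handle for an mdev.

    The deployable UUID is validated and reused as the mdev UUID, so
    the two can be correlated and the mdev keeps the same identity
    across reboots.
    """
    mdev_uuid = str(uuid.UUID(deployable_uuid))  # raises on bad input
    info = {
        "mdev_uuid": mdev_uuid,
        "sysfs_path": "/sys/bus/mdev/devices/" + mdev_uuid,
    }
    return {"attach_type": "MDEV", "attach_info": json.dumps(info)}


handle = mdev_attach_handle("83b8f4f2-509f-382f-3c1e-e6bfe0fa1001")
```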
3.2 One more thing needs to be mentioned: we should avoid conflict with the existing mdev management logic in nova, so we may need to introduce an acc_mdev_flag here to check whether the mdev request is from cyborg or not.
3.3 xml generation
Finally, we need to support an mdev handler for XML generation in nova; we can refer to the cyborg PCI handler in nova.
Yes, I brought this up at the nova PTG as something we will likely need to add support for. As I was suggesting above, this can be done by adding a new attachment type, mdev, that just contains the UUID instead of the PCI address.
+1, agree. Supporting the new attachment type makes sense.
4. mdev fake driver support
IMHO, cyborg can support an mdev fake driver (similar to the current FPGA fake driver) for the mdev attachment type support in nova. Or maybe we can extend the current fake driver to support both mdev and PCI devices.
In parallel, if the fake driver can be extended to support mdev attachments, we can also use that as another way to validate the interaction. The difference between the two approaches is that using the mdpy or mtty kernel modules would allow us to actually use the mdev attachment and add the mdev to a VM, whereas with the fake driver approach we would not add the fake mdev to the libvirt XML or to the VM. The mdpy and mtty sample kernel modules are intended for use in testing the mdev framework without any hardware requirement, so I think that is perfect for our CI system. It is more work than just extending the fake driver, but it has several benefits.
@Sean: "the mdpy and mtty sample kernel modules are intended for use in testing the mdev framework without any hardware requirement, so I think that is perfect for our CI system." Does this mean this can also be an easier way to do third-party CI for mdev devices?
I think if we use the mdpy or mtty sample kernel modules we would not need a third-party CI and could fully test mdev support in the first-party CI. Third-party CI would then only be required for stateful mdev devices that need a custom driver to do some initial programming or cleanup of the device that the generic mdev driver could not do.

The mtty sample module https://github.com/torvalds/linux/blob/master/samples/vfio-mdev/mtty.c emulates a virtual serial port that basically acts as an echo server. If you create the device

    # echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001" > \
        /sys/devices/virtual/mtty/mtty/mdev_supported_types/mtty-2/create

and add it to QEMU

    -device vfio-pci,\
     sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001

then it will show up in the guest as a PCI device with vendor and product ID 4348:3253:

    # lspci -s 00:05.0 -xxvv
    00:05.0 Serial controller: Device 4348:3253 (rev 10) (prog-if 02 [16550])
            Subsystem: Device 4348:3253
            Physical Slot: 5
            Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
            Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
            Interrupt: pin A routed to IRQ 10
            Region 0: I/O ports at c150 [size=8]
            Region 1: I/O ports at c158 [size=8]
            Kernel driver in use: serial
    00: 48 43 53 32 01 00 00 02 10 02 00 07 00 00 00 00
    10: 51 c1 00 00 59 c1 00 00 00 00 00 00 00 00 00 00
    20: 00 00 00 00 00 00 00 00 00 00 00 00 48 43 53 32
    30: 00 00 00 00 00 00 00 00 00 00 00 00 0a 01 00 00

In the Linux guest VM, dmesg output for the device is as follows:

    serial 0000:00:05.0: PCI INT A -> Link[LNKA] -> GSI 10 (level, high) -> IRQ 10
    0000:00:05.0: ttyS1 at I/O 0xc150 (irq = 10) is a 16550A
    0000:00:05.0: ttyS2 at I/O 0xc158 (irq = 10) is a 16550A

You can then use minicom or any serial console application to write data to ttyS1 or ttyS2 in the guest, and the host mdev module will loop it back, acting as an echo server.
So we should be able to add an optional tempest test to the Cyborg tempest plugin to fully validate the end-to-end functioning of generic mdev support, including SSHing into a VM that is using an mtty serial port and validating that it loops back data, fully testing the feature in the first-party CI. All we need to do is compile and modprobe the mtty device using a devstack plugin in the gate job.
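The guest-side check such a tempest test might run can be sketched as below. It assumes a `port` object with `write`, `flush` and `read` methods; in a real guest that would be a raw, unbuffered handle on /dev/ttyS1 with suitable terminal settings, which is not shown here. The function name `loops_back` and the default payload are hypothetical.

```python
def loops_back(port, payload=b"cyborg-mdev-test\n"):
    """Write payload to the port and report whether the same bytes return.

    port is any object with write(), flush() and read(n); for mtty this
    would wrap the guest serial device backed by the mdev.
    """
    port.write(payload)
    port.flush()
    # An echo device hands back exactly what was written.
    return port.read(len(payload)) == payload
```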
best regards, Yumeng
On Thursday, June 25, 2020, 08:43:11 PM GMT+8, Sean Mooney <smooney@redhat.com> wrote:
On Thu, 2020-06-25 at 00:31 +0000, Feng, Shaohe wrote:
Hi Yumeng and Xin-ran, not sure whether you noticed that Sean Mooney brought up Nova support for an mdev attachment type at the Nova PTG, as follows:
Yes, I brought this up at the Nova PTG as something we will likely need to add support for. As I was suggesting above, this can be done by adding a new attachment type, mdev, that just contains the UUID instead of the PCI address.
IMHO, Cyborg can support an mdev fake driver (similar to the current FPGA fake driver) for the mdev attachment type support in Nova. Or maybe we can extend the current fake driver to support both mdev and PCI devices.
Cyborg could support a fake mdev driver, yes; however, I do think adding support to deploy with the mdpy driver also makes sense. A VM booted with a GPU backed by an mdpy mdev actually gets a functional frame buffer, and you can view it in the default VNC console. I have manually verified this.
I think this is out of scope for Cyborg, however. What I was planning to do, if I ever get the time, is to create an mdpy devstack plugin that would compile and deploy the kernel module. With that we can just add the plugin to a Zuul job and then whitelist the mdev types either in Nova or Cyborg to do testing with it and the proposed stateless mdev driver.
In parallel, if the fake driver can be extended to support mdev attachments, we can also use that as another way to validate the interaction. The difference between the two approaches is that using the mdpy or mtty kernel modules would allow us to actually use the mdev attachment and add the mdev to a VM, whereas with the fake driver approach we would not add the fake mdev to the libvirt XML or to the VM. The mdpy and mtty sample kernel modules are intended for testing the mdev framework without any hardware requirement, so I think that is perfect for our CI system. It's more work than just extending the fake driver, but it has several benefits.
BR Shaohe
-----Original Message-----
From: Sean Mooney <smooney@redhat.com>
Sent: June 23, 2020 21:38
To: Feng, Shaohe <shaohe.feng@intel.com>; openstack-discuss@lists.openstack.org
Cc: yumeng_bao@yahoo.com; shhfeng@126.com
Subject: Re: [cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)
On Tue, 2020-06-23 at 12:25 +0000, Feng, Shaohe wrote:
-----Original Message-----
From: Sean Mooney <smooney@redhat.com>
Sent: June 23, 2020 19:48
To: Feng, Shaohe <shaohe.feng@intel.com>; openstack-discuss@lists.openstack.org
Cc: yumeng_bao@yahoo.com; shhfeng@126.com
Subject: Re: [cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)
On Tue, 2020-06-23 at 05:50 +0000, Feng, Shaohe wrote:
Hi all,
Currently OpenStack supports vGPU as follows: https://docs.openstack.org/nova/latest/admin/virtual-gpu.html
In order to support it, the admin should plan ahead and configure the vGPU types before deployment, as follows: https://docs.openstack.org/nova/latest/configuration/config.html#devices.enabled_vgpu_types This is very inconvenient for the administrator; this method has the limitation that the same PCI address cannot provide two different types.
That is a matter of perspective. There are those who prefer to check all of their configuration into git and have a declarative deployment, and those who wish to drive everything via the API. For the former, having to configure available resources via Cyborg's API or placement would be considered very inconvenient.
Cyborg as an accelerator management tool is more suitable for mdev device management.
Maybe, but maybe not. I do think that Cyborg should support mdevs; I do not think we should have a dedicated vGPU mdev driver, however. I think we should create a stateless mdev driver that uses a similar whitelist of allowed mdev types and devices.
We did briefly discuss adding generic mdev support to Nova in a future release (W), or whether we should delegate that to Cyborg, but I would hesitate to do that if Cyborg continues its current design where drivers have no configuration element to whitelist devices, as it makes it much harder for deployment tools to properly configure a host declaratively. [Feng, Shaohe] We do support config; for example, in our demo for FPGA pre-programming, we support config for our new drivers. As for other accelerators, the infra may also need accelerators for acceleration, not only VMs. For example, Cinder can use QAT for compression/crypto, and VMs can also use QAT. We need to configure which QATs are for infra and which are for VMs.
Yes, QAT is a good example of where sharing between host usage (the Cinder service) and guest usage (VMs) could be required. The same could be true of GPUs. Typically many servers run headless, but not always, and sometimes you will want to reserve a GPU for the host to use. NICs are another good example: when we look at generic PCI passthrough we need to select which NICs will be used by the VMs and which will be used for host connectivity or for hardware-offloaded OVS.
One solution is as follows. Firstly, we need a vendor driver (this can be a plugin); it is used to discover its special devices and report them to placement for scheduling. The difference from the current implementation is that:

1. Report the mdev_supported_types as traits to the resource provider. How to discover a GPU type:

    $ ls /sys/class/mdev_bus/*/mdev_supported_types
    /sys/class/mdev_bus/0000:84:00.0/mdev_supported_types:
    nvidia-35 nvidia-36 nvidia-37 nvidia-38 nvidia-39 nvidia-40 nvidia-41 nvidia-42 nvidia-43 nvidia-44 nvidia-45

so here we report nvidia-3*, nvidia-4* as traits to the resource provider.

2. Report the number of allocatable resources instead of vGPU unit numbers to the resource provider inventory. Example for the NVIDIA V100 PCIe card (one GPU per board):

    Virtual GPU Type   Frame Buffer (Gbytes)   Maximum vGPUs per GPU   Maximum vGPUs per Board
    V100D-32Q          32                      1                       1
    V100D-16Q          16                      2                       2
    V100D-8Q           8                       4                       4
    V100D-4Q           4                       8                       8
    V100D-2Q           2                       16                      16
    V100D-1Q           1                       32                      32

so here we report 32G of buffers (an example; it may be other resources) to the resource provider inventory.
In this specific example that would not be a good idea. The V100 does not support mixing mdev types on the same GPU, so if you allocate a V100D-16Q instance using 16G of the buffer you cannot then allocate two V100D-8Q vGPU instances to consume the remaining 16G. Other mdev-based devices may not have this limitation, but NVIDIA only supports one active mdev type per physical GPU.
Note that the Ampere generation has a dynamic SR-IOV-based multi-instance GPU technology, which kind of allows resource-based subdivision of the device, but it does not quite work the way you are describing above.
So you can report inventories of custom resource classes for each of the mdev types, or a single inventory of VGPU with traits modelling the available mdevs.
With the trait approach, before a vGPU is allocated you report all traits, and the total count for the inventory would be the highest amount, e.g. 32 in the case above. Then, when a GPU is allocated, you need to update the reserved value and remove the other traits.
[Feng, Shaohe] Since the V100 does not support mixing mdev types, the other traits need to be removed. So is there any suggestion for how a generic driver can support both mixed-type and single-type mdev devices?
If you have one inventory per mdev type, then you set reserved = total for all inventories of the other mdev types, with no need for traits. [Feng, Shaohe] Oh, really sorry, I should have chosen a better example.
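The "one inventory per mdev type, reserved = total for the others" idea can be sketched as follows. This is only an illustration under assumptions: the function name is hypothetical, the resource class names are made-up CUSTOM_ names, and only the `total` and `reserved` inventory fields are shown.

```python
def inventories_for_single_type_device(totals, active_type=None):
    """Build per-type inventories for a device with one active mdev type.

    totals maps a resource class name to the maximum number of mdevs of
    that type. Once active_type is set, every other type is fully reserved.
    """
    inventories = {}
    for resource_class, total in totals.items():
        blocked = active_type is not None and resource_class != active_type
        inventories[resource_class] = {
            "total": total,
            # reserved == total leaves nothing allocatable for this type.
            "reserved": total if blocked else 0,
        }
    return inventories
```

Before any allocation every type is offered; after the first V100D-16Q allocation, calling the function with that type as `active_type` blocks the 8Q, 4Q and other inventories without touching traits.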
The sample mdpy kernel module, which creates a basic virtual graphics device, supports multiple mdev types for different resolutions: https://github.com/torvalds/linux/blob/f97c81dc6ca5996560b3944064f63fc87eb18... I believe it also supports consuming each mdev type independently. So if you don't want to use real hardware as an example, there are at least sample devices that support having multiple active mdevs. I would also suggest we use this device for testing in the upstream gate.
I started creating a job to test Nova's vGPU support with this sample device a few months back, but we need to make a few small changes to make it work, and I was not sure it was appropriate to modify the Nova code just to get CI working with a fake device. Currently we make one assumption, that the parent of the mdev is a PCI device, which is not true in the kernel module case.
But from a Cyborg perspective you can avoid that mistake, since mdevs can be created for devices on any bus, like USB or UPI as well as PCIe. [Feng, Shaohe] Good suggestion.
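Discovery that does not assume a PCI parent can be sketched by walking /sys/class/mdev_bus and treating the parent name as an opaque string. The function name and the `mdev_bus_root` parameter are hypothetical (the parameter exists so the sketch can be pointed at a scratch directory); `available_instances` is the per-type sysfs attribute, and it is reported as None if it cannot be read.

```python
import glob
import os


def discover_mdev_types(mdev_bus_root="/sys/class/mdev_bus"):
    """Map each mdev parent device to its supported types.

    Returns {parent_name: {mdev_type: available_instances or None}}.
    The parent name is not parsed, so non-PCI parents work as well.
    """
    found = {}
    pattern = os.path.join(mdev_bus_root, "*", "mdev_supported_types", "*")
    for type_dir in sorted(glob.glob(pattern)):
        parent = os.path.basename(os.path.dirname(os.path.dirname(type_dir)))
        try:
            with open(os.path.join(type_dir, "available_instances")) as f:
                available = int(f.read().strip())
        except (OSError, ValueError):
            available = None
        found.setdefault(parent, {})[os.path.basename(type_dir)] = available
    return found
```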
3. The driver should also support a function to create a certain mdev type, such as (V100D-1Q, V100D-2Q). Secondly, we need an mdev extended ARQ (it can be a plugin):
No, it should not be a plugin; it should just be another attachment type.
Here is an example for the FPGA ext ARQ: https://review.opendev.org/#/c/681005/26/cyborg/objects/extarq/fpga_ext_arq.py@206 The difference is that we replace _do_programming with _do_create_mdev. _do_programming is used to create a new FPGA function. _do_create_mdev is used to create an mdev of a new type; it will call the implementation function in the vendor driver.
Well, not really. It will echo a UUID into a file in sysfs, which will trigger the vendor driver to create the mdev, but Cyborg is not actually linking to the vendor driver and invoking a C function directly.
Finally, we need to support an mdev handler for XML generation in Nova; we can refer to the Cyborg PCI handler in Nova.
Yes, I brought this up at the Nova PTG as something we will likely need to add support for. As I was suggesting above, this can be done by adding a new attachment type, mdev, that just contains the UUID instead of the PCI address.
So after the above changes, the admin can create device profiles for different SLAs, such as:

    {"name": "Gold_vGPU",
     "groups": [
         {"resources:vGPU_BUFFERS": "16",
          "trait:V100D-16Q": "required"}]
    }

and

    {"name": "Iron_vGPU",
     "groups": [
         {"resources:vGPU_BUFFERS": "1",
          "trait:V100D-1Q": "required"}]
    }

Then a tenant can use Gold_vGPU to create a VM with a V100D-16Q vGPU, and another tenant can use Iron_vGPU to create a VM with a V100D-1Q vGPU.
It cannot do this on the same physical GPUs, but yes, that could work, or you could do {"name": "Gold_vGPU", "groups": [{"resources:CUSTOM_V100D-16Q": "1"}]}
Currently for Nova we just do resources:VGPU and you can optionally do trait:CUSTOM_
By the way, we cannot use standard resource classes or traits for the mdev types, as these are arbitrary strings chosen by the vendor that can potentially change based on the kernel or driver version, so we should not add them to os-traits or os-resource-classes and instead should use CUSTOM_ resource classes or traits for them. [Feng, Shaohe] Yes, we will use CUSTOM_ for them.
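Since mdev type names are arbitrary vendor strings, a driver would need some mapping to valid CUSTOM_ names, which may contain only upper-case letters, digits and underscores. A minimal sketch, with the hypothetical helper name `to_custom_name`:

```python
import re


def to_custom_name(mdev_type):
    """Turn a vendor mdev type string into a CUSTOM_ trait or class name."""
    # Upper-case first, then replace every character outside A-Z, 0-9, _.
    normalised = re.sub(r"[^A-Z0-9_]", "_", mdev_type.upper())
    return "CUSTOM_" + normalised
```

With this mapping, nvidia-35 becomes CUSTOM_NVIDIA_35 and V100D-16Q becomes CUSTOM_V100D_16Q. Two vendor strings that differ only in punctuation would collide, so a real driver would have to check for that.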
During ARQ binding at VM creation, Cyborg will call the vendor driver to create the expected mdev vGPU. And these two mdev vGPUs can be on the same physical GPU card.
The mdev extended ARQ and vendor driver can be plugins; they are loosely coupled with the upstream code.
It should not be a plugin; it's a generic virtualisation attachment model that can be used by any device. We should standardise the attachment handle in core Cyborg and add support for that attachment model in Nova. We already have support for generating the mdev XML, so we would only need to wire up the handling for the attachment type in the code that currently handles the PCI attachment type. [Feng, Shaohe] Yes, we will support a standard generic ARQ, and only extend the ARQ for some special accelerators; FPGA is an example.
I am not sure we need to extend the ARQ for FPGA, but perhaps at some point. Nova does not support plugins in general, so as long as the info we need to request or receive does not vary based on Cyborg plugins I guess it could be OK, but outside of the driver layer I would personally avoid introducing plugins to Cyborg.
So downstream users can take the upstream code and customize their own mdev extended ARQ and vendor driver. Here vGPU is just an example; it can be other mdev devices.
Yes, and because this can be used for other devices, I would counter-propose that we create a stateless mdev driver to cover devices that do not require programming or state management, and have a config-driven interface to map mdev types to custom resource classes and/or traits. We also need to declare per device whether it supports independent pools of each mdev type or whether they consume the same resource, i.e. does it work like NVIDIA's V100s, where you can only have one mdev type per physical device, or does it work like the sample device, where you can have one device and multiple mdev types that can be consumed in parallel.
Both approaches are valid, although I personally prefer independent pools, since that is easier to reason about.
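What a config-driven whitelist for such a stateless driver might carry can be sketched as below. Everything here is hypothetical (the `MdevDeviceConfig` class, its field names, and `filter_whitelisted`); it only illustrates the three pieces of information discussed above: which parent device is allowed, how its mdev types map to custom resource classes, and whether the types form independent pools.

```python
from dataclasses import dataclass, field


@dataclass
class MdevDeviceConfig:
    parent: str  # sysfs name of the parent device
    type_to_resource_class: dict = field(default_factory=dict)
    independent_pools: bool = False  # False: one active type, V100-like


def filter_whitelisted(discovered, configs):
    """Keep only the configured (parent, mdev type) pairs.

    discovered maps parent -> iterable of mdev types found on the host.
    Returns parent -> {resource_class: mdev_type}.
    """
    by_parent = {c.parent: c for c in configs}
    report = {}
    for parent, types in discovered.items():
        cfg = by_parent.get(parent)
        if cfg is None:
            continue  # device not whitelisted at all
        for mdev_type in types:
            rc = cfg.type_to_resource_class.get(mdev_type)
            if rc is not None:
                report.setdefault(parent, {})[rc] = mdev_type
    return report
```

A deployment tool could template such entries into a config file, which is the declarative workflow argued for in this thread.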
You could also support a non-config-driven approach where we use attributes on the deployable to describe the mapping of the mdev type to resource class, and whether it is independently consumable too, I guess, but that seems much more cumbersome to manage. [Feng, Shaohe] Both the stateless mdev driver and the non-config-driven approach can be supported. If these cannot satisfy users, they can add their own special mdev driver.
Well, if I put my downstream hat on, in terms of productization we are currently debating whether we want to include Cyborg in a future release of Red Hat OpenStack Platform. At the moment it is not planned for our next major release, which is OSP 17, and it is under consideration for OSP 18. One of the concerns we have with adding Cyborg to a future release is the lack of a config-driven approach. It is not a blocker, but an API-only approach, while it has some advantages, also has several drawbacks, not least of which are supportability and day 1 and day 2 operational complexity.
For example, we have a tool that customers use when reporting bugs called sosreport, which automates the collection of logs, config and other system information. It can also invoke some commands like virsh list, etc., but adding a new config file and log to collect is significantly less work than adding a new command that needs to discover the node UUID and then query the placement and Cyborg APIs to determine what accelerators are available and how they are configured. So API-only configured services are harder for distros to support from a day 2 perspective when things go wrong. From a day 1 perspective it is also simpler for installer tools to template out a config file than it is to invoke commands imperatively against an API.
So I understand that from an operator's perspective invoking an API to do all config management remotely might be quite appealing, but it comes with trade-offs. Upstream should not be governed by downstream distro concerns, but we should be mindful not to make it overly hard to deploy and manage the service, as that raises the barrier to integration, and if that is too high it can result in the service not being supported in the long run. That said, having a service that you just deploy and have no configuration to do would be nice too, but if we tell our users that after they deploy Cyborg they must then iterate over every deployable and decide whether they should enable it and what attributes to add to it to schedule correctly, I think the complexity might be too high for many.
BR Shaohe Feng
Yes, mtty is a real mdev. We can support a real mdev driver for it first. If the fake driver can easily be extended to mdev, it will help Nova support the new mdev attachment type more quickly. There seems to be no other benefit. So, any suggestions on how the driver can support both mixed-type and single-type mdev devices? IMHO, for QoS/SLA, it is possible to support different mdev types on one physical device.
-----Original Message-----
From: Sean Mooney <smooney@redhat.com>
Sent: June 26, 2020 19:15
To: yumeng bao <yumeng_bao@yahoo.com>; Feng, Shaohe <shaohe.feng@intel.com>; openstack-discuss@lists.openstack.org; Wang, Xin-ran <xin-ran.wang@intel.com>
Cc: shhfeng@126.com
Subject: Re: [cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)
On Fri, 2020-06-26 at 05:39 +0000, yumeng bao wrote:
Hi Shaohe and Sean,
Thanks for bringing up this discussion.
1. about the mdev whitelist, I agree and support the idea that cyborg should create a generic mdev driver and support the whitelist configuration of allowed mdevtypes and devices.
2. Report the number of allocatable resources to the resource provider inventory. I kind of prefer that we report one inventory per VGPU. That is to say, the admin configures the supported mdev type for the device in the cyborg.conf file, then Cyborg reports the available_instances of the single selected mdev type to the resource provider inventory.
For the alternative, if we report one inventory per mdev type, that means: 1) we need to report all of them, but only one inventory is available, and set reserved = total for all the other mdev types; 2) when the admin reconfigures the mdev type, we still need to update the newly selected type to available, while the others remain reserved. This sounds like we would report quite a lot of redundant data to placement. But actually, we care more about the inventory of the selected type than the other types.
3.1 The driver should also support a function to create a certain mdev type, such as (V100D-1Q, V100D-2Q).
Well, not really. It will echo a UUID into a file in sysfs, which will trigger the vendor driver to create the mdev, but Cyborg is not actually linking to the vendor driver and invoking a C function directly.
Yes, in the Cyborg management cycle, when reporting an mdev device, it just generates UUID(s) for this mdev but does not really create the mdev. I think we can create mdevs in two possible ways:
Yes, I think both will work.
- solution1: A certain mdev type will not actually be created until nova-compute starts to write the mdev info to XML and the virt driver spawns the VM. We can extend the current accel_info attachment type to handle both add_accel_pci_device and add_accel_mdev_device in nova/virt/libvirt/driver.py, where before add_accel_mdev_device we can create the mdev by just calling nova.privsep.libvirt.create_mdev.
In this case we would probably want the mdev attachment type to contain the UUID of the mdev to create.
- solution2: during the ARQ binding process, Cyborg creates the mdev in a way similar to nova.privsep.libvirt.create_mdev
And in this case the attachment type would contain the mdev path or UUID. This is my preference, as it allows Cyborg to create mdevs that are stable across reboots if it wants to. The mdev UUID can change in Nova's current code. At the moment that does not matter too much, but it might be nice to, for example, use the deployable object's UUID as the mdev UUID; that way it would be easy to correlate the two.
3.2 One more thing needs to be mentioned: we should avoid conflicts with the existing mdev management logic in Nova, so we may need to introduce an acc_mdev_flag here to check whether the mdev request comes from Cyborg or not.
3.3 XML generation
Finally, we need to support an mdev handler for XML generation in Nova; we can refer to the Cyborg PCI handler in Nova.
Yes, I brought this up at the Nova PTG as something we will likely need to add support for. As I was suggesting above, this can be done by adding a new attachment type, mdev, that just contains the UUID instead of the PCI address.
+1, agree. Supporting the new attachment type makes sense.
4. mdev fake driver support
IMHO, Cyborg can support an mdev fake driver (similar to the current FPGA fake driver) for the mdev attachment type support in Nova. Or maybe we can extend the current fake driver to support both mdev and PCI devices.
In parallel, if the fake driver can be extended to support mdev attachments, we can also use that as another way to validate the interaction. The difference between the two approaches is that using the mdpy or mtty kernel modules would allow us to actually use the mdev attachment and add the mdev to a VM, whereas with the fake driver approach we would not add the fake mdev to the libvirt XML or to the VM. The mdpy and mtty sample kernel modules are intended for testing the mdev framework without any hardware requirement, so I think that is perfect for our CI system. It's more work than just extending the fake driver, but it has several benefits.
@Sean: "The mdpy and mtty sample kernel modules are intended for testing the mdev framework without any hardware requirement, so I think that is perfect for our CI system." Does this mean this can also be an easier way to do third-party CI for mdev devices?
I think if we use the mdpy or mtty sample kernel modules we would not need a third-party CI and could fully test mdev support in the first-party CI. Third-party CI would then only be required for stateful mdev devices that need a custom driver to do some initial programming or cleanup of the device that the generic mdev driver could not do.

The mtty sample module https://github.com/torvalds/linux/blob/master/samples/vfio-mdev/mtty.c emulates a virtual serial port that basically acts as an echo server. If you create the device

    # echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001" > \
        /sys/devices/virtual/mtty/mtty/mdev_supported_types/mtty-2/create

and add it to QEMU

    -device vfio-pci,\
     sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001

then it will show up in the guest as a PCI device with vendor and product ID 4348:3253:

    # lspci -s 00:05.0 -xxvv
    00:05.0 Serial controller: Device 4348:3253 (rev 10) (prog-if 02 [16550])
            Subsystem: Device 4348:3253
            Physical Slot: 5
            Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
            Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
            Interrupt: pin A routed to IRQ 10
            Region 0: I/O ports at c150 [size=8]
            Region 1: I/O ports at c158 [size=8]
            Kernel driver in use: serial
    00: 48 43 53 32 01 00 00 02 10 02 00 07 00 00 00 00
    10: 51 c1 00 00 59 c1 00 00 00 00 00 00 00 00 00 00
    20: 00 00 00 00 00 00 00 00 00 00 00 00 48 43 53 32
    30: 00 00 00 00 00 00 00 00 00 00 00 00 0a 01 00 00

In the Linux guest VM, dmesg output for the device is as follows:

    serial 0000:00:05.0: PCI INT A -> Link[LNKA] -> GSI 10 (level, high) -> IRQ 10
    0000:00:05.0: ttyS1 at I/O 0xc150 (irq = 10) is a 16550A
    0000:00:05.0: ttyS2 at I/O 0xc158 (irq = 10) is a 16550A

You can then use minicom or any serial console application to write data to ttyS1 or ttyS2 in the guest, and the host mdev module will loop it back, acting as an echo server.
So we should be able to add an optional tempest test to the Cyborg tempest plugin to fully validate the end-to-end functioning of generic mdev support, including SSHing into a VM that is using an mtty serial port and validating that it loops back data, fully testing the feature in the first-party CI. All we need to do is compile and modprobe the mtty device using a devstack plugin in the gate job.
best regards, Yumeng
On Thursday, June 25, 2020, 08:43:11 PM GMT+8, Sean Mooney <smooney@redhat.com> wrote:
On Thu, 2020-06-25 at 00:31 +0000, Feng, Shaohe wrote:
Hi Yumeng and Xin-ran, not sure whether you noticed that Sean Mooney brought up Nova support for an mdev attachment type at the Nova PTG, as follows:
Yes, I brought this up at the Nova PTG as something we will likely need to add support for. As I was suggesting above, this can be done by adding a new attachment type, mdev, that just contains the UUID instead of the PCI address.
IMHO, Cyborg can support an mdev fake driver (similar to the current FPGA fake driver) for the mdev attachment type support in Nova. Or maybe we can extend the current fake driver to support both mdev and PCI devices.
Cyborg could support a fake mdev driver, yes; however, I do think adding support to deploy with the mdpy driver also makes sense. A VM booted with a GPU backed by an mdpy mdev actually gets a functional frame buffer, and you can view it in the default VNC console. I have manually verified this.
I think this is out of scope for Cyborg, however. What I was planning to do, if I ever get the time, is to create an mdpy devstack plugin that would compile and deploy the kernel module. With that we can just add the plugin to a Zuul job and then whitelist the mdev types either in Nova or Cyborg to do testing with it and the proposed stateless mdev driver.
In parallel, if the fake driver can be extended to support mdev attachments, we can also use that as another way to validate the interaction. The difference between the two approaches is that using the mdpy or mtty kernel modules would allow us to actually use the mdev attachment and add the mdev to a VM, whereas with the fake driver approach we would not add the fake mdev to the libvirt XML or to the VM. The mdpy and mtty sample kernel modules are intended for testing the mdev framework without any hardware requirement, so I think that is perfect for our CI system. It's more work than just extending the fake driver, but it has several benefits.
BR Shaohe
-----Original Message-----
From: Sean Mooney <smooney@redhat.com>
Sent: June 23, 2020 21:38
To: Feng, Shaohe <shaohe.feng@intel.com>; openstack-discuss@lists.openstack.org
Cc: yumeng_bao@yahoo.com; shhfeng@126.com
Subject: Re: [cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)
On Tue, 2020-06-23 at 12:25 +0000, Feng, Shaohe wrote:
-----Original Message-----
From: Sean Mooney <smooney@redhat.com>
Sent: June 23, 2020 19:48
To: Feng, Shaohe <shaohe.feng@intel.com>; openstack-discuss@lists.openstack.org
Cc: yumeng_bao@yahoo.com; shhfeng@126.com
Subject: Re: [cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)
On Tue, 2020-06-23 at 05:50 +0000, Feng, Shaohe wrote:
Hi all,
Currently OpenStack supports vGPU as follows: https://docs.openstack.org/nova/latest/admin/virtual-gpu.html
In order to support it, the admin should plan ahead and configure the vGPU types before deployment, as follows: https://docs.openstack.org/nova/latest/configuration/config.html#devices.enabled_vgpu_types This is very inconvenient for the administrator; this method has the limitation that the same PCI address cannot provide two different types.
That is a matter of perspective. There are those who prefer to check all of their configuration into git and have a declarative deployment, and those who wish to drive everything via the API. For the former, having to configure available resources via Cyborg's API or placement would be considered very inconvenient.
Cyborg as an accelerator management tool is more suitable for mdev device management.
Maybe, but maybe not. I do think that Cyborg should support mdevs; I do not think we should have a dedicated vGPU mdev driver, however. I think we should create a stateless mdev driver that uses a similar whitelist of allowed mdev types and devices.
We did briefly discuss adding generic mdev support to Nova in a future release (W), or whether we should delegate that to Cyborg, but I would hesitate to do that if Cyborg continues its current design where drivers have no configuration element to whitelist devices, as it makes it much harder for deployment tools to properly configure a host declaratively. [Feng, Shaohe] We do support config; for example, in our demo for FPGA pre-programming, we support config for our new drivers. As for other accelerators, the infra may also need accelerators for acceleration, not only VMs. For example, Cinder can use QAT for compression/crypto, and VMs can also use QAT. We need to configure which QATs are for infra and which are for VMs.
Yes, QAT is a good example of where sharing between host usage (the Cinder service) and guest usage (VMs) could be required. The same could be true of GPUs. Typically many servers run headless, but not always, and sometimes you will want to reserve a GPU for the host to use. NICs are another good example: when we look at generic PCI passthrough we need to select which NICs will be used by the VMs and which will be used for host connectivity or for hardware-offloaded OVS.
One solution is as follows. Firstly, we need a vendor driver (this can be a plugin); it is used to discover its special devices and report them to placement for scheduling. The difference from the current implementation is that:

1. Report the mdev_supported_types as traits to the resource provider. How to discover a GPU type:

    $ ls /sys/class/mdev_bus/*/mdev_supported_types
    /sys/class/mdev_bus/0000:84:00.0/mdev_supported_types:
    nvidia-35 nvidia-36 nvidia-37 nvidia-38 nvidia-39 nvidia-40 nvidia-41 nvidia-42 nvidia-43 nvidia-44 nvidia-45

so here we report nvidia-3*, nvidia-4* as traits to the resource provider.

2. Report the number of allocatable resources instead of vGPU unit numbers to the resource provider inventory. Example for the NVIDIA V100 PCIe card (one GPU per board):

    Virtual GPU Type   Frame Buffer (Gbytes)   Maximum vGPUs per GPU   Maximum vGPUs per Board
    V100D-32Q          32                      1                       1
    V100D-16Q          16                      2                       2
    V100D-8Q           8                       4                       4
    V100D-4Q           4                       8                       8
    V100D-2Q           2                       16                      16
    V100D-1Q           1                       32                      32

so here we report 32G of buffers (an example; it may be other resources) to the resource provider inventory.
In this specific example that would not be a good idea. The V100 does not support mixing mdev types on the same GPU, so if you allocate a V100D-16Q instance using 16G of the buffer you cannot then allocate 2 V100D-8Q vGPU instances to consume the remaining 16G. Other mdev-based devices may not have this limitation, but NVIDIA only supports 1 active mdev type per physical GPU.
Note that the Ampere generation has a dynamic SR-IOV based multi-instance GPU technology which kind of allows resource-based subdivision of the device, but it does not quite work the way you are describing above.
So you can report inventories of custom resource classes for each of the mdev types, or a single inventory of VGPU with traits modelling the available mdevs.
With the trait approach, before a vGPU is allocated you report all traits and the total count for the inventory would be the highest amount, e.g. 32 in the case above. Then when a GPU is allocated you need to update the reserved value and remove the other traits.
[Feng, Shaohe] Since the V100 does not support mixing mdev types, the other traits need to be removed. So is there any suggestion about how a generic driver can support both mixed-type and single-type mdev devices?
If you have 1 inventory per mdev type then you set reserved = total for all inventories of the other mdev types, but there is no need for traits. [Feng, Shaohe] Oh, really sorry, I should have chosen a better example.
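The "1 inventory per mdev type, reserved = total for the others" idea can be sketched in a few lines of Python. This is only an illustration: build_inventories and its arguments are hypothetical names, not existing cyborg or nova code, and the dicts merely resemble placement inventories.

```python
def build_inventories(max_instances, active_type=None):
    """Return one placement-style inventory per mdev type.

    max_instances maps an mdev type name to the maximum number of
    instances the physical device can provide for that type.
    If active_type is set (a single-type device that already has one
    type in use), every other type is fully reserved so that nothing
    more can be allocated from it.
    """
    inventories = {}
    for mdev_type, total in max_instances.items():
        blocked = active_type is not None and mdev_type != active_type
        inventories[mdev_type] = {
            "total": total,
            "reserved": total if blocked else 0,
        }
    return inventories


# Hypothetical numbers taken from the V100 table in this thread.
v100 = {"V100D-16Q": 2, "V100D-8Q": 4, "V100D-1Q": 32}
inventories = build_inventories(v100, active_type="V100D-16Q")
```

For a device whose mdev types are independent pools, the same function would simply be called with active_type left as None.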
The sample mdpy kernel module, which creates a basic virtual graphics device, supports multiple mdev types for different resolutions: https://github.com/torvalds/linux/blob/f97c81dc6ca5996560b3944064f63fc87eb18d00/samples/vfio-mdev/mdpy.c I believe it also supports consuming each mdev type independently. So if you don't want to use real hardware as an example, there are at least sample devices that support having multiple active mdevs. I would also suggest we use this device for testing in the upstream gate.
I started creating a job to test nova's vGPU support with this sample device a few months back, but we need to make a few small changes to make it work, and I was not sure it was appropriate to modify the nova code just to get CI working with a fake device. Currently we make 1 assumption, that the parent of the mdev is a PCI device, which is not true in the kernel module case.
But from a cyborg perspective you can avoid that mistake, since mdevs can be created for devices on any bus, like USB or UPI as well as PCIe. [Feng, Shaohe] Good suggestion.
3. The driver should also support a function to create a certain mdev type, such as (V100D-1Q, V100D-2Q). Secondly, we need an mdev extended ARQ (it can be a plugin):
No, it should not be a plugin. It should just be another attachment type.
Here is an example for the FPGA ext ARQ: https://review.opendev.org/#/c/681005/26/cyborg/objects/extarq/fpga_ext_arq.py@206 The difference is that we replace _do_programming with _do_create_mdev. _do_programming is used to create a new FPGA function. _do_create_mdev is used to create an mdev of a new type; it will call the implementation function in the vendor driver.
Well, not really. It will echo a uuid into a file in sysfs, and that will trigger the vendor driver to create the mdev, but cyborg is not actually linking to the vendor driver and invoking a C function directly.
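A minimal sketch of that sysfs write in Python, assuming the standard kernel layout <parent>/mdev_supported_types/<type>/create. The helper name create_mdev and its parameters are illustrative only and are not the existing nova or cyborg function; on a real host the write needs root privileges.

```python
import os
import uuid


def create_mdev(parent_path, mdev_type, mdev_uuid=None):
    """Ask the kernel to create an mdev by writing a UUID to 'create'.

    parent_path is the sysfs directory of the parent device, for example
    /sys/devices/virtual/mtty/mtty. The vendor driver reacts to the
    write; no vendor function is called from Python.
    """
    mdev_uuid = mdev_uuid or str(uuid.uuid4())
    create_file = os.path.join(
        parent_path, "mdev_supported_types", mdev_type, "create")
    with open(create_file, "w") as f:
        f.write(mdev_uuid)
    return mdev_uuid
```

Passing the uuid in explicitly, rather than always generating one, is what would let the caller keep the mdev uuid stable and correlate it with its own objects.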
Lastly, we need to support an mdev handler for XML generation in nova; we can refer to the cyborg PCI handler in nova.
Yes, I brought this up in the nova PTG as something we will likely need to add support for. As I was suggesting above, this can be done by adding a new attachment type, mdev, that just contains the uuid instead of the PCI address.
So after the above changes, the admin can create device profiles for different SLAs, such as:
{"name": "Gold_vGPU", "groups": [{"resources:vGPU_BUFFERS": "16", "traits: V100D-16Q": "required"}]}
and
{"name": "Iron_vGPU", "groups": [{"resources:vGPU_BUFFERS": "1", "traits: V100D-1Q": "required"}]}
Then a tenant can use Gold_vGPU to create a VM with a V100D-16Q vGPU, and another tenant can use Iron_vGPU to create a VM with a V100D-1Q vGPU.
It cannot do this on the same physical GPUs, but yes, that could work. Or you could do {"name": "Gold_vGPU", "groups": [{"resources:CUSTOM_V100D-16Q": "1"}]}
Currently for nova we just do resources:VGPU and you can optionally do trait:CUSTOM_
By the way, we cannot use standard resource classes or traits for the mdev types, as these are arbitrary strings chosen by the vendor that can potentially change based on kernel or driver version. So we should not add them to os-traits or os-resource-classes and should instead use CUSTOM_ resource classes or traits for them. [Feng, Shaohe] Yes, use CUSTOM_ for them.
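As a rough illustration of mapping a vendor mdev type string to a CUSTOM_ name, here is a small Python sketch. The function names are hypothetical. The normalization (uppercase, anything other than A-Z, 0-9 and underscore replaced by an underscore) is an assumption based on the character set placement accepts for custom resource classes, which is why the hyphen in a type such as V100D-16Q becomes an underscore here.

```python
import re


def custom_resource_class(mdev_type):
    """Map a vendor mdev type string to a CUSTOM_ resource class name."""
    normalized = re.sub(r"[^A-Z0-9_]", "_", mdev_type.upper())
    return "CUSTOM_" + normalized


def device_profile(name, mdev_type, count=1):
    """Build a device profile body requesting `count` of one mdev type."""
    resource = "resources:" + custom_resource_class(mdev_type)
    return {"name": name, "groups": [{resource: str(count)}]}


gold = device_profile("Gold_vGPU", "V100D-16Q")
```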
When the ARQ is bound during VM creation, Cyborg will call the vendor driver to create the expected mdev vGPU. And these 2 mdev vGPUs can be on the same physical GPU card.
The mdev extended ARQ and the vendor driver can be plugins; they are loosely coupled with the upstream code.
It should not be a plugin. It is a generic virtualisation attachment model that can be used by any device. We should standardise the attachment handle in core cyborg and add support for that attachment model in nova. We already have support for generating the mdev XML, so we would only need to wire up the handling for the attachment type in the code that currently handles the PCI attachment type. [Feng, Shaohe] Yes, we will support a standard generic ARQ, and only extend the ARQ for some special accelerators; FPGA is an example.
I am not sure we need to extend the ARQ for FPGA, but perhaps at some point. Nova does not support plugins in general, so as long as the info we need to request or receive does not vary based on cyborg plugins I guess it could be OK, but outside of the driver layer I would personally avoid introducing plugins to cyborg.
So downstream can take the upstream code and customize its own mdev extended ARQ and vendor driver. Here vGPU is just an example; it can be other mdev devices.
Yes, and because this can be used for other devices, I would counter-propose that we create a stateless mdev driver to cover devices that do not require programming or state management, with a config-driven interface to map mdev types to custom resource classes and/or traits. We also need to declare, per device, whether it supports independent pools of each mdev type or whether they consume the same resource, i.e. does it work like NVIDIA's V100s, where you can only have 1 mdev type per physical device, or does it work like the sample devices, where you can have 1 device and multiple mdev types that can be consumed in parallel.
Both approaches are valid, although I personally prefer independent pools since that is easier to reason about.
You could also support a non-config-driven approach where we use attributes on the deployable to describe the mapping of the mdev type to resource class and whether it is independently consumable, I guess, but that seems much more cumbersome to manage. [Feng, Shaohe] Both the stateless mdev driver and the non-config-driven approach can be supported. If these cannot satisfy users, they can add their own special mdev driver.
Well, if I put my downstream hat on, in terms of productization we are currently debating whether we want to include cyborg in a future release of Red Hat OpenStack Platform. At the moment it is not planned for our next major release, which is OSP 17, and it is under consideration for OSP 18. One of the concerns we have with adding cyborg to a future release is the lack of a config-driven approach. It is not a blocker, but an API-only approach, while it has some advantages, also has several drawbacks, not least of which are supportability and day 1 and day 2 operational complexity.
For example, we have a tool that customers use when reporting bugs, called sosreport, which automates the collection of logs, config and other system information. It can also invoke some commands like virsh list, but adding a new config file and log to collect is significantly less work than adding a new command that needs to discover the node uuid and then query the placement and cyborg APIs to determine what accelerators are available and how they are configured. So API-only configured services are harder for distros to support from a day 2 perspective when things go wrong. From a day 1 perspective it is also simpler for installer tools to template out a config file than it is to invoke commands imperatively against an API.
So I understand that from an operator's perspective invoking an API to do all config management remotely might be quite appealing, but it comes with trade-offs. Upstream should not be governed by downstream distro concerns, but we should be mindful not to make it overly hard to deploy and manage the service, as that raises the barrier to integration, and if that barrier is too high it can result in the service not being supported in the long run. That said, having a service that you just deploy with no configuration to do would be nice too, but if we tell our users that after they deploy cyborg they must then iterate over every deployable and decide whether they should enable it and what attributes to add to it to schedule correctly, I think the complexity might be too high for many.
BR Shaohe Feng
On Fri, 2020-06-26 at 12:19 +0000, Feng, Shaohe wrote:
Yes, mtty is a real mdev. We can first provide a real mdev driver to support it. If the fake driver can easily be extended to mdev, it will help nova support the new mdev attachment type more quickly. There seems to be no other benefit.
So is there any suggestion about how the driver can support both mixed-type and single-type mdev devices? IMHO, for QoS/SLA, it is possible to support different mdev types in one physical device.
Yes, it is possible to do it.
If we have a whitelist of mdev types or devices that support mdevs, we can just tag them as supporting multiple types or a single type. For example, we could add something like this:
[mdevs]
# devices provides a mapping between a device and an alias for that device.
# We could use something other than a sysfs path, but that is not important in this example.
# You could have many devices map to the same alias if you have multiple of the same accelerator.
devices={/sys/devices/virtual/mtty:mtty, /sys/devices/virtual/mdpy:mdpy}
# device_capabilities is a way to declare attributes or traits for the deployable.
device_capabilities={mdpy:{multi_mdev:true, traits:[CUSTOM_WHATEVER]}, mtty:{multi_mdev:false}}
# mdev_types maps the alias to the set of mdev types that are allowed for that alias.
mdev_types={mtty:[mtty2], mdpy:[mdpy-hd, mdpy-vga]}
In reality, if this were a QAT device, for example, it has multiple encryption and compression engines. While they share PCI bandwidth, my understanding is that the asymmetric crypto hardware on QAT operates independently of the compression hardware, so if Intel implemented mdev support for QAT then you could do something like this:
[mdevs]
devices={/sys/bus/pci/...:qat}
device_capabilities={qat:{multi_mdev:true}}
mdev_types={qat:[qat-crypto, qat-compression]}
If NVIDIA supports multiple mdev types for GPUs in the future, you could do the exact same thing. This is just an idea, but that is more or less the direction I would go in. There are other ways to do it, but I think you need 3 pieces of info:
1. which devices the driver is allowed to manage (this addresses the host usage of a device vs a guest's usage)
2. whether the device supports multiple mdev types, and optionally some other metadata
3. which of the available mdev_types are allowed to be used.
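To make the 3 pieces of info concrete, here is a hedged Python sketch of how a driver might hold them once parsed. The option names mirror the example above, but the classes and functions (MdevDeviceConfig, load_mdev_config, is_allowed) are hypothetical and nothing like them exists in cyborg today; parsing of the config file itself is left out.

```python
from dataclasses import dataclass, field


@dataclass
class MdevDeviceConfig:
    alias: str
    sysfs_paths: list = field(default_factory=list)    # 1. devices managed
    multi_mdev: bool = False                            # 2. capabilities
    traits: list = field(default_factory=list)
    allowed_types: list = field(default_factory=list)  # 3. allowed mdev types


def load_mdev_config(devices, capabilities, mdev_types):
    """Combine the three options into one record per device alias."""
    configs = {}
    for path, alias in devices.items():
        cfg = configs.setdefault(alias, MdevDeviceConfig(alias=alias))
        cfg.sysfs_paths.append(path)
    for alias, cfg in configs.items():
        caps = capabilities.get(alias, {})
        cfg.multi_mdev = bool(caps.get("multi_mdev", False))
        cfg.traits = list(caps.get("traits", []))
        cfg.allowed_types = list(mdev_types.get(alias, []))
    return configs


def is_allowed(configs, alias, mdev_type, active_types):
    """Can mdev_type be created, given the types already active?"""
    cfg = configs.get(alias)
    if cfg is None or mdev_type not in cfg.allowed_types:
        return False
    if cfg.multi_mdev:
        return True
    # Single-type device: only the type already in use may be created.
    return all(t == mdev_type for t in active_types)


configs = load_mdev_config(
    devices={"/sys/devices/virtual/mtty": "mtty",
             "/sys/devices/virtual/mdpy": "mdpy"},
    capabilities={"mdpy": {"multi_mdev": True, "traits": ["CUSTOM_WHATEVER"]},
                  "mtty": {"multi_mdev": False}},
    mdev_types={"mtty": ["mtty2"], "mdpy": ["mdpy-hd", "mdpy-vga"]},
)
```

A device that is not listed in devices is never touched, which is how the host-usage case would be handled in this sketch.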
-----Original Message----- From: Sean Mooney <smooney@redhat.com> Sent: 26 June 2020 19:15 To: yumeng bao <yumeng_bao@yahoo.com>; Feng, Shaohe <shaohe.feng@intel.com>; openstack-discuss@lists.openstack.org; Wang, Xin-ran <xin-ran.wang@intel.com> Cc: shhfeng@126.com Subject: Re: [cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)
On Fri, 2020-06-26 at 05:39 +0000, yumeng bao wrote:
Hi Shaohe and Sean,
Thanks for bringing up this discussion.
1. About the mdev whitelist: I agree and support the idea that cyborg should create a generic mdev driver and support the whitelist configuration of allowed mdev types and devices.
2. Report the number of allocable resources to the resource provider inventory: I kind of prefer that we report 1 inventory per VGPU. That is to say, the admin configures the supported mdev type for the device in the cyborg.conf file, then cyborg reports the available_instances of the single selected mdev type to the resource provider inventory.
For the alternative, if we report 1 inventory per mdev type, that means: 1) we need to report all of them, but only one inventory is available, and we set reserved = total for all the other mdev types; 2) when the admin reconfigures the mdev type, we still need to update the newly selected type to available while the others remain reserved. This sounds like we would report quite a lot of redundant data to placement, when actually we care about the inventory of the selected type more than the other types.
3.1 driver should also support a function to create certain mdev type, such as (V100D-1Q, V100D-2Q,)
Well, not really. It will echo a uuid into a file in sysfs, and that will trigger the vendor driver to create the mdev, but cyborg is not actually linking to the vendor driver and invoking a C function directly.
Yes, in the cyborg management cycle, when reporting an mdev device, it just generates uuid(s) for this mdev but does not really create the mdev. I think we can create mdevs in two possible ways:
Yes, I think both will work.
- solution1: A certain mdev type will not actually be created until nova-compute starts to write the mdev info to the XML and the virt driver spawns the VM. We can extend the current accel_info attachment type to handle both add_accel_pci_device and add_accel_mdev_device in nova/virt/libvirt/driver.py, where before add_accel_mdev_device we can create the mdev by just calling nova.privsep.libvirt.create_mdev there.
In this case we would probably want the mdev attachment type to contain the uuid of the mdev to create.
- solution2: During the ARQ binding process, cyborg creates the mdev in a way similar to nova.privsep.libvirt.create_mdev.
And in this case the attachment type would contain the mdev path or uuid. This is my preference, as it allows cyborg to create stable mdevs across reboots if it wants to. The mdev uuid can change in nova's current code. At the moment that does not matter too much, but it might be nice to, for example, use the deployable object uuid as the mdev uuid; that way it would be easy to correlate between the two.
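For reference, an attachment that carries only the mdev uuid is enough to build the libvirt device element. The following Python sketch generates the hostdev XML that libvirt documents for mediated devices (type "mdev", model "vfio-pci", address identified by uuid). The function name is made up for this example and is not nova's actual XML config class.

```python
import uuid
import xml.etree.ElementTree as ET


def mdev_hostdev_xml(mdev_uuid):
    """Build a libvirt <hostdev> element for an existing mdev."""
    uuid.UUID(mdev_uuid)  # raises ValueError if the string is not a uuid
    hostdev = ET.Element(
        "hostdev", mode="subsystem", type="mdev", model="vfio-pci")
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(source, "address", uuid=mdev_uuid)
    return ET.tostring(hostdev, encoding="unicode")


xml_text = mdev_hostdev_xml("83b8f4f2-509f-382f-3c1e-e6bfe0fa1001")
```

The uuid used here is the one from the mtty example later in this thread.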
3.2 One more thing needs to be mentioned: we should avoid conflict with the existing mdev management logic in nova, so we may need to introduce an acc_mdev_flag here to check whether the mdev request is from cyborg or not.
3.3 xml generation
Lastly, we need to support an mdev handler for XML generation in nova; we can refer to the cyborg PCI handler in nova.
Yes, I brought this up in the nova PTG as something we will likely need to add support for. As I was suggesting above, this can be done by adding a new attachment type, mdev, that just contains the uuid instead of the PCI address.
+1.agree. supporting the new attachment type makes sense.
4. mdev fake driver support
IMHO, cyborg can support an mdev fake driver (similar to the current FPGA fake driver) for the mdev attachment type support in nova. Or maybe we can extend the current fake driver to support both mdev and PCI devices.
In parallel, if the fake driver can be extended to support mdev attachments, we can also use that as another way to validate the interaction. The difference between the two approaches is that using the mdpy or mtty kernel modules would allow us to actually use the mdev attachment and add the mdev to a VM, whereas with the fake driver approach we would not add the fake mdev to the libvirt XML or to the VM. The mdpy and mtty sample kernel modules are intended for use in testing the mdev framework without any hardware requirement, so I think that is perfect for our CI system. It is more work than just extending the fake driver but it has several benefits.
@Sean: "The mdpy and mtty sample kernel modules are intended for use in testing the mdev framework without any hardware requirement, so I think that is perfect for our CI system." Does this mean this can also be an easier way to do third-party CI for mdev devices?
I think if we use the mdpy or mtty sample kernel modules we would not need a third-party CI and could fully test mdev support in the first-party CI. Third-party CI would then only be required for stateful mdev devices that need a custom driver to do some initial programming or cleanup of the device that the generic mdev driver could not do.
The mtty sample module https://github.com/torvalds/linux/blob/master/samples/vfio-mdev/mtty.c emulates a virtual serial port that acts basically as an echo server.
If you create the device
# echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001" > \
    /sys/devices/virtual/mtty/mtty/mdev_supported_types/mtty-2/create
and add it to qemu
-device vfio-pci,\
 sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
then it will show up in the guest as a pci device with vendor and product id 4348:3253
# lspci -s 00:05.0 -xxvv
00:05.0 Serial controller: Device 4348:3253 (rev 10) (prog-if 02 [16550])
        Subsystem: Device 4348:3253
        Physical Slot: 5
        Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 10
        Region 0: I/O ports at c150 [size=8]
        Region 1: I/O ports at c158 [size=8]
        Kernel driver in use: serial
00: 48 43 53 32 01 00 00 02 10 02 00 07 00 00 00 00
10: 51 c1 00 00 59 c1 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 48 43 53 32
30: 00 00 00 00 00 00 00 00 00 00 00 00 0a 01 00 00
In the Linux guest VM, dmesg output for the device is as follows:
serial 0000:00:05.0: PCI INT A -> Link[LNKA] -> GSI 10 (level, high) -> IRQ 10
0000:00:05.0: ttyS1 at I/O 0xc150 (irq = 10) is a 16550A
0000:00:05.0: ttyS2 at I/O 0xc158 (irq = 10) is a 16550A
You can then use minicom or any serial console application to write data to ttyS1 or ttyS2 in the guest, and the host mdev module will loop it back, acting as an echo server.
So we should be able to add an optional tempest test to the cyborg tempest plugin to fully validate the end-to-end functioning of generic mdev support, including sshing into a VM that is using an mtty serial port and validating that it loops back data, fully testing the feature in the first-party CI.
All we need to do is compile and modprobe the mtty device using a devstack plugin in the gate job.
best regards, Yumeng
On Thursday, June 25, 2020, 08:43:11 PM GMT+8, Sean Mooney <smooney@redhat.com> wrote:
On Thu, 2020-06-25 at 00:31 +0000, Feng, Shaohe wrote:
Hi Yumeng and Xin-ran: Not sure whether you noticed that Sean Mooney has brought up nova supporting an mdev attachment type in the nova PTG, as follows:
Yes, I brought this up in the nova PTG as something we will likely need to add support for. As I was suggesting above, this can be done by adding a new attachment type, mdev, that just contains the uuid instead of the PCI address.
IMHO, cyborg can support an mdev fake driver (similar to the current FPGA fake driver) for the mdev attachment type support in nova. Or maybe we can extend the current fake driver to support both mdev and PCI devices.
Cyborg could support a fake mdev driver, yes. However, I do think adding support to deploy with the mdpy driver also makes sense. VMs booted with a GPU backed by an mdpy mdev actually get a functional frame buffer and you can view it in the default VNC console. I have manually verified this.
I think this is out of scope for cyborg, however. What I was planning to do, if I ever get the time, is to create an mdpy devstack plugin that would compile and deploy the kernel module. With that we can just add the plugin to a zuul job and then whitelist the mdev types either in nova or cyborg to do testing with it and the proposed stateless mdev driver.
BR Shaohe
-----Original Message----- From: Sean Mooney <smooney@redhat.com> Sent: 23 June 2020 21:38 To: Feng, Shaohe <shaohe.feng@intel.com>; openstack-discuss@lists.openstack.org Cc: yumeng_bao@yahoo.com; shhfeng@126.com Subject: Re: [cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)
On Tue, 2020-06-23 at 12:25 +0000, Feng, Shaohe wrote:
-----Original Message----- From: Sean Mooney <smooney@redhat.com> Sent: 23 June 2020 19:48 To: Feng, Shaohe <shaohe.feng@intel.com>; openstack-discuss@lists.openstack.org Cc: yumeng_bao@yahoo.com; shhfeng@126.com Subject: Re: [cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)
On Tue, 2020-06-23 at 05:50 +0000, Feng, Shaohe wrote:
Hi all,
Currently openstack support vGPU as follow: https://docs.openstack.org/nova/latest/admin/virtual-gpu.html
In order to support it, the admin should plan ahead and configure the vGPU before deployment, as follows: https://docs.openstack.org/nova/latest/configuration/config.html#devices.enabled_vgpu_types This is very inconvenient for the administrator, and this method has the limitation that the same PCI address cannot provide two different types.
That is a matter of perspective. There are those who prefer to check all of their configuration into git and have a declarative deployment, and those who wish to drive everything via the API. For the former, having to configure available resources via cyborg's API or placement would be considered very inconvenient.
Cyborg, as an accelerator management tool, is more suitable for mdev device management.
Maybe, but maybe not. I do think that cyborg should support mdevs. I do not think we should have a dedicated vGPU mdev driver, however. I think we should create a stateless mdev driver that uses a similar whitelist of allowed mdev types and devices.
We did briefly discuss adding generic mdev support to nova in a future release (W), or whether we should delegate that to cyborg, but I would hesitate to do that if cyborg continues its current design where drivers have no configuration element to whitelist devices, as it makes it much harder for deployment tools to properly configure a host declaratively. [Feng, Shaohe] We do support config; for example, in our demo for FPGA pre-programming we support config for our new drivers. As for other accelerators, the infrastructure may also need accelerators, not only VMs. For example, cinder can use QAT for compression/crypto, and VMs can also use QAT. We need to configure which QATs are for the infrastructure and which are for VMs.
Cool suggestions! Thanks!
- Solution 2: during the ARQ binding process, Cyborg creates the mdev in a way similar to Nova.privsep.libvirt.create_mdev.
And in this case the attachment type would contain the mdev path or UUID. This is my preference, as it allows Cyborg to create stable mdevs across reboots if it wants to. The mdev UUID can change in Nova's current code. At the moment that does not matter too much, but it might be nice to, for example, use the deployable object UUID as the mdev UUID; that way it would be easy to correlate the two.
Although I don't know why the mdev UUID can change in Nova's current code, I think Cyborg should use the attach_handle UUID as the mdev UUID, because that is the atomic allocatable vGPU resource, while the deployable UUID refers to the physical GPU unit. When Cyborg reports the following device with mdev_type='nvidia-231', it reports 1 deployable_uuid and 8 attach_handle UUIDs.
[root@localhost 0000:af:00.0]# lspci -nnn -D|grep 1eb8
0000:af:00.0 3D controller [0302]: NVIDIA Corporation TU104GL [Tesla T4] [10de:1eb8] (rev a1)
[root@localhost mdev_supported_types]# cat nvidia-231/available_instances
8
3. I am planning to start working on mdev support once I have time. Do we need a specification in Nova for mdev support, or should everything be covered in Cyborg?
Regards, Yumeng
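The reporting model described above (1 deployable per physical GPU, one attach handle per available mdev instance) can be sketched roughly as follows. This is only an illustration: the function names and the shape of the returned dictionary are assumptions, not Cyborg's actual driver interface.

```python
import uuid
from pathlib import Path


def read_available_instances(type_dir):
    """Return the integer stored in <type_dir>/available_instances."""
    return int(Path(type_dir, "available_instances").read_text().strip())


def build_report(pci_address, mdev_type, available):
    """Hypothetical report: one deployable, one attach handle per mdev slot."""
    return {
        # The deployable stands for the physical GPU.
        "deployable_uuid": str(uuid.uuid4()),
        "pci_address": pci_address,
        "mdev_type": mdev_type,
        # Each attach handle UUID could later be reused as the mdev UUID.
        "attach_handle_uuids": [str(uuid.uuid4()) for _ in range(available)],
    }
```

For the Tesla T4 above, build_report("0000:af:00.0", "nvidia-231", 8) would yield 1 deployable UUID and 8 attach handle UUIDs.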
On Jun 26, 2020, at 9:28 PM, Sean Mooney <smooney@redhat.com> wrote:
On Fri, 2020-06-26 at 12:19 +0000, Feng, Shaohe wrote:
Yes, mtty is a real mdev. We can support a real mdev driver for it first. If the fake driver can easily be extended to mdev, it will help Nova support the new mdev attachment type more quickly. There seems to be no other benefit.
So, any suggestion about how the driver can support both mixed-type mdevs and single-type mdevs? IMHO, for QoS/SLA, it is possible to support different mdev types on one physical device.
Yes, it is possible to do that.
If we have a whitelist of mdev types or devices that support mdevs, we can just tag them as supporting multiple types or a single type.
For example, we could add something like this:
[mdevs]
# devices provides a mapping between a device and an alias for that device.
# We could use something other than a sysfs path, but that is not important in this example.
# You could have many devices map to the same alias if you have multiple of the same accelerator.
devices={/sys/devices/virtual/mtty:mtty, /sys/devices/virtual/mdpy:mdpy}
# device_capabilities is a way to declare attributes or traits for the deployable.
device_capabilities={mdpy:{multi_mdev:true, traits:[CUSTOM_WHATEVER]}, mtty:{multi_mdev:false}}
# mdev_types maps the alias to the set of mdev types that are allowed for that alias.
mdev_types={mtty:[mtty2], mdpy:[mdpy-hd, mdpy-vga]}
So in reality, if this was a QAT device, for example, it has multiple encryption and compression engines. While they share PCI bandwidth, my understanding is that the asymmetric crypto hardware on QAT operates independently of the compression hardware, so if Intel implemented mdev support for QAT you could do something like this:
[mdevs]
devices={/sys/bus/pci/...:qat}
device_capabilities={qat:{multi_mdev:true}}
mdev_types={qat:[qat-crypto, qat-compression]}
If NVIDIA supports multiple mdev types for GPUs in the future, you could do the exact same thing.
This is just an idea, but that is more or less the direction I would go in. There are other ways to do it, but I think you need 3 pieces of information:
1. which devices the driver is allowed to manage (this addresses host usage of a device versus guest usage);
2. whether the device supports multiple mdevs, and optionally some other metadata;
3. which of the available mdev_types are allowed to be used.
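A minimal sketch of how a driver might combine those three pieces of information once the config has been parsed. The option names mirror the [mdevs] example above, but the MdevAlias class, build_aliases and is_allowed are hypothetical names for illustration and do not exist in Cyborg.

```python
from dataclasses import dataclass, field


@dataclass
class MdevAlias:
    """Everything the whitelist declares about one device alias."""
    name: str
    device_paths: list = field(default_factory=list)
    multi_mdev: bool = False
    traits: list = field(default_factory=list)
    allowed_types: list = field(default_factory=list)


def build_aliases(devices, capabilities, mdev_types):
    """Merge the three config mappings into one record per alias."""
    aliases = {}
    # 1. Which devices the driver may manage (device path -> alias).
    for path, alias in devices.items():
        aliases.setdefault(alias, MdevAlias(name=alias)).device_paths.append(path)
    for alias, entry in aliases.items():
        # 2. Whether the device supports multiple mdev types, plus metadata.
        caps = capabilities.get(alias, {})
        entry.multi_mdev = bool(caps.get("multi_mdev", False))
        entry.traits = list(caps.get("traits", []))
        # 3. Which of the available mdev types may be used.
        entry.allowed_types = list(mdev_types.get(alias, []))
    return aliases


def is_allowed(aliases, device_path, mdev_type):
    """True only if the device is whitelisted and the type is permitted."""
    for entry in aliases.values():
        if device_path in entry.device_paths:
            return mdev_type in entry.allowed_types
    return False
```

A device that is not listed under devices is never managed, which is what keeps host-reserved devices out of the guest pool.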
-----Original Message----- From: Sean Mooney <smooney@redhat.com> Sent: 2020年6月26日 19:15 To: yumeng bao <yumeng_bao@yahoo.com>; Feng, Shaohe <shaohe.feng@intel.com>; openstack-discuss@lists.openstack.org; Wang, Xin-ran <xin-ran.wang@intel.com> Cc: shhfeng@126.com Subject: Re: [cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)
On Fri, 2020-06-26 at 05:39 +0000, yumeng bao wrote: Hi Shaohe and Sean,
Thanks for bringing up this discussion.
1. About the mdev whitelist: I agree with and support the idea that Cyborg should create a generic mdev driver and support whitelist configuration of the allowed mdev types and devices.
2. Report the number of allocatable resources to the resource provider inventory. I prefer that we report 1 inventory per VGPU. That is to say, the admin configures the supported mdev type for the device in the cyborg.conf file, and Cyborg then reports the available_instances of the single selected mdev type to the resource provider inventory.
As for the alternative, if we report 1 inventory per mdev type, that means: 1) we need to report all types but only have one available inventory, and set reserved = total for all the other mdev types; 2) when the admin reconfigures the mdev type, we still need to update the newly selected type to available, while the others remain reserved. This sounds like we would report quite a lot of redundant data to placement, when actually we care more about the inventory of the selected type than about the other types.
3.1 The driver should also support a function to create a certain mdev type, such as (V100D-1Q, V100D-2Q).
Well, not really. It will echo a UUID into a file in sysfs, which will trigger the vendor driver to create the mdev, but Cyborg is not actually linking to the vendor driver and invoking a C function directly.
Yes, in the Cyborg management cycle, when reporting an mdev device, it just generates UUID(s) for this mdev but does not actually create it. I think we can create mdevs in two possible ways:
Yes, I think both will work.
- Solution 1: a given mdev type will not actually be created until nova-compute starts to write the mdev info to the XML and the virt driver spawns the VM. We can extend the current accel_info attachment type to handle both add_accel_pci_device and add_accel_mdev_device in nova/virt/libvirt/driver.py, where, before add_accel_mdev_device, we can create the mdev by just calling Nova.privsep.libvirt.create_mdev.
In this case we would probably want the mdev attachment type to contain the UUID of the mdev to create.
- Solution 2: during the ARQ binding process, Cyborg creates the mdev in a way similar to Nova.privsep.libvirt.create_mdev.
And in this case the attachment type would contain the mdev path or UUID. This is my preference, as it allows Cyborg to create stable mdevs across reboots if it wants to. The mdev UUID can change in Nova's current code. At the moment that does not matter too much, but it might be nice to, for example, use the deployable object UUID as the mdev UUID; that way it would be easy to correlate the two.
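As noted earlier in the thread, creating an mdev amounts to writing a UUID into the create file of the chosen type under the parent device's mdev_supported_types directory in sysfs. A rough sketch of that step, with a caller-supplied UUID so that the mdev can stay stable across reboots; create_mdev here is a hypothetical helper, not the Nova privsep function, and on a real host it would need root privileges.

```python
import uuid
from pathlib import Path


def create_mdev(parent_dir, mdev_type, mdev_uuid=None):
    """Ask the vendor driver to create an mdev by writing a UUID to sysfs.

    parent_dir is the sysfs directory of the parent device, for example
    /sys/devices/virtual/mtty/mtty. If mdev_uuid is given (such as an
    attach handle or deployable UUID), it is validated and reused.
    """
    if mdev_uuid is None:
        mdev_uuid = str(uuid.uuid4())
    else:
        # Raises ValueError if the caller passed something that is not a UUID.
        mdev_uuid = str(uuid.UUID(str(mdev_uuid)))
    create_file = Path(parent_dir, "mdev_supported_types", mdev_type, "create")
    create_file.write_text(mdev_uuid)
    return mdev_uuid
```

Passing a stored UUID makes the call repeatable after a host reboot, which is the property that makes it easy to correlate the mdev with the Cyborg object.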
3.2 One more thing needs to be mentioned: we should avoid conflict with the existing mdev management logic in Nova, so we may need to introduce an acc_mdev_flag here to check whether the mdev request is from Cyborg or not.
3.3 XML generation
Finally, we need to support an mdev handler for XML generation in Nova; we can refer to the Cyborg PCI handler in Nova.
Yes, I brought this up at the Nova PTG as something we will likely need to add support for. As I was suggesting above, this can be done by adding a new attachment type, mdev, that just contains the UUID instead of the PCI address.
+1, agreed. Supporting the new attachment type makes sense.
4. mdev fake driver support
IMHO, Cyborg can support an mdev fake driver (similar to the current FPGA fake driver) for the mdev attachment type support in Nova. Or maybe we can extend the current fake driver to support both mdev and PCI devices.
In parallel, if the fake driver can be extended to support mdev attachments, we can also use that as another way to validate the interaction. The difference between the two approaches is that using the mdpy or mtty kernel modules would allow us to actually use the mdev attachment and add the mdev to a VM, whereas with the fake driver approach we would not add the fake mdev to the libvirt XML or to the VM. The mdpy and mtty sample kernel modules are intended for testing the mdev framework without any hardware requirement, so I think that is perfect for our CI system. It is more work than just extending the fake driver, but it has several benefits.
@Sean: "The mdpy and mtty sample kernel modules are intended for use in testing the mdev framework without any hardware requirement, so I think that is perfect for our CI system." Does this mean this can also be an easier way to do third-party CI for mdev devices?
I think if we use the mdpy or mtty sample kernel modules we would not need a third-party CI and could fully test mdev support in the first-party CI. Third-party CI would then only be required for stateful mdev devices that need a custom driver to do some initial programming or cleanup of the device that the generic mdev driver could not do.
The mtty sample module https://github.com/torvalds/linux/blob/master/samples/vfio-mdev/mtty.c emulates a virtual serial port that acts basically as an echo server.
If you create the device
# echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001" > \
    /sys/devices/virtual/mtty/mtty/mdev_supported_types/mtty-2/create
and add it to QEMU
-device vfio-pci,\
    sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
then it will show up in the guest as a PCI device with vendor and product ID 4348:3253:
# lspci -s 00:05.0 -xxvv
00:05.0 Serial controller: Device 4348:3253 (rev 10) (prog-if 02 [16550])
        Subsystem: Device 4348:3253
        Physical Slot: 5
        Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 10
        Region 0: I/O ports at c150 [size=8]
        Region 1: I/O ports at c158 [size=8]
        Kernel driver in use: serial
00: 48 43 53 32 01 00 00 02 10 02 00 07 00 00 00 00
10: 51 c1 00 00 59 c1 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 48 43 53 32
30: 00 00 00 00 00 00 00 00 00 00 00 00 0a 01 00 00
In the Linux guest VM, dmesg output for the device is as follows:
serial 0000:00:05.0: PCI INT A -> Link[LNKA] -> GSI 10 (level, high) -> IRQ 10
0000:00:05.0: ttyS1 at I/O 0xc150 (irq = 10) is a 16550A
0000:00:05.0: ttyS2 at I/O 0xc158 (irq = 10) is a 16550A
You can then use minicom or any serial console application to write data to ttyS1 or ttyS2 in the guest, and the host mdev module will loop it back, acting as an echo server.
So we should be able to add an optional tempest test to the Cyborg tempest plugin to fully validate the end-to-end functioning of generic mdev support, including SSHing into a VM that is using an mtty serial port and validating that it loops back data, fully testing the feature in the first-party CI.
All we need to do is compile and modprobe the mtty device using a devstack plugin in the gate job.
best regards, Yumeng
On Thursday, June 25, 2020, 08:43:11 PM GMT+8, Sean Mooney <smooney@redhat.com> wrote:
On Thu, 2020-06-25 at 00:31 +0000, Feng, Shaohe wrote:
Hi Yumeng and Xin-ran: Not sure whether you noticed that Sean Mooney brought up Nova support for the mdev attachment type at the Nova PTG, as follows:
Yes, I brought this up at the Nova PTG as something we will likely need to add support for. As I was suggesting above, this can be done by adding a new attachment type, mdev, that just contains the UUID instead of the PCI address.
IMHO, Cyborg can support an mdev fake driver (similar to the current FPGA fake driver) for the mdev attachment type support in Nova. Or maybe we can extend the current fake driver to support both mdev and PCI devices.
Cyborg could support a fake mdev driver, yes; however, I do think adding support to deploy with the mdpy driver also makes sense. A VM booted with a GPU backed by an mdpy mdev actually gets a functional frame buffer, and you can view it in the default VNC console. I have manually verified this.
I think this is out of scope for Cyborg, however. What I was planning to do, if I ever get the time, is to create an mdpy devstack plugin that would compile and deploy the kernel module. With that we can just add the plugin to a Zuul job and then whitelist the mdev types either in Nova or Cyborg to do testing with it and the proposed stateless mdev driver.
In parallel, if the fake driver can be extended to support mdev attachments, we can also use that as another way to validate the interaction. The difference between the two approaches is that using the mdpy or mtty kernel modules would allow us to actually use the mdev attachment and add the mdev to a VM, whereas with the fake driver approach we would not add the fake mdev to the libvirt XML or to the VM. The mdpy and mtty sample kernel modules are intended for testing the mdev framework without any hardware requirement, so I think that is perfect for our CI system. It is more work than just extending the fake driver, but it has several benefits.
BR Shaohe
-----Original Message----- From: Sean Mooney <smooney@redhat.com> Sent: 2020年6月23日 21:38 To: Feng, Shaohe <shaohe.feng@intel.com>; openstack-discuss@lists.openstack.org Cc: yumeng_bao@yahoo.com; shhfeng@126.com Subject: Re: [cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)
On Tue, 2020-06-23 at 12:25 +0000, Feng, Shaohe wrote:
-----Original Message----- From: Sean Mooney <smooney@redhat.com> Sent: 2020年6月23日 19:48 To: Feng, Shaohe <shaohe.feng@intel.com>; openstack-discuss@lists.openstack.org Cc: yumeng_bao@yahoo.com; shhfeng@126.com Subject: Re: [cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)
On Tue, 2020-06-23 at 05:50 +0000, Feng, Shaohe wrote:
Hi all,
Currently OpenStack supports vGPU as follows: https://docs.openstack.org/nova/latest/admin/virtual-gpu.html
In order to support it, the admin must plan ahead and configure the vGPU types before deployment, as follows: https://docs.openstack.org/nova/latest/configuration/config.html#devices.enabled_vgpu_types This is very inconvenient for the administrator, and this method has the limitation that the same PCI address cannot provide two different types.
That is a matter of perspective. There are those who prefer to check all of their configuration into git and have a declarative deployment, and those who wish to drive everything via the API. For those who prefer a declarative deployment, having to configure the available resources via Cyborg's API or placement would be considered very inconvenient.
Cyborg as an accelerator management tool is more suitable for mdev device management.
Maybe, but maybe not. I do think that Cyborg should support mdevs; I do not think we should have a dedicated vGPU mdev driver, however. I think we should create a stateless mdev driver that uses a similar whitelist of allowed mdev types and devices.
We did briefly discuss adding generic mdev support to Nova in a future release (W), or whether we should delegate that to Cyborg, but I would hesitate to do that if Cyborg continues its current design, where drivers have no configuration element to whitelist devices, as it makes it much harder for deployment tools to properly configure a host declaratively. [Feng, Shaohe] We do support config; for example, in our demo for FPGA pre-programming, we support config for our new drivers. And for other accelerators, the infrastructure may also need accelerators, not only VMs. For example, Cinder can use QAT for compression/crypto, and VMs can also use QAT. We need to configure which QATs are for the infrastructure and which are for VMs.
Yes, QAT is a good example of where sharing between host usage (the Cinder service) and guest usage (VMs) could be required. The same could be true of GPUs: typically many servers run headless, but not always, and sometimes you will want to reserve a GPU for the host to use. NICs are another good example: when we look at generic PCI passthrough, we need to select which NICs will be used by the VMs and which will be used for host connectivity or for hardware-offloaded OVS.
One solution is as follows: First, we need a vendor driver (this can be a plugin) that is used to discover its special devices and report them to placement for scheduling. The differences from the current implementation are:
1. Report the mdev_supported_types as traits to the resource provider. How to discover a GPU type:
$ ls /sys/class/mdev_bus/*/mdev_supported_types
/sys/class/mdev_bus/0000:84:00.0/mdev_supported_types:
nvidia-35 nvidia-36 nvidia-37 nvidia-38 nvidia-39 nvidia-40 nvidia-41 nvidia-42 nvidia-43 nvidia-44 nvidia-45
So here we report nvidia-3*, nvidia-4* as traits to the resource provider.
2. Report the number of allocatable resources, instead of vGPU unit numbers, to the resource provider inventory. Example for the NVIDIA V100 PCIe card (one GPU per board):
Virtual GPU Type   Frame Buffer (Gbytes)   Maximum vGPUs per GPU   Maximum vGPUs per Board
V100D-32Q          32                      1                       1
V100D-16Q          16                      2                       2
V100D-8Q           8                       4                       4
V100D-4Q           4                       8                       8
V100D-2Q           2                       16                      16
V100D-1Q           1                       32                      32
So here we report 32G buffers (an example; it may be other resources) to the resource provider inventory.
In this specific example that would not be a good idea. The V100 does not support mixing mdev types on the same GPU, so if you allocate a V100D-16Q instance using 16G of the buffer, you cannot then allocate 2 V100D-8Q vGPU instances to consume the remaining 16G. Other mdev-based devices may not have this limitation, but NVIDIA only supports 1 active mdev type per physical GPU.
Note that the Ampere generation has a dynamic SR-IOV-based multi-instance GPU technology, which kind of allows resource-based subdivision of the device, but it does not quite work the way you are describing above.
So you can report inventories of custom resource classes for each of the mdev types, or a single inventory of VGPU with traits modelling the available mdevs.
With the trait approach, before a vGPU is allocated you report all traits, and the total count for the inventory would be the highest amount, e.g. 32 in the case above. Then, when a GPU is allocated, you need to update the reserved value and remove the other traits.
[Feng, Shaohe] For the V100, which does not support mixing mdev types, the other traits need to be removed. So, any suggestion about how a generic driver can support both mixed-type mdevs and single-type mdevs?
If you have 1 inventory per mdev type, then you set reserved = total for all inventories of the other mdev types, but there is no need for traits. [Feng, Shaohe] Oh, really sorry, I should have chosen a better example.
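The "1 inventory per mdev type, reserved = total for the others" rule can be sketched as below, using the V100 numbers from the table quoted above. The function name and the plain-dict inventory format are illustrative assumptions; a real driver would report placement inventories keyed by CUSTOM_ resource class names.

```python
# Maximum vGPUs per GPU for each V100 mdev type, from the table in this thread.
V100_MAX_INSTANCES = {
    "V100D-32Q": 1,
    "V100D-16Q": 2,
    "V100D-8Q": 4,
    "V100D-4Q": 8,
    "V100D-2Q": 16,
    "V100D-1Q": 32,
}


def build_inventories(max_instances, active_type=None):
    """Return one inventory per mdev type for a single-type device.

    Before anything is allocated (active_type is None) every type is
    available. Once a type is in use, all other types are fully reserved.
    """
    inventories = {}
    for mdev_type, total in max_instances.items():
        blocked = active_type is not None and mdev_type != active_type
        inventories[mdev_type] = {
            "total": total,
            "reserved": total if blocked else 0,
        }
    return inventories
```

For a device that supports independent pools, the same function can simply always be called with active_type=None, so no inventory is ever reserved on behalf of another type.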
The sample mdpy kernel module, which creates a basic virtual graphics device, supports multiple mdev types for different resolutions: https://github.com/torvalds/linux/blob/f97c81dc6ca5996560b3944064f63fc87eb18d00/samples/vfio-mdev/mdpy.c I believe it also supports consuming each mdev type independently. So if you don't want to use real hardware as an example, there are at least sample devices that support having multiple active mdevs. I would also suggest we use this device for testing in the upstream gate.
I started creating a job to test Nova's vGPU support with this sample device a few months back, but we need to make a few small changes to make it work, and I was not sure it was appropriate to modify the Nova code just to get CI working with a fake device. Currently we make one assumption, that the parent of the mdev is a PCI device, which is not true in the kernel module case.
But from a Cyborg perspective you can avoid that mistake, since mdevs can be created for devices on any bus, such as USB or UPI, as well as PCIe. [Feng, Shaohe] Good suggestion.
3. The driver should also support a function to create a certain mdev type, such as (V100D-1Q, V100D-2Q). Second, we need an mdev extended ARQ (it can be a plugin):
No, it should not be a plugin; it should just be another attachment type.
Here is an example for the FPGA ext ARQ: https://review.opendev.org/#/c/681005/26/cyborg/objects/extarq/fpga_ext_arq.py@206 The difference is that we replace _do_programming with _do_create_mdev. _do_programming is used to create a new FPGA function. _do_create_mdev is used to create a new mdev of a given type; it will call the implementation function in the vendor driver.
Well, not really. It will echo a UUID into a file in sysfs, which will trigger the vendor driver to create the mdev, but Cyborg is not actually linking to the vendor driver and invoking a C function directly.
Finally, we need to support an mdev handler for XML generation in Nova; we can refer to the Cyborg PCI handler in Nova.
Yes, I brought this up at the Nova PTG as something we will likely need to add support for. As I was suggesting above, this can be done by adding a new attachment type, mdev, that just contains the UUID instead of the PCI address.
So after the above changes, the admin can create different SLA device profiles such as:
{"name": "Gold_vGPU", "groups": [{"resources:vGPU_BUFFERS": "16", "traits:V100D-16Q": "required"}]}
and
{"name": "Iron_vGPU", "groups": [{"resources:vGPU_BUFFERS": "1", "traits:V100D-1Q": "required"}]}
Then a tenant can use Gold_vGPU to create a VM with a V100D-16Q vGPU, and another tenant can use Iron_vGPU to create a VM with a V100D-1Q vGPU.
That cannot be done on the same physical GPUs, but yes, that could work, or you could do {"name": "Gold_vGPU", "groups": [{"resources:CUSTOM_V100D-16Q": "1"}]}
Currently for Nova we just do resources:VGPU and you can optionally do trait:CUSTOM_
By the way, we cannot use standard resource classes or traits for the mdev types, as these are arbitrary strings chosen by vendors that can potentially change based on the kernel or driver version, so we should not add them to os-traits or os-resource-classes and should instead use CUSTOM_ resource classes or traits for them. [Feng, Shaohe] Yes, use CUSTOM_ for them.
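One way to derive CUSTOM_ names from vendor-chosen mdev type strings is sketched below. It assumes that placement's custom resource class and trait names are restricted to upper-case letters, digits and underscores after the CUSTOM_ prefix, so a hyphenated type such as V100D-16Q has to be normalized first. Both helper names are hypothetical.

```python
import re


def custom_name(mdev_type):
    """Map a vendor mdev type string to a CUSTOM_ resource class or trait name."""
    # Upper-case the type and replace anything outside A-Z, 0-9, _ with "_".
    normalized = re.sub(r"[^A-Z0-9_]", "_", mdev_type.upper())
    return "CUSTOM_" + normalized


def device_profile(name, mdev_type, count=1):
    """Build a device profile that requests `count` units of one mdev type."""
    return {
        "name": name,
        "groups": [{"resources:" + custom_name(mdev_type): str(count)}],
    }
```

With this, device_profile("Gold_vGPU", "V100D-16Q") requests 1 unit of CUSTOM_V100D_16Q. Because the vendor strings can change with the kernel or driver version, the mapping is best treated as deployment-specific configuration rather than a fixed contract.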
When the ARQ is bound during VM creation, Cyborg will call the vendor driver to create the expected mdev vGPU. And these 2 mdev vGPUs can be on the same physical GPU card.
The mdev extended ARQ and the vendor driver can be plugins; they are loosely coupled with the upstream code.
It should not be a plugin; it is a generic virtualisation attachment model that can be used by any device. We should standardise the attachment handle in core Cyborg and add support for that attachment model in Nova. We already have support for generating the mdev XML, so we would only need to wire up the handling for the attachment type in the code that currently handles the PCI attachment type. [Feng, Shaohe] Yes, we will support a standard generic ARQ, and only extend the ARQ for some special accelerators; FPGA is an example.
I am not sure we need to extend the ARQ for FPGA, but perhaps at some point. Nova does not support plugins in general, so as long as the info we need to request or receive does not vary based on Cyborg plugins, I guess it could be OK, but outside of the driver layer I would personally avoid introducing plugins to Cyborg.
So downstream can take the upstream code and customize its own mdev extended ARQ and vendor driver. Here vGPU is just an example; it can be other mdev devices.
Yes, and because this can be used for other devices, I would counter-propose that we create a stateless mdev driver to cover devices that do not require programming or state management, with a config-driven interface to map mdev types to custom resource classes and/or traits. We also need to declare, per device, whether it supports independent pools for each mdev type or whether the types consume the same resource, i.e. does it work like NVIDIA's V100s, where you can only have 1 mdev type per physical device, or does it work like the sample device, where you can have 1 device and multiple mdev types that can be consumed in parallel.
Both approaches are valid, although I personally prefer independent pools since that is easier to reason about.
You could also support a non-config-driven approach, where we use attributes on the deployable to describe the mapping of the mdev type to a resource class and whether it is independently consumable, I guess, but that seems much more cumbersome to manage. [Feng, Shaohe] Both the stateless mdev driver and the non-config-driven approach can be supported. If these cannot satisfy users, they can add their own special mdev driver.
Well, if I put my downstream hat on: in terms of productization, we are currently debating whether to include Cyborg in a future release of Red Hat OpenStack Platform. At the moment it is not planned for our next major release, which is OSP 17, and it is under consideration for OSP 18. One of the concerns we have with adding Cyborg to a future release is the lack of a config-driven approach. It is not a blocker, but an API-only approach, while it has some advantages, also has several drawbacks, not least of which are supportability and day 1 and day 2 operational complexity.
For example, we have a tool called sosreport that customers use when reporting bugs; it automates the collection of logs, config and other system information. It can also invoke some commands, like virsh list, but adding a new config file and log to collect is significantly less work than adding a new command that needs to discover the node UUID and then query the placement and Cyborg APIs to determine which accelerators are available and how they are configured. So API-only configured services are harder for distros to support from a day 2 perspective when things go wrong. From a day 1 perspective, it is also simpler for installer tools to template out a config file than to invoke commands imperatively against an API.
So I understand that, from an operator's perspective, invoking an API to do all config management remotely might be quite appealing, but it comes with trade-offs. Upstream should not be governed by downstream distro concerns, but we should be mindful not to make it overly hard to deploy and manage the service, as that raises the barrier to integration, and if the barrier is too high it can result in the service not being supported in the long run. That said, having a service that you just deploy with no configuration to do would be nice too, but if we tell our users that after they deploy Cyborg they must then iterate over every deployable and decide whether to enable it and which attributes to add to it to schedule correctly, I think the complexity might be too high for many.
BR Shaohe Feng
participants (3): Feng, Shaohe; Sean Mooney; yumeng bao