[cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)

Sean Mooney smooney at redhat.com
Fri Jun 26 11:15:19 UTC 2020


On Fri, 2020-06-26 at 05:39 +0000, yumeng bao wrote:
> Hi Shaohe and Sean,
> 
> 
> Thanks for bringing up this discussion.
> 
> 
> 1. about the mdev whitelist, I agree and support the idea that cyborg should create a generic mdev driver and support
> the whitelist configuration of allowed mdev types and devices.
> 
> 
> 2. report the number of allocable resources to resource provider inventory
> I kind of prefer that we report one inventory per vGPU. That is to say, the admin configures the supported mdev type for the device in
> the cyborg.conf file, then cyborg reports the available_instances of the single selected mdev type to the resource provider
> inventory.
> 
> 
> For the alternative, if we report one inventory per mdev type, that means: 1) we need to report all types, but only one
> inventory is actually available, and we set reserved = total for all the rest of the mdev types; 2) when the admin re-configures the mdev type, we
> still need to update the newly-selected type to available, while the others remain reserved. This sounds like we would
> report quite a bit of redundant data to placement. In practice, we care about the inventory of the selected type more than
> that of the other types.
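
to make the reserved = total bookkeeping concrete, here is a rough python sketch (the function name and resource
class naming are just for illustration, not cyborg code; the inventory fields follow placement's inventory schema)
of building per-mdev-type inventories where only the selected type is consumable:

```python
def mdev_type_inventories(selected: str, capacities: dict) -> dict:
    """Build one placement inventory per mdev type.

    Every type except the selected one is fully reserved
    (reserved == total) so placement never allocates from it.
    """
    inventories = {}
    for mdev_type, total in capacities.items():
        # Custom resource classes must match CUSTOM_[A-Z0-9_]+.
        rc = "CUSTOM_" + mdev_type.upper().replace("-", "_")
        inventories[rc] = {
            "total": total,
            "reserved": 0 if mdev_type == selected else total,
            "min_unit": 1,
            "max_unit": total,
            "step_size": 1,
            "allocation_ratio": 1.0,
        }
    return inventories

# Example: nvidia-36 is the configured type; nvidia-35 stays reserved.
inv = mdev_type_inventories("nvidia-36", {"nvidia-35": 32, "nvidia-36": 16})
```

when the admin re-configures the type, the driver just recomputes this mapping and updates the inventories.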
> 
> 
> 3.1 driver should also support a function to create a certain mdev type, such as (V100D-1Q, V100D-2Q)
> > well not really; it will echo a uuid into a file in sysfs, which will trigger the vendor driver to create the mdev, but
> > cyborg is not actually linking to the vendor driver and invoking a C function directly.
> 
> 
> yes, in the cyborg management cycle, when reporting an mdev device, it just generates uuid(s) for the mdev but does not
> really create it. I think we can create mdevs in two possible ways:
> 
Yes, I think both will work.
> 
> - solution1: A certain mdev type will not actually be created until nova-compute starts to write the mdev info to the XML and
> the virt driver spawns the VM. We can extend the current accel_info attachment type to handle both add_accel_pci_device and
> add_accel_mdev_device in nova/virt/libvirt/driver.py, where before add_accel_mdev_device we can create the mdev by just
> calling nova.privsep.libvirt.create_mdev there.
In this case we would probably want the mdev attachment type to contain the uuid of the mdev to create.
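
for illustration, a rough sketch of what the handler in nova could render from such an attach handle. The handle
layout here is hypothetical (not the actual cyborg schema); the <hostdev> element is the form libvirt uses for
mediated devices:

```python
import uuid

def mdev_hostdev_xml(mdev_uuid: str) -> str:
    """Render the libvirt <hostdev> element for a mediated device.

    This is what the libvirt driver would need to emit when it sees an
    mdev attach handle; only the mdev uuid is required.
    """
    return (
        "<hostdev mode='subsystem' type='mdev' model='vfio-pci'>\n"
        "  <source>\n"
        "    <address uuid='%s'/>\n"
        "  </source>\n"
        "</hostdev>" % mdev_uuid
    )

# Hypothetical attach-handle payload cyborg could return during ARQ bind;
# the field names are illustrative only.
handle = {"attach_type": "MDEV", "attach_info": {"uuid": str(uuid.uuid4())}}
print(mdev_hostdev_xml(handle["attach_info"]["uuid"]))
```

this is deliberately parallel to how the existing PCI attach handle carries just a PCI address.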
> 
> 
> - solution2: during the ARQ binding process, cyborg creates the mdev in a way similar to
> nova.privsep.libvirt.create_mdev
And in this case the attachment type would contain the mdev path or uuid.
This is my preference, as it allows cyborg to create stable mdevs across reboots if it wants to.
The mdev uuid can change in nova's current code. At the moment that does not matter too much, but
it might be nice to, for example, use the deployable object's uuid as the mdev uuid;
that way it would be easy to correlate between the two.
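
a rough sketch of how that could look in cyborg. The create mechanism (writing a uuid to the type's "create" file)
is the standard kernel mdev sysfs interface; the function name and the sysfs_root parameter are just for
illustration:

```python
import os

def create_mdev(parent_addr: str, mdev_type: str, mdev_uuid: str,
                sysfs_root: str = "/sys/class/mdev_bus") -> str:
    """Create a mediated device by echoing its uuid into the mdev type's
    'create' file, the same mechanism nova.privsep.libvirt.create_mdev
    uses. Passing the deployable uuid as mdev_uuid keeps the mdev stable
    across reboots and easy to correlate with cyborg's records.
    """
    create_path = os.path.join(
        sysfs_root, parent_addr, "mdev_supported_types", mdev_type, "create")
    with open(create_path, "w") as f:
        f.write(mdev_uuid)
    return mdev_uuid
```

on a real host this needs root privileges, so in cyborg it would sit behind a privsep boundary just as it does in nova.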

> 
> 
> 3.2 and one more thing needs to be mentioned: we should avoid conflicts with the existing mdev management logic
> in nova,
> so we may need to introduce an acc_mdev_flag here to check whether the mdev request is from cyborg or not.
> 
> 
> 3.3 xml generation
> > > Finally, we need to support an mdev handler for XML generation in nova; we can refer to the cyborg PCI handler in
> > > nova
> > >  
> > 
> > yes, I brought this up at the nova PTG as something we will likely need to add support for. As I was suggesting above,
> > this can be done by adding a new attachment type, mdev, that just contains the uuid instead of the pci
> > address.
> 
> 
> +1, agree. Supporting the new attachment type makes sense.
> 
> 
> 4. mdev fake driver support
> > > IMHO, cyborg can support an mdev fake driver (similar to the current FPGA fake driver) for the mdev attachment type
> > > support in nova,
> > > or maybe we can extend the current fake driver to support both mdev and pci devices
> 
> 
> > In parallel, if the fake driver can be extended to support mdev attachments, we can also use that as another way to
> > validate the interaction. The difference between the two approaches is that using the mdpy or mtty kernel modules would
> > allow us to actually use the mdev attachment and add the mdev to a VM, whereas with the fake driver approach we would
> > not add the fake mdev to the libvirt XML or to the VM. The mdpy and mtty sample kernel modules are intended for
> > testing the mdev framework without any hardware requirement, so I think that is perfect for our CI system. It is more
> > work than just extending the fake driver, but it has several benefits.
> 
> 
> @Sean: "the mdpy and mtty sample kernel modules are intended for testing the mdev framework without any
> hardware requirement, so I think that is perfect for our CI system."
> does this mean this can also be an easier way to do third-party CI for mdev devices?

I think if we use the mdpy or mtty sample kernel modules we would not need third-party CI and could fully test mdev
support in the first-party CI. Third-party CI would then only be required for stateful mdev devices that need a
custom driver to do some initial programming or cleanup of the device that the generic mdev driver cannot do.

The mtty sample module https://github.com/torvalds/linux/blob/master/samples/vfio-mdev/mtty.c
emulates a virtual serial port that basically acts as an echo server.

If you create the device:
  # echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001" >	\
              /sys/devices/virtual/mtty/mtty/mdev_supported_types/mtty-2/create
and add it to QEMU:

-device vfio-pci,\
      sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001

then it will show up in the guest as a PCI device with vendor and product ID 4348:3253.


# lspci -s 00:05.0 -xxvv
     00:05.0 Serial controller: Device 4348:3253 (rev 10) (prog-if 02 [16550])
             Subsystem: Device 4348:3253
             Physical Slot: 5
             Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
     Stepping- SERR- FastB2B- DisINTx-
             Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
     <TAbort- <MAbort- >SERR- <PERR- INTx-
             Interrupt: pin A routed to IRQ 10
             Region 0: I/O ports at c150 [size=8]
             Region 1: I/O ports at c158 [size=8]
             Kernel driver in use: serial
     00: 48 43 53 32 01 00 00 02 10 02 00 07 00 00 00 00
     10: 51 c1 00 00 59 c1 00 00 00 00 00 00 00 00 00 00
     20: 00 00 00 00 00 00 00 00 00 00 00 00 48 43 53 32
     30: 00 00 00 00 00 00 00 00 00 00 00 00 0a 01 00 00

     In the Linux guest VM, dmesg output for the device is as follows:

     serial 0000:00:05.0: PCI INT A -> Link[LNKA] -> GSI 10 (level, high) -> IRQ 10
     0000:00:05.0: ttyS1 at I/O 0xc150 (irq = 10) is a 16550A
     0000:00:05.0: ttyS2 at I/O 0xc158 (irq = 10) is a 16550A

You can then use minicom or any serial console application to write data to ttyS1 or ttyS2 in the guest,
and the host mdev module will loop it back, acting as an echo server.
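
as a sketch, the guest-side check could be as simple as this. In the guest, both file descriptors would be a single
fd opened on /dev/ttyS1 (the mtty-backed port); they are parameters here only so the loopback logic is testable
anywhere:

```python
import os

def check_serial_loopback(tx_fd: int, rx_fd: int,
                          payload: bytes = b"ping\n") -> bool:
    """Write a payload and verify the same bytes come back.

    With the mtty module on the host, everything written to the guest's
    serial port is echoed back, so the read should return the payload.
    """
    os.write(tx_fd, payload)
    return os.read(rx_fd, len(payload)) == payload
```

a tempest test would do essentially this over SSH inside the guest.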

So we should be able to add an optional tempest test to the cyborg tempest plugin to fully validate end-to-end
functioning of generic mdev support, including SSHing into a VM that is using an mtty serial port and validating
that it loops back data, fully testing the feature in the first-party CI.

All we need to do is compile and modprobe the mtty module using a devstack plugin in the gate job.
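
roughly, the plugin would only need something like the following. This is an untested outline, not a finished
devstack plugin; the paths assume a typical Ubuntu gate node with the matching kernel headers and a checkout of the
kernel source samples available:

```shell
# Sketch of the devstack plugin steps for building and loading mtty.
# Assumes $KERNEL_SRC points at a kernel tree containing samples/vfio-mdev.
sudo apt-get install -y "linux-headers-$(uname -r)"

# Build the sample module out of tree against the running kernel.
make -C "/lib/modules/$(uname -r)/build" \
     M="$KERNEL_SRC/samples/vfio-mdev" modules

# Load it and confirm the parent device registered its mdev types.
sudo insmod "$KERNEL_SRC/samples/vfio-mdev/mtty.ko"
ls /sys/devices/virtual/mtty/mtty/mdev_supported_types
```

after that, the whitelist in nova or cyborg just needs to allow the mtty-1/mtty-2 types.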


> 
> 
> 
> best regards,
> Yumeng
> 
> 
> 
> On Thursday, June 25, 2020, 08:43:11 PM GMT+8, Sean Mooney <smooney at redhat.com> wrote: 
> 
> 
> 
> 
> 
> On Thu, 2020-06-25 at 00:31 +0000, Feng, Shaohe wrote:
> > Hi Yumeng and Xin-ran:
> > Not sure you noticed that Sean Mooney brought up nova supporting an mdev attachment type at the nova PTG, as
> > follows:
> > 
> > > yes, I brought this up at the nova PTG as something we will likely need
> > > to add support for. As I was suggesting above, this can be done by adding a new attachment type, mdev, that
> > > just contains the uuid instead of the pci address.
> > 
> > IMHO, cyborg can support an mdev fake driver (similar to the current FPGA fake driver) for the mdev attachment type
> > support in nova,
> > or maybe we can extend the current fake driver to support both mdev and pci devices
> 
> Cyborg could support a fake mdev driver, yes; however, I do think adding support to deploy with the mdpy driver also
> makes sense. VMs booted with a GPU backed by an mdpy mdev actually get a functional frame buffer, and you can view it
> in the default VNC console. I have manually verified this.
> 
> I think this is out of scope for cyborg, however. What I was planning to do, if I ever get the time, is to create an
> mdpy devstack plugin that would compile and deploy the kernel module. With that we can just add the plugin to a zuul
> job and then whitelist the mdev types either in nova or cyborg to do testing with it and the proposed stateless mdev
> driver.
> 
> > In parallel, if the fake driver can be extended to support mdev attachments, we can also use that as another way to
> > validate the interaction. The difference between the two approaches is that using the mdpy or mtty kernel modules would
> > allow us to actually use the mdev attachment and add the mdev to a VM, whereas with the fake driver approach we would
> > not add the fake mdev to the libvirt XML or to the VM. The mdpy and mtty sample kernel modules are intended for
> > testing the mdev framework without any hardware requirement, so I think that is perfect for our CI system. It is more
> > work than just extending the fake driver, but it has several benefits.
> 
> > 
> > 
> > BR
> > Shaohe
> > 
> > -----Original Message-----
> > From: Sean Mooney <smooney at redhat.com>
> > Sent: 2020年6月23日 21:38
> > To: Feng, Shaohe <shaohe.feng at intel.com>; openstack-discuss at lists.openstack.org
> > Cc: yumeng_bao at yahoo.com; shhfeng at 126.com
> > Subject: Re: [cyborg][nova] Support flexible use scenario for Mdev(such as vGPU)
> > 
> > On Tue, 2020-06-23 at 12:25 +0000, Feng, Shaohe wrote:
> > > 
> > > -----Original Message-----
> > > From: Sean Mooney <smooney at redhat.com>
> > > Sent: 2020年6月23日 19:48
> > > To: Feng, Shaohe <shaohe.feng at intel.com>; 
> > > openstack-discuss at lists.openstack.org
> > > Cc: yumeng_bao at yahoo.com; shhfeng at 126.com
> > > Subject: Re: [cyborg][nova] Support flexible use scenario for 
> > > Mdev(such as vGPU)
> > > 
> > > On Tue, 2020-06-23 at 05:50 +0000, Feng, Shaohe wrote:
> > > > Hi all,
> > > > 
> > > > Currently openstack supports vGPU as follows:
> > > > https://docs.openstack.org/nova/latest/admin/virtual-gpu.html
> > > > 
> > > > In order to support it, the admin should plan ahead and configure the vGPU before deployment, as follows:
> > > > https://docs.openstack.org/nova/latest/configuration/config.html#devices.enabled_vgpu_types
> > > > This is very inconvenient for the administrator, and this method has the limitation that a single PCI address
> > > > cannot provide two different types.
> > > 
> > > That is a matter of perspective: there are those that prefer to check
> > > all of their configuration into git and have a declarative deployment,
> > > and those that wish to drive everything via the API. For the former, having to configure available resources via
> > > cyborg's API or placement would be considered very inconvenient.
> > > 
> > > > 
> > > > Cyborg as an accelerator management tool is more suitable for mdev device management.
> > > 
> > > Maybe, but maybe not.
> > > I do think that cyborg should support mdevs; I do not think we should have a dedicated vGPU mdev driver, however.
> > > I think we should create a stateless mdev driver that uses a similar whitelist of allowed mdev types and devices.
> > > 
> > > We did briefly discuss adding generic mdev support to nova in a future
> > > release (W), or whether we should delegate that to cyborg, but I would
> > > hesitate to do that if cyborg continues its current design where drivers have no configuration element to whitelist
> > > devices, as that makes it much harder for deployment tools to properly configure a host declaratively.
> > > [Feng, Shaohe]
> > > We do support config; for example, our demo for FPGA pre-programming supports config, and we support config for our new drivers.
> > > And for other accelerators, the infrastructure may also need accelerators for acceleration, not only VMs.
> > > For example, cinder can use QAT for compression/crypto, and a VM can also use QAT.
> > > We need to configure which QATs are for the infrastructure and which are for VMs.
> > 
> > Yes, QAT is a good example of where sharing between host usage (cinder service) and guest usage (VMs) could be
> > required; the same could be true of GPUs. Typically many servers run headless, but not always, and sometimes you
> > will want to reserve a GPU for the host to use. NICs are another good example: when we look at generic PCI
> > passthrough we need to select which NICs will be used by the VMs and which will be used for host connectivity or
> > for hardware-offloaded OVS.
> > 
> > > 
> > > 
> > > > 
> > > > One solution is as follows:
> > > > Firstly, we need a vendor driver (this can be a plugin); it is used
> > > > to discover its special devices and report them to placement for scheduling.
> > > > The difference from the current implementation is that:
> > > > 1. report the mdev_supported_types as traits to the resource provider.
> > > > How to discover a GPU type:
> > > > $ ls /sys/class/mdev_bus/*/mdev_supported_types
> > > > /sys/class/mdev_bus/0000:84:00.0/mdev_supported_types:
> > > > nvidia-35  nvidia-36  nvidia-37  nvidia-38  nvidia-39  nvidia-40
> > > > nvidia-41  nvidia-42  nvidia-43  nvidia-44  nvidia-45
> > > > so here we report nvidia-3*, nvidia-4* as traits to the resource provider.
> > > > 2. Report the number of allocable resources instead of vGPU unit
> > > > numbers to the resource provider inventory. Example for the NVIDIA V100 PCIe card (one GPU per board):
> > > > Virtual GPU Type    Frame Buffer (GB)    Maximum vGPUs per GPU    Maximum vGPUs per Board
> > > > V100D-32Q           32                   1                        1
> > > > V100D-16Q           16                   2                        2
> > > > V100D-8Q            8                    4                        4
> > > > V100D-4Q            4                    8                        8
> > > > V100D-2Q            2                    16                       16
> > > > V100D-1Q            1                    32                       32
> > > > so here we report 32 GB of buffers (an example; it could be other resources) to the
> > > > resource provider inventory.
> > > 
> > > In this specific example that would not be a good idea.
> > > The V100 does not support mixing mdev types on the same GPU, so if you
> > > allocate a V100D-16Q instance using 16G of the buffer, you cannot then
> > > allocate 2 V100D-8Q vGPU instances to consume the remaining 16G. Other mdev-based devices may not have this
> > > limitation, but NVIDIA only supports 1 active mdev type per physical GPU.
> > > 
> > > Note that the Ampere generation has a dynamic SR-IOV based multi-instance
> > > GPU technology which kind of allows resource-based subdivision of the device, but it does not quite work the
> > > way you are describing above.
> > > 
> > > So you can report inventories of custom resource classes for each
> > > of the mdev types, or a single inventory of VGPU with traits modelling the available mdev types.
> > > 
> > > With the trait approach, before a vGPU is allocated you report all
> > > traits, and the total count for the inventory would be the highest
> > > amount, e.g. 32 in the case above; then when a GPU is allocated you need to update the reserved value and remove
> > > the other traits.
> > 
> > [Feng, Shaohe]
> > Since the V100 does not support mixing mdev types, we would need to remove the other traits.
> > So any suggestion on how a generic driver can support both mixed-type and single-type mdev devices?
> > 
> > 
> > > 
> > > If you have 1 inventory per mdev type, then you set reserved = total
> > > for all inventories of the other mdev types, but there is no need for traits.
> > > [Feng, Shaohe]
> > > Oh, really sorry, I should have chosen a better example.
> > 
> > The sample mdpy kernel module, which creates a basic virtual graphics device, supports multiple mdev types for
> > different resolutions: https://github.com/torvalds/linux/blob/f97c81dc6ca5996560b3944064f63fc87eb18d00/samples/vfio-mdev/mdpy.c
> > I believe it also supports consuming each mdev type independently.
> > So if you don't want to use real hardware as an example, there are at least sample devices that support having
> > multiple active mdevs. I would also suggest we use this device for testing in the upstream gate.
> > 
> > I started creating a job to test nova's vGPU support with this sample device a few months back, but we needed to
> > make a few small changes to make it work, and I was not sure it was appropriate to modify the nova code just to get
> > CI working with a fake device. Currently we make one assumption, that the parent of the mdev is a PCI device, which
> > is not true in the kernel module case.
> > 
> > But from a cyborg perspective you can avoid that mistake, since mdevs can be created for devices on any bus, such
> > as USB or UPI as well as PCIe.
> > [Feng, Shaohe]
> > Good suggestion.
> > > 
> > > > 3. driver should also support a function to create a certain mdev
> > > > type, such as (V100D-1Q, V100D-2Q). Secondly, we need an mdev extended ARQ (it can be a plugin):
> > > 
> > > No, it should not be a plugin; it should just be another attachment type.
> > > > Here is an example for the FPGA ext ARQ:
> > > > https://review.opendev.org/#/c/681005/26/cyborg/objects/extarq/fpga_ext_arq.py at 206
> > > > The difference is that we replace _do_programming
> > > > with _do_create_mdev. _do_programming is used to create a new
> > > > FPGA function.
> > > > _do_create_mdev is used to create an mdev of a given type; it will
> > > > call the implementation function in the vendor driver.
> > > 
> > > Well, not really: it will echo a uuid into a file in sysfs, which will
> > > trigger the vendor driver to create the mdev, but cyborg is not actually linking to the vendor driver and invoking
> > > a C function directly.
> > > 
> > > > 
> > > > Finally, we need to support an mdev handler for XML generation in
> > > > nova; we can refer to the cyborg PCI handler in nova
> > > > 
> > > 
> > > yes, I brought this up at the nova PTG as something we will likely need
> > > to add support for. As I was suggesting above, this can be done by adding a new attachment type, mdev, that
> > > just contains the uuid instead of the pci address.
> > > 
> > > > So after the above changes:
> > > > Admin can create different SLA device profiles, such as:
> > > > {"name": "Gold_vGPU",
> > > >  "groups": [
> > > >    {"resources:vGPU_BUFFERS": "16",
> > > >     "traits:V100D-16Q": "required"}]
> > > > }
> > > > And
> > > > {"name": "Iron_vGPU",
> > > >  "groups": [
> > > >    {"resources:vGPU_BUFFERS": "1",
> > > >     "traits:V100D-1Q": "required"}]
> > > > }
> > > > Then one tenant can use Gold_vGPU to create a VM with a V100D-16Q
> > > > vGPU, and another tenant can use Iron_vGPU to create a VM with a
> > > > V100D-1Q vGPU.
> > > 
> > > It cannot do this on the same physical GPU, but yes, that could work, or
> > > you could do
> > > {"name": "Gold_vGPU", "groups": [{"resources:CUSTOM_V100D_16Q": "1"}]}
> > > 
> > > Currently for nova we just do resources:VGPU, and you can optionally do
> > > trait:CUSTOM_
> > > 
> > > By the way, we cannot use standard resource classes or traits for the
> > > mdev types, as these are arbitrary strings chosen by the vendor that can
> > > potentially change based on the kernel or driver version, so we should not add them to os-traits or
> > > os-resource-classes, and instead should use CUSTOM_ resource classes or traits for them.
> > > [Feng, Shaohe]
> > > Yes, use CUSTOM_ for them.
> > > > 
> > > > When the ARQ binds during VM creation, cyborg will call the vendor driver to create the expected mdev vGPU.
> > > > And these 2 mdev vGPUs can be on the same physical GPU card.
> > > > 
> > > > The mdev extended ARQ and vendor driver can be plugins; they are loosely coupled with the upstream code.
> > > 
> > > It should not be a plugin; it is a generic virtualisation attachment model
> > > that can be used by any device. We should standardise the attach handle in core cyborg and add support for that
> > > attachment model in nova.
> > > We already have support for generating the mdev XML, so we would only
> > > need to wire up the handling for the attachment type in the code that currently handles the pci attachment type.
> > > [Feng, Shaohe]
> > > Yes, we will support the standard generic ARQ,
> > > and only extend the ARQ for some special accelerators; FPGA is an example.
> > 
> > I am not sure we need to extend the ARQ for FPGA, but perhaps at some point.
> > Nova does not support plugins in general, so as long as the info we need to request or receive does not vary based
> > on cyborg plugins I guess it could be OK, but outside of the driver layer I would personally avoid introducing
> > plugins to cyborg.
> > > > So downstreams can take the upstream code and customize their own mdev extended ARQ and vendor driver.
> > > > Here vGPU is just an example; it can be other mdev devices.
> > > 
> > > Yes, and because this can be used for other devices, that is why I would
> > > counter-propose that we create a stateless mdev driver to cover
> > > devices that do not require programming or state management, with a
> > > config-driven interface to map mdev types to custom resource classes
> > > and/or traits. We also need to declare, per device, whether it supports
> > > independent pools of each mdev type or whether they consume the same resource, i.e. whether it works like
> > > NVIDIA's V100s, where you can only have 1 mdev type per physical device, or like the sample device, where you can
> > > have 1 device and multiple mdev types that can be consumed in parallel.
> > > 
> > > 
> > > Both approaches are valid, although I personally prefer independent
> > > pools, since that is easier to reason about.
> > > 
> > > You could also support a non-config-driven approach where we use
> > > attributes on the deployable to describe the mapping of the mdev type to a
> > > resource class, and whether it is independently consumable, but that seems much more cumbersome to manage.
> > > [Feng, Shaohe]
> > > Both a stateless mdev driver and a non-config-driven approach can be supported.
> > > If these cannot satisfy users, they can add their own special mdev driver themselves.
> > 
> > Well, if I put my downstream hat on, in terms of productization we are currently debating whether we want to
> > include cyborg in a future release of Red Hat OpenStack Platform. At the moment it is not planned for our next
> > major release, which is OSP 17, and it is in consideration for OSP 18. One of the concerns we have with adding
> > cyborg to a future release is the lack of a config-driven approach. It is not a blocker, but having an API-only
> > approach, while it has some advantages, also has several drawbacks, not least of which is supportability and
> > day-1 and day-2 operational complexity.
> > 
> > For example, we have a tool that customers use when reporting bugs, called sosreport, which automates the
> > collection of logs, config, and other system information. It can also invoke some commands like virsh list etc.,
> > but adding a new config and log to collect is significantly less work than adding a new command that needs to
> > discover the node uuid and then query the placement and cyborg APIs to determine what accelerators are available
> > and how they are configured. So API-only-configured services are harder for distros to support from a day-2
> > perspective when things go wrong. From a day-1 perspective it is also simpler for installer tools to template out
> > a config file than it is to invoke commands imperatively against an API.
> > 
> > So I understand that from an operator's perspective, invoking an API to do all config management remotely might
> > be quite appealing, but it comes with trade-offs. Upstream should not be governed by downstream distro concerns,
> > but we should be mindful not to make the service overly hard to deploy and manage, as that raises the barrier to
> > integration, and if that barrier is too high it can result in the project not being supported in the long run.
> > That said, having a service that you just deploy with no configuration to do would be nice too, but if we tell our
> > users that after they deploy cyborg they must iterate over every deployable and decide whether to enable it and
> > what attributes to add to it to schedule correctly, I think the complexity might be too high for many.
> > >   
> > > 
> > > 
> > > > 
> > > > BR
> > > > Shaohe Feng
> > > 
> > > 
> > 
> > 
> 
> 



