device compatibility interface for live migration with assigned devices
Hi folks,
We are defining a device migration compatibility interface that helps upper-layer stacks such as openstack/ovirt/libvirt check whether two devices are live migration compatible. The "devices" here could be MDEVs, physical devices, or a hybrid of the two. For example, we could use it to check whether
- a src MDEV can migrate to a target MDEV,
- a src VF in SRIOV can migrate to a target VF in SRIOV,
- a src MDEV can migrate to a target VF in SRIOV (e.g. the SIOV/SRIOV backward compatibility case).
The upper-layer stack could use this interface as the last step to check whether one device is able to migrate to another device before triggering a real live migration procedure. We are not sure whether this interface is of value or help to you. Please don't hesitate to share your comments.
(1) interface definition
The interface is defined as follows:
  __    userspace
   /\              \
  /                 \write
 / read              \
________/__________       ___\|/_____________
| migration_version |     | migration_version |-->check migration
---------------------     ---------------------   compatibility
   device A                    device B
A device attribute named migration_version is defined under each device's sysfs node, e.g. /sys/bus/pci/devices/0000:00:02.0/$mdev_UUID/migration_version. Userspace tools read the migration_version as a string from the source device and write it to the migration_version sysfs attribute of the target device.
Userspace should treat ANY of the conditions below as meaning the two devices are not compatible:
- either of the two devices does not have a migration_version attribute
- an error occurs when reading the migration_version attribute of one device
- an error occurs when writing the migration_version string of one device to the migration_version attribute of the other device
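The read-then-write check and the three failure conditions above could be sketched roughly as follows. This is only an illustration in Python, not part of the proposal: the helper names are made up, and for simplicity both device directories are assumed to be reachable from one process, whereas in a real migration the string read on the source host would have to be carried to the target host first.

```python
import os
from pathlib import Path


def read_migration_version(device_dir):
    """Return the opaque version string of a device, or None on any failure."""
    try:
        return (Path(device_dir) / "migration_version").read_text()
    except OSError:
        # Attribute missing or read error: caller treats this as incompatible.
        return None


def target_accepts(device_dir, version):
    """Write the source string to the target attribute; any error means 'no'."""
    if version is None:
        return False
    attr = Path(device_dir) / "migration_version"
    try:
        # O_WRONLY without O_CREAT: a missing attribute must fail, not be created.
        fd = os.open(attr, os.O_WRONLY)
        try:
            os.write(fd, version.encode())
        finally:
            os.close(fd)
        return True
    except OSError:
        return False


def is_migration_compatible(src_dir, dst_dir):
    """Apply the rule above: every error path counts as 'not compatible'."""
    return target_accepts(dst_dir, read_migration_version(src_dir))
```

Userspace never parses the string here; it only moves it from one attribute to the other and looks at whether the write succeeded.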
The string read from the migration_version attribute is defined by the device vendor driver and is completely opaque to userspace. For an Intel vGPU, the string format can be defined like "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
For an NVMe VF connected to remote storage, it could be "PCI ID" + "driver version" + "configured remote storage URL".
For a QAT VF, it may be "PCI ID" + "driver version" + "supported encryption set".
(To avoid namespace conflicts between vendors, we may prefix a driver name to each migration_version string, e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)
(2) background
The reason we want the migration_version string to be opaque to userspace is that it is hard to generalize standard comparison fields and comparison methods across different devices from different vendors. Userspace could still do a simple string compare to check whether two devices are compatible, and a match would be a correct result, but that is too limited because it excludes candidates whose migration_version strings are not equal. For example, an MDEV with mdev_type_1 and aggregator count 3 is probably compatible with another MDEV with mdev_type_3 and aggregator count 1, even though their migration_version strings are not equal (assuming mdev_type_3 has 3 times the resources of mdev_type_1).
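To make the aggregator example concrete, here is a small Python sketch contrasting a plain equality check with a resource-based one. The per-type unit table and function names are hypothetical and only encode the assumption stated above (mdev_type_3 has 3 times the resources of mdev_type_1); a real vendor driver would do its own comparison in the kernel.

```python
# Hypothetical resource units per mdev type, matching the assumption above.
UNITS_PER_TYPE = {"mdev_type_1": 1, "mdev_type_3": 3}


def total_units(mdev_type, aggregator_count):
    """Total resources of a device: units of its type times aggregator count."""
    return UNITS_PER_TYPE[mdev_type] * aggregator_count


def equal_by_string(src, dst):
    """What userspace could do alone: exact match of (type, count)."""
    return src == dst


def equal_by_resources(src, dst):
    """What a vendor driver could do: compare the resulting resources."""
    return total_units(*src) == total_units(*dst)
```

With src = ("mdev_type_1", 3) and dst = ("mdev_type_3", 1), the string-style check rejects the pair while the resource-based check accepts it, which is the candidate a plain string compare would miss.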
Besides that, driver version and configured resources are also elements that need to be taken into account.
So we hope to leave the freedom to the vendor driver and let it make the final decision, through a simple read on the source side and a test write on the target side.
We then think the device compatibility issue for live migration with assigned devices can be divided into two steps:
a. Management tools filter out possible migration target devices. Tags could be created according to info from the product specification. We think openstack/ovirt may have vendor proprietary components to create those customized tags for each product from each vendor.
For example, for an Intel vGPU, with a vGPU (an MDEV device) on the source side, the tags used to search for a target vGPU are like:
- a tag for compatible parent PCI IDs,
- a tag for a range of gvt driver versions,
- a tag for a range of mdev type + aggregator count.
For an NVMe VF, the tags used to search for a target VF may be like:
- a tag for compatible PCI IDs,
- a tag for a range of driver versions,
- a tag for the URL of the configured remote storage.
b. With the output from step a, openstack/ovirt/libvirt could use our proposed device migration compatibility interface to make sure the two devices are indeed live migration compatible before launching the real live migration process, i.e. before stream copying starts, the src device stops, and the target device resumes. This step is not expected to bring any performance penalty, as
- in the kernel, it's just a simple string decoding and comparison,
- in openstack/ovirt, it could be done by extending the current function check_can_live_migrate_destination, alongside claiming target resources.[1]
[1] https://specs.openstack.org/openstack/nova-specs/specs/stein/approved/libvir...
Thanks Yan
On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:
hi folks, we are defining a device migration compatibility interface that helps upper layer stack like openstack/ovirt/libvirt to check if two devices are live migration compatible. The "devices" here could be MDEVs, physical devices, or hybrid of the two. e.g. we could use it to check whether
- a src MDEV can migrate to a target MDEV,
- a src VF in SRIOV can migrate to a target VF in SRIOV,
- a src MDEV can migrate to a target VF in SRIOV. (e.g. SIOV/SRIOV backward compatibility case)
The upper layer stack could use this interface as the last step to check if one device is able to migrate to another device before triggering a real live migration procedure. we are not sure if this interface is of value or help to you. please don't hesitate to drop your valuable comments.
(1) interface definition The interface is defined in below way:
  __    userspace
   /\              \
  /                 \write
 / read              \
________/__________       ___\|/_____________
| migration_version |     | migration_version |-->check migration
---------------------     ---------------------   compatibility
   device A                    device B
a device attribute named migration_version is defined under each device's sysfs node. e.g. (/sys/bus/pci/devices/0000:00:02.0/$mdev_UUID/migration_version). userspace tools read the migration_version as a string from the source device, and write it to the migration_version sysfs attribute in the target device.
The userspace should treat ANY of below conditions as two devices not compatible:
- any one of the two devices does not have a migration_version attribute
- error when reading from migration_version attribute of one device
- error when writing migration_version string of one device to migration_version attribute of the other device
The string read from migration_version attribute is defined by device vendor driver and is completely opaque to the userspace. for a Intel vGPU, string format can be defined like "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
for an NVMe VF connecting to a remote storage. it could be "PCI ID" + "driver version" + "configured remote storage URL"
for a QAT VF, it may be "PCI ID" + "driver version" + "supported encryption set".
(to avoid namespace confliction from each vendor, we may prefix a driver name to each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)
(2) backgrounds
The reason we hope the migration_version string is opaque to the userspace is that it is hard to generalize standard comparing fields and comparing methods for different devices from different vendors. Though userspace now could still do a simple string compare to check if two devices are compatible, and result should also be right, it's still too limited as it excludes the possible candidate whose migration_version string fails to be equal. e.g. an MDEV with mdev_type_1, aggregator count 3 is probably compatible with another MDEV with mdev_type_3, aggregator count 1, even their migration_version strings are not equal. (assumed mdev_type_3 is of 3 times equal resources of mdev_type_1).
besides that, driver version + configured resources are all elements demanding to take into account.
So, we hope leaving the freedom to vendor driver and let it make the final decision in a simple reading from source side and writing for test in the target side way.
we then think the device compatibility issues for live migration with assigned devices can be divided into two steps: a. management tools filter out possible migration target devices. Tags could be created according to info from product specification. we think openstack/ovirt may have vendor proprietary components to create those customized tags for each product from each vendor.
for Intel vGPU, with a vGPU(a MDEV device) in source side, the tags to search target vGPU are like: a tag for compatible parent PCI IDs, a tag for a range of gvt driver versions, a tag for a range of mdev type + aggregator count
for NVMe VF, the tags to search target VF may be like: a tag for compatible PCI IDs, a tag for a range of driver versions, a tag for URL of configured remote storage.
Requiring management application developers to figure out this possible compatibility based on prod specs is really unrealistic. Product specs are typically as clear as mud, and with the suggestion that we consider different rules for different types of devices, it adds up to a huge amount of complexity. This isn't something app developers should have to spend their time figuring out.
The suggestion that we make use of vendor proprietary helper components is totally unacceptable. We need to be able to build a solution that works with exclusively an open source software stack.
IMHO there needs to be a mechanism for the kernel to report via sysfs what versions are supported on a given device. This puts the job of reporting compatible versions directly under the responsibility of the vendor who writes the kernel driver for it. They are the ones with the best knowledge of the hardware they've built and the rules around its compatibility.
b. with the output from step a, openstack/ovirt/libvirt could use our proposed device migration compatibility interface to make sure the two devices are indeed live migration compatible before launching the real live migration process to start stream copying, src device stopping and target device resuming. It is supposed that this step would not bring any performance penalty as -in kernel it's just a simple string decoding and comparing -in openstack/ovirt, it could be done by extending current function check_can_live_migrate_destination, along side claiming target resources.[1]
[1] https://specs.openstack.org/openstack/nova-specs/specs/stein/approved/libvir...
Thanks Yan
Regards, Daniel
On Tue, 2020-07-14 at 11:21 +0100, Daniel P. Berrangé wrote:
On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:
hi folks, we are defining a device migration compatibility interface that helps upper layer stack like openstack/ovirt/libvirt to check if two devices are live migration compatible. The "devices" here could be MDEVs, physical devices, or hybrid of the two. e.g. we could use it to check whether
- a src MDEV can migrate to a target MDEV,
mdev live migration is completely possible to do, but I agree with Dan Berrangé's comments. From the point of view of openstack integration, I don't see calling out to a vendor-specific tool as an acceptable solution for device compatibility checking. The sys filesystem that describes the mdevs that can be created should also contain the relevant information, such that nova could integrate it via the libvirt xml representation or retrieve the info directly from sysfs.
- a src VF in SRIOV can migrate to a target VF in SRIOV,
So vf to vf migration is not possible in the general case, as there is no standardised way to transfer the device state as part of the sriov specs produced by the pci-sig. As such, there is no vendor-neutral way to support sriov live migration.
- a src MDEV can migrate to a target VF in SRIOV.
that also makes this unviable
(e.g. SIOV/SRIOV backward compatibility case)
The upper layer stack could use this interface as the last step to check if one device is able to migrate to another device before triggering a real live migration procedure.
Well, actually that is already too late really. Ideally we would want to do this compatibility check much sooner to avoid the migration failing. In an openstack environment at least, by the time we invoke libvirt (assuming you're using the libvirt driver) to do the migration, we have already finished scheduling the instance to the new host. If we do the compatibility check at this point and it fails, then the live migration is aborted and will not be retried. These types of late checks lead to a poor user experience because, unless you check the migration details, it basically looks like the migration was ignored: it starts to migrate and then continues running on the original host.
When using generic pci passthrough with openstack, the pci alias is intended to reference a single vendor id/product id, so you will have 1+ aliases for each type of device. That allows openstack to schedule based on the availability of a compatible device, because we track inventories of pci devices and can query that when selecting a host.
If we were to support mdev live migration in the future, we would want to take the same declarative approach:
1. introspect the capabilities of the devices we manage
2. create inventories of the allocatable devices and their capabilities
3. schedule the instance to a host based on the device type/capabilities and claim it atomically to prevent races
4. have the lower-level hypervisors do additional validation if needed before live migration.
This proposal seems to be targeting extending step 4, whereas ideally we should focus on providing the info that would be relevant in step 1, preferably in a vendor-neutral way via a kernel interface like /sys.
we are not sure if this interface is of value or help to you. please don't hesitate to drop your valuable comments.
(1) interface definition The interface is defined in below way:
  __    userspace
   /\              \
  /                 \write
 / read              \
________/__________       ___\|/_____________
| migration_version |     | migration_version |-->check migration
---------------------     ---------------------   compatibility
   device A                    device B
a device attribute named migration_version is defined under each device's sysfs node. e.g. (/sys/bus/pci/devices/0000:00:02.0/$mdev_UUID/migration_version).
This might be useful, as we could tag the inventory with the migration version and only migrate to devices with the same version.
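As a rough illustration of that idea (not actual nova code; the record layout and function name are invented for this sketch), an inventory tagged with the migration version could be filtered in Python like this, without the scheduler contacting any compute host:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceRecord:
    """One allocatable device as tracked in a (hypothetical) inventory."""
    host: str
    mdev_type: str
    migration_version: str


def candidate_targets(source, inventory):
    """Keep devices on other hosts with the same type and the same version tag."""
    return [
        dev
        for dev in inventory
        if dev.host != source.host
        and dev.mdev_type == source.mdev_type
        and dev.migration_version == source.migration_version
    ]
```

The version tag is only compared for equality here, so it can stay opaque to the scheduler.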
userspace tools read the migration_version as a string from the source device, and write it to the migration_version sysfs attribute in the target device.
This would not be useful, as the scheduler cannot directly connect to the compute host, and even if it could, it would be extremely slow to do this for 1000s of hosts and potentially multiple devices per host.
The userspace should treat ANY of below conditions as two devices not compatible:
- any one of the two devices does not have a migration_version attribute
- error when reading from migration_version attribute of one device
- error when writing migration_version string of one device to migration_version attribute of the other device
The string read from migration_version attribute is defined by device vendor driver and is completely opaque to the userspace.
Opaque vendor-specific strings that higher-level orchestrators have to pass from host to host and can't reason about are evil. When allowed, they proliferate and make any idea of a vendor-neutral abstraction and interoperability between systems impossible to reason about. That said, there is a way to make it opaque but still useful to userspace. See below.
for a Intel vGPU, string format can be defined like "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
for an NVMe VF connecting to a remote storage. it could be "PCI ID" + "driver version" + "configured remote storage URL"
for a QAT VF, it may be "PCI ID" + "driver version" + "supported encryption set".
(to avoid namespace confliction from each vendor, we may prefix a driver name to each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)
Honestly, I would much prefer if the version string was just a semver string, e.g. {major}.{minor}.{bugfix}
If you do a driver/firmware update and break compatibility with an older version, bump the major version.
If you add an optional feature that does not break backwards compatibility when you migrate an older instance to the new host, then just bump the minor/feature number.
If you have a fix for a bug that does not change the feature set or compatibility backwards or forwards, then bump the bugfix number.
Then the check is as simple as:
1.) is the mdev type the same
2.) is the major version the same
3.) am I going from a version to the same version or to a newer version
If all 3 are true we can migrate, e.g.
2.0.1 -> 2.1.1 (ok: same major version and migrating from an older feature release to a newer feature release)
2.1.1 -> 2.0.1 (not ok: same major version, but migrating from a newer feature release to an older feature release may be incompatible)
2.0.0 -> 3.0.0 (not ok: changing major version)
2.0.1 -> 2.0.0 (ok: same major and minor version; all bugfixes in the same minor release should be compatible)
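The three checks and the four examples above can be written down as a short Python sketch. The function names are illustrative only, and the bugfix number is deliberately ignored, as the last example implies:

```python
def parse_version(text):
    """Split '{major}.{minor}.{bugfix}' into three integers."""
    major, minor, bugfix = (int(part) for part in text.split("."))
    return major, minor, bugfix


def can_migrate(src_type, src_version, dst_type, dst_version):
    """Same mdev type, same major, and target minor not older than source."""
    if src_type != dst_type:
        return False
    src_major, src_minor, _ = parse_version(src_version)
    dst_major, dst_minor, _ = parse_version(dst_version)
    return src_major == dst_major and src_minor <= dst_minor
```

For the same mdev type, this accepts 2.0.1 -> 2.1.1 and 2.0.1 -> 2.0.0, and rejects 2.1.1 -> 2.0.1 and 2.0.0 -> 3.0.0.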
We don't need vendors to re-encode the driver name or vendor id and product id in the string. That info is already available both to the device driver and to userspace via /sys. We just need to know if versions of the same mdev are compatible, so a simple semver version string, which is well known in the software world at least, is a clean abstraction we can reuse.
(2) backgrounds
The reason we hope the migration_version string is opaque to the userspace is that it is hard to generalize standard comparing fields and comparing methods for different devices from different vendors. Though userspace now could still do a simple string compare to check if two devices are compatible, and result should also be right, it's still too limited as it excludes the possible candidate whose migration_version string fails to be equal. e.g. an MDEV with mdev_type_1, aggregator count 3 is probably compatible with another MDEV with mdev_type_3, aggregator count 1, even their migration_version strings are not equal. (assumed mdev_type_3 is of 3 times equal resources of mdev_type_1).
besides that, driver version + configured resources are all elements demanding to take into account.
So, we hope leaving the freedom to vendor driver and let it make the final decision in a simple reading from source side and writing for test in the target side way.
we then think the device compatibility issues for live migration with assigned devices can be divided into two steps: a. management tools filter out possible migration target devices. Tags could be created according to info from product specification. we think openstack/ovirt may have vendor proprietary components to create those customized tags for each product from each vendor. for Intel vGPU, with a vGPU(a MDEV device) in source side, the tags to search target vGPU are like: a tag for compatible parent PCI IDs, a tag for a range of gvt driver versions, a tag for a range of mdev type + aggregator count
for NVMe VF, the tags to search target VF may be like: a tag for compatible PCI IDs, a tag for a range of driver versions, a tag for URL of configured remote storage.
Requiring management application developers to figure out this possible compatibility based on prod specs is really unrealistic. Product specs are typically as clear as mud, and with the suggestion we consider different rules for different types of devices, add up to a huge amount of complexity. This isn't something app developers should have to spend their time figuring out.
The suggestion that we make use of vendor proprietary helper components is totally unacceptable. We need to be able to build a solution that works with exclusively an open source software stack.
IMHO there needs to be a mechanism for the kernel to report via sysfs what versions are supported on a given device. This puts the job of reporting compatible versions directly under the responsibility of the vendor who writes the kernel driver for it. They are the ones with the best knowledge of the hardware they've built and the rules around its compatibility.
Yep, totally agree with that statement.
b. with the output from step a, openstack/ovirt/libvirt could use our proposed device migration compatibility interface to make sure the two devices are indeed live migration compatible before launching the real live migration process to start stream copying, src device stopping and target device resuming. It is supposed that this step would not bring any performance penalty as -in kernel it's just a simple string decoding and comparing -in openstack/ovirt, it could be done by extending current function check_can_live_migrate_destination, along side claiming target resources.[1]
That is a compute driver function https://github.com/openstack/nova/blob/8988316b8c132c9662dea6cf0345975e87ce7... that is called in the conductor here
https://github.com/openstack/nova/blob/8988316b8c132c9662dea6cf0345975e87ce7... If the check fails (ignoring the fact it's expensive to do an rpc to the compute host), we raise an exception and move on to the next host in the alternate host list.
https://github.com/openstack/nova/blob/8988316b8c132c9662dea6cf0345975e87ce7... By default the alternate host list is 3 https://docs.openstack.org/nova/latest/configuration/config.html#scheduler.m... so there would be a pretty high likelihood that, if we only checked compatibility at this point, it would fail to migrate. Realistically speaking, this is too late. We can do a final safety check at this point, but this should not be the first time we check compatibility. At a minimum we would have wanted to select a host with the same mdev type first. We can do that from the info we have today, but I hope I have made the point that declarative interfaces which we can introspect without having opaque vendor-specific blobs are vastly more consumable than imperative interfaces we have to probe. From a security and packaging point of view this is better too: if I only need read-only access to sysfs instead of write access, and if I don't need to package a bunch of additional vendor tools in a containerised deployment, that significantly decreases the potential attack surface.
[1] https://specs.openstack.org/openstack/nova-specs/specs/stein/approved/libvir...
Thanks Yan
Regards, Daniel
On Tue, 14 Jul 2020 13:33:24 +0100 Sean Mooney smooney@redhat.com wrote:
On Tue, 2020-07-14 at 11:21 +0100, Daniel P. Berrangé wrote:
On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:
hi folks, we are defining a device migration compatibility interface that helps upper layer stack like openstack/ovirt/libvirt to check if two devices are live migration compatible. The "devices" here could be MDEVs, physical devices, or hybrid of the two. e.g. we could use it to check whether
- a src MDEV can migrate to a target MDEV,
mdev live migration is completely possible to do, but I agree with Dan Berrangé's comments. From the point of view of openstack integration, I don't see calling out to a vendor-specific tool as an acceptable
As I replied to Dan, I'm hoping Yan was referring more to vendor specific knowledge rather than actual tools.
solution for device compatibility checking. The sys filesystem that describes the mdevs that can be created should also contain the relevant information, such that nova could integrate it via the libvirt xml representation or retrieve the info directly from sysfs.
- a src VF in SRIOV can migrate to a target VF in SRIOV,
So vf to vf migration is not possible in the general case, as there is no standardised way to transfer the device state as part of the sriov specs produced by the pci-sig. As such, there is no vendor-neutral way to support sriov live migration.
We're not talking about a general case, we're talking about physical devices which have vfio wrappers or hooks with device specific knowledge in order to support the vfio migration interface. The point is that a discussion around vfio device migration cannot be limited to mdev devices.
- a src MDEV can migrate to a target VF in SRIOV.
that also makes this unviable
(e.g. SIOV/SRIOV backward compatibility case)
The upper layer stack could use this interface as the last step to check if one device is able to migrate to another device before triggering a real live migration procedure.
Well, actually that is already too late really. Ideally we would want to do this compatibility check much sooner to avoid the migration failing. In an openstack environment at least, by the time we invoke libvirt (assuming you're using the libvirt driver) to do the migration, we have already finished scheduling the instance to the new host. If we do the compatibility check at this point and it fails, then the live migration is aborted and will not be retried. These types of late checks lead to a poor user experience because, unless you check the migration details, it basically looks like the migration was ignored: it starts to migrate and then continues running on the original host.
When using generic pci passthrough with openstack, the pci alias is intended to reference a single vendor id/product id, so you will have 1+ aliases for each type of device. That allows openstack to schedule based on the availability of a compatible device, because we track inventories of pci devices and can query that when selecting a host.
If we were to support mdev live migration in the future, we would want to take the same declarative approach:
1. introspect the capabilities of the devices we manage
2. create inventories of the allocatable devices and their capabilities
3. schedule the instance to a host based on the device type/capabilities and claim it atomically to prevent races
4. have the lower-level hypervisors do additional validation if needed before live migration.
This proposal seems to be targeting extending step 4, whereas ideally we should focus on providing the info that would be relevant in step 1, preferably in a vendor-neutral way via a kernel interface like /sys.
I think this is reading a whole lot into the phrase "last step". We want to make the information available for a management engine to consume as needed to make informed decisions regarding likely compatible target devices.
we are not sure if this interface is of value or help to you. please don't hesitate to drop your valuable comments.
(1) interface definition The interface is defined in below way:
  __    userspace
   /\              \
  /                 \write
 / read              \
________/__________       ___\|/_____________
| migration_version |     | migration_version |-->check migration
---------------------     ---------------------   compatibility
   device A                    device B
a device attribute named migration_version is defined under each device's sysfs node. e.g. (/sys/bus/pci/devices/0000:00:02.0/$mdev_UUID/migration_version).
This might be useful, as we could tag the inventory with the migration version and only migrate to devices with the same version.
Is cross version compatibility something that you'd consider using?
userspace tools read the migration_version as a string from the source device, and write it to the migration_version sysfs attribute in the target device.
This would not be useful, as the scheduler cannot directly connect to the compute host, and even if it could, it would be extremely slow to do this for 1000s of hosts and potentially multiple devices per host.
Seems similar to Dan's requirement, looks like the 'read for version, write for compatibility' test idea isn't really viable.
The userspace should treat ANY of below conditions as two devices not compatible:
- any one of the two devices does not have a migration_version attribute
- error when reading from migration_version attribute of one device
- error when writing migration_version string of one device to migration_version attribute of the other device
The string read from migration_version attribute is defined by device vendor driver and is completely opaque to the userspace.
Opaque vendor-specific strings that higher-level orchestrators have to pass from host to host and can't reason about are evil. When allowed, they proliferate and make any idea of a vendor-neutral abstraction and interoperability between systems impossible to reason about. That said, there is a way to make it opaque but still useful to userspace. See below.
for a Intel vGPU, string format can be defined like "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
for an NVMe VF connecting to a remote storage. it could be "PCI ID" + "driver version" + "configured remote storage URL"
for a QAT VF, it may be "PCI ID" + "driver version" + "supported encryption set".
(to avoid namespace confliction from each vendor, we may prefix a driver name to each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)
Honestly, I would much prefer if the version string was just a semver string, e.g. {major}.{minor}.{bugfix}
If you do a driver/firmware update and break compatibility with an older version, bump the major version.
If you add an optional feature that does not break backwards compatibility when you migrate an older instance to the new host, then just bump the minor/feature number.
If you have a fix for a bug that does not change the feature set or compatibility backwards or forwards, then bump the bugfix number.
Then the check is as simple as:
1.) is the mdev type the same
2.) is the major version the same
3.) am I going from a version to the same version or to a newer version
If all 3 are true we can migrate, e.g.
2.0.1 -> 2.1.1 (ok: same major version and migrating from an older feature release to a newer feature release)
2.1.1 -> 2.0.1 (not ok: same major version, but migrating from a newer feature release to an older feature release may be incompatible)
2.0.0 -> 3.0.0 (not ok: changing major version)
2.0.1 -> 2.0.0 (ok: same major and minor version; all bugfixes in the same minor release should be compatible)
What's the value of the bugfix field in this scheme?
The simplicity is good, but is it too simple? It's not immediately clear to me whether all features can be hidden behind a minor version. For instance, consider an mdev device that supports this notion of aggregation, which is proposed as a solution to the problem that physical hardware might support lots and lots of assignable interfaces which can be combined into arbitrary sets for mdev devices, making it impractical to expose an mdev type for every possible enumeration of assignable interfaces within a device. We therefore expose a base type where the aggregation is built later. This essentially puts us in a scenario where even within an mdev type running on the same driver, there are devices that are not directly compatible with each other.
We don't need vendors to re-encode the driver name or vendor id and product id in the string. That info is already available both to the device driver and to userspace via /sys. We just need to know if versions of the same mdev are compatible, so a simple semver version string, which is well known in the software world at least, is a clean abstraction we can reuse.
This presumes there's no cross device migration: an mdev type can only be migrated to the same mdev type, all of the devices within that type have some base compatibility, and a physical device can only be migrated to the same physical device. In the latter case what defines the type? If it's a PCI device, is it only vendor:device IDs? What about revision? What about subsystem IDs? What about possibly an onboard ROM or internal firmware? The information may be available, but which things are relevant to migration? We already see desires to allow migration between physical and mdev, but also to expose mdev types that might be composable to be compatible with other types. Thanks,
Alex
On Tue, 14 Jul 2020 11:21:29 +0100 Daniel P. Berrangé berrange@redhat.com wrote:
On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:
hi folks, we are defining a device migration compatibility interface that helps upper layer stack like openstack/ovirt/libvirt to check if two devices are live migration compatible. The "devices" here could be MDEVs, physical devices, or hybrid of the two. e.g. we could use it to check whether
- a src MDEV can migrate to a target MDEV,
- a src VF in SRIOV can migrate to a target VF in SRIOV,
- a src MDEV can migrate to a target VF in SRIOV. (e.g. SIOV/SRIOV backward compatibility case)
The upper layer stack could use this interface as the last step to check if one device is able to migrate to another device before triggering a real live migration procedure. we are not sure if this interface is of value or help to you. please don't hesitate to drop your valuable comments.
(1) interface definition
The interface is defined as follows:

             __    userspace
              /\              \
             /                 \write
            / read              \
   ________/__________       ___\|/_____________
  | migration_version |     | migration_version |-->check migration
  ---------------------     ---------------------   compatibility
     device A                    device B

a device attribute named migration_version is defined under each device's sysfs node. e.g. (/sys/bus/pci/devices/0000:00:02.0/$mdev_UUID/migration_version). userspace tools read the migration_version as a string from the source device, and write it to the migration_version sysfs attribute in the target device.
The userspace should treat ANY of below conditions as two devices not compatible:
- any one of the two devices does not have a migration_version attribute
- error when reading from migration_version attribute of one device
- error when writing migration_version string of one device to migration_version attribute of the other device
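As a rough illustration of the read/write procedure and the three failure conditions listed above, a userspace check might look like the following. This is only a sketch under stated assumptions: the function name `devices_compatible` is hypothetical, and it assumes each device is represented by a directory containing a `migration_version` attribute, as in the sysfs layout described in the proposal.

```python
from pathlib import Path


def devices_compatible(src_dev: Path, dst_dev: Path) -> bool:
    """Return True only if the target accepts the source's migration_version.

    src_dev and dst_dev are device directories (hypothetical helper; the
    attribute name migration_version comes from the proposal).
    """
    src_attr = src_dev / "migration_version"
    dst_attr = dst_dev / "migration_version"

    # Condition 1: either device lacks the attribute.
    if not src_attr.is_file() or not dst_attr.is_file():
        return False
    try:
        # Condition 2: a read error on the source means "not compatible".
        version = src_attr.read_bytes()
        # Condition 3: the target driver rejects the string by failing the
        # write; unbuffered I/O makes the error surface at write() time.
        with open(dst_attr, "wb", buffering=0) as f:
            f.write(version)
    except OSError:
        return False
    return True
```

Note that the string is passed through untouched; userspace never parses it, which matches the "opaque" intent of the proposal.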
The string read from migration_version attribute is defined by device vendor driver and is completely opaque to the userspace. For an Intel vGPU, the string format can be defined like "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
For an NVMe VF connecting to a remote storage, it could be "PCI ID" + "driver version" + "configured remote storage URL".
For a QAT VF, it may be "PCI ID" + "driver version" + "supported encryption set".
(To avoid namespace conflicts between vendors, we may prefix a driver name to each migration_version string, e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)
It's very strange to define it as opaque and then proceed to describe the contents of that opaque string. The point is that its contents are defined by the vendor driver to describe the device, driver version, and possibly metadata about the configuration of the device. One instance of a device might generate a different string from another. The string that a device produces is not necessarily the only string the vendor driver will accept, for example the driver might support backwards compatible migrations.
(2) backgrounds
The reason we hope the migration_version string is opaque to the userspace is that it is hard to generalize standard comparing fields and comparing methods for different devices from different vendors. Though userspace could still do a simple string compare to check if two devices are compatible, and the result would be right, it is too limited because it excludes possible candidates whose migration_version strings are not equal. e.g. an MDEV with mdev_type_1, aggregator count 3 is probably compatible with another MDEV with mdev_type_3, aggregator count 1, even though their migration_version strings are not equal (assuming mdev_type_3 has three times the resources of mdev_type_1).
Besides that, driver version and configured resources are all elements that must be taken into account.
So we hope to leave this freedom to the vendor driver and let it make the final decision through a simple procedure: read from the source side, then write to the target side as a test.
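To make the aggregation example above concrete, here is a small sketch of why plain string equality is too strict. Everything in it is an assumption for illustration: the table `UNITS_PER_TYPE`, the helper names, and the idea that compatibility reduces to "same total resources" (a real vendor driver would likely check more than this).

```python
# Hypothetical resource units per mdev type, following the example's
# assumption that mdev_type_3 has three times the resources of mdev_type_1.
UNITS_PER_TYPE = {"mdev_type_1": 1, "mdev_type_3": 3}


def effective_resources(mdev_type: str, aggregator_count: int) -> int:
    """Total resource units of a device: per-type units times aggregation."""
    return UNITS_PER_TYPE[mdev_type] * aggregator_count


def make_string(mdev_type: str, aggregator_count: int) -> str:
    """Illustrative string layout only; the real format is vendor defined."""
    return f"{mdev_type}-{aggregator_count}"


src = ("mdev_type_1", 3)
dst = ("mdev_type_3", 1)

# A naive userspace string compare rejects the pair...
strings_equal = make_string(*src) == make_string(*dst)
# ...while a vendor-side check on effective resources would accept it.
resources_equal = effective_resources(*src) == effective_resources(*dst)
```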
we then think the device compatibility issues for live migration with assigned devices can be divided into two steps: a. management tools filter out possible migration target devices. Tags could be created according to info from product specification. we think openstack/ovirt may have vendor proprietary components to create those customized tags for each product from each vendor.
for Intel vGPU, with a vGPU(a MDEV device) in source side, the tags to search target vGPU are like: a tag for compatible parent PCI IDs, a tag for a range of gvt driver versions, a tag for a range of mdev type + aggregator count
for NVMe VF, the tags to search target VF may be like: a tag for compatible PCI IDs, a tag for a range of driver versions, a tag for URL of configured remote storage.
I interpret this as hand waving, ie. the first step is for management tools to make a good guess :-\ We don't seem to be willing to say that a given mdev type can only migrate to a device with that same type. There's this aggregation discussion happening separately where a base mdev type might be created or later configured to be equivalent to a different type. The vfio migration API we've defined is also not limited to mdev devices, for example we could create vendor specific quirks or hooks to provide migration support for a physical PF/VF device. Within the realm of possibility then is that we could migrate between a physical device and an mdev device, which are simply different degrees of creating a virtualization layer in front of the device.
Requiring management application developers to figure out this possible compatibility based on prod specs is really unrealistic. Product specs are typically as clear as mud, and with the suggestion we consider different rules for different types of devices, add up to a huge amount of complexity. This isn't something app developers should have to spend their time figuring out.
Agreed.
The suggestion that we make use of vendor proprietary helper components is totally unacceptable. We need to be able to build a solution that works with exclusively an open source software stack.
I'm surprised to see this as well, but I'm not sure if Yan was really suggesting proprietary software so much as just vendor specific knowledge.
IMHO there needs to be a mechanism for the kernel to report via sysfs what versions are supported on a given device. This puts the job of reporting compatible versions directly under the responsibility of the vendor who writes the kernel driver for it. They are the ones with the best knowledge of the hardware they've built and the rules around its compatibility.
The version string discussed previously is the version string that represents a given device, possibly including driver information, configuration, etc. I think what you're asking for here is an enumeration of every possible version string that a given device could accept as an incoming migration stream. If we consider the string as opaque, that means the vendor driver needs to generate a separate string for every possible version it could accept, for every possible configuration option. That potentially becomes an excessive amount of data to either generate or manage.
Am I overestimating how vendors intend to use the version string?
We'd also need to consider devices that we could create, for instance providing the same interface enumeration prior to creating an mdev device to have a confidence level that the new device would be a valid target.
We defined the string as opaque to allow vendor flexibility and because defining a common format is hard. Do we need to revisit this part of the discussion to define the version string as non-opaque with parsing rules, probably with separate incoming vs outgoing interfaces? Thanks,
Alex
On Tue, Jul 14, 2020 at 10:16:16AM -0600, Alex Williamson wrote:
On Tue, 14 Jul 2020 11:21:29 +0100 Daniel P. Berrangé berrange@redhat.com wrote:
On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:
The string read from migration_version attribute is defined by device vendor driver and is completely opaque to the userspace. for a Intel vGPU, string format can be defined like "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
for an NVMe VF connecting to a remote storage. it could be "PCI ID" + "driver version" + "configured remote storage URL"
for a QAT VF, it may be "PCI ID" + "driver version" + "supported encryption set".
(to avoid namespace conflicts between vendors, we may prefix a driver name to each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)
It's very strange to define it as opaque and then proceed to describe the contents of that opaque string. The point is that its contents are defined by the vendor driver to describe the device, driver version, and possibly metadata about the configuration of the device. One instance of a device might generate a different string from another. The string that a device produces is not necessarily the only string the vendor driver will accept, for example the driver might support backwards compatible migrations.
IMHO there needs to be a mechanism for the kernel to report via sysfs what versions are supported on a given device. This puts the job of reporting compatible versions directly under the responsibility of the vendor who writes the kernel driver for it. They are the ones with the best knowledge of the hardware they've built and the rules around its compatibility.
The version string discussed previously is the version string that represents a given device, possibly including driver information, configuration, etc. I think what you're asking for here is an enumeration of every possible version string that a given device could accept as an incoming migration stream. If we consider the string as opaque, that means the vendor driver needs to generate a separate string for every possible version it could accept, for every possible configuration option. That potentially becomes an excessive amount of data to either generate or manage.
Am I overestimating how vendors intend to use the version string?
If I'm interpreting your reply & the quoted text correctly, the version string isn't really a version string in any normal sense of the word "version".
Instead it sounds like a string encoding a set of features in some arbitrary vendor specific format, which they parse and do compatibility checks on individual pieces? One or more parts may contain a version number, but it's much more than just a version.
If that's correct, then I'd prefer we didn't call it a version string, instead call it a "capability string" to make it clear it is expressing a much more general concept, but...
We'd also need to consider devices that we could create, for instance providing the same interface enumeration prior to creating an mdev device to have a confidence level that the new device would be a valid target.
We defined the string as opaque to allow vendor flexibility and because defining a common format is hard. Do we need to revisit this part of the discussion to define the version string as non-opaque with parsing rules, probably with separate incoming vs outgoing interfaces? Thanks,
..even if the huge amount of flexibility is technically relevant from the POV of the hardware/drivers, we should consider whether management apps actually want, or can use, that level of flexibility.
The task of picking which host to place a VM on has a lot of factors to consider, and when there are a large number of hosts, the total amount of information to check gets correspondingly large. The placement process is also fairly performance critical.
Running complex algorithmic logic to check compatibility of devices based on an arbitrary set of rules is likely to be a performance challenge. A flat list of supported strings is a much simpler thing to check as it reduces down to a simple set membership test.
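The set membership test mentioned above could be sketched like this. It is a minimal illustration, assuming each host reports a flat set of strings it accepts; the function name `compatible_hosts` and the example strings are hypothetical, not part of any existing scheduler API.

```python
from typing import Dict, FrozenSet, List


def compatible_hosts(source_string: str,
                     accepted_by_host: Dict[str, FrozenSet[str]]) -> List[str]:
    """Return the hosts whose reported set contains the source's string.

    Each lookup is a hash-set membership test, so the cost per host does
    not depend on any vendor-specific comparison rules.
    """
    return sorted(host for host, accepted in accepted_by_host.items()
                  if source_string in accepted)


# Hypothetical data reported by three hosts.
reported = {
    "host-a": frozenset({"drv-v1-typeA", "drv-v2-typeA"}),
    "host-b": frozenset({"drv-v2-typeA"}),
    "host-c": frozenset({"drv-v1-typeB"}),
}
matches = compatible_hosts("drv-v2-typeA", reported)
```

The trade-off raised elsewhere in the thread still applies: the driver would have to enumerate every string it accepts, which may be a large set.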
IOW, even if there's some complex set of device type / vendor specific rules to check for compatibility, I fear apps will ignore them and just define a very simplified list of compatible strings, and ignore all the extra flexibility.
I'm sure OpenStack maintainers can speak to this more, as they've put a lot of work into their scheduling engine to optimize the way it places VMs largely driven from simple structured data reported from hosts.
Regards, Daniel
On Tue, 14 Jul 2020 17:47:22 +0100 Daniel P. Berrangé berrange@redhat.com wrote:
On Tue, Jul 14, 2020 at 10:16:16AM -0600, Alex Williamson wrote:
On Tue, 14 Jul 2020 11:21:29 +0100 Daniel P. Berrangé berrange@redhat.com wrote:
On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:
The string read from migration_version attribute is defined by device vendor driver and is completely opaque to the userspace. for a Intel vGPU, string format can be defined like "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
for an NVMe VF connecting to a remote storage. it could be "PCI ID" + "driver version" + "configured remote storage URL"
for a QAT VF, it may be "PCI ID" + "driver version" + "supported encryption set".
(to avoid namespace conflicts between vendors, we may prefix a driver name to each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)
It's very strange to define it as opaque and then proceed to describe the contents of that opaque string. The point is that its contents are defined by the vendor driver to describe the device, driver version, and possibly metadata about the configuration of the device. One instance of a device might generate a different string from another. The string that a device produces is not necessarily the only string the vendor driver will accept, for example the driver might support backwards compatible migrations.
IMHO there needs to be a mechanism for the kernel to report via sysfs what versions are supported on a given device. This puts the job of reporting compatible versions directly under the responsibility of the vendor who writes the kernel driver for it. They are the ones with the best knowledge of the hardware they've built and the rules around its compatibility.
The version string discussed previously is the version string that represents a given device, possibly including driver information, configuration, etc. I think what you're asking for here is an enumeration of every possible version string that a given device could accept as an incoming migration stream. If we consider the string as opaque, that means the vendor driver needs to generate a separate string for every possible version it could accept, for every possible configuration option. That potentially becomes an excessive amount of data to either generate or manage.
Am I overestimating how vendors intend to use the version string?
If I'm interpreting your reply & the quoted text correctly, the version string isn't really a version string in any normal sense of the word "version".
Instead it sounds like a string encoding a set of features in some arbitrary vendor specific format, which they parse and do compatibility checks on individual pieces? One or more parts may contain a version number, but it's much more than just a version.
If that's correct, then I'd prefer we didn't call it a version string, instead call it a "capability string" to make it clear it is expressing a much more general concept, but...
I'd agree with that. The intent of the previous proposal was to provide an interface for reading a string and writing a string back in where the result of that write indicated migration compatibility with the device. So yes, "version" is not the right term.
We'd also need to consider devices that we could create, for instance providing the same interface enumeration prior to creating an mdev device to have a confidence level that the new device would be a valid target.
We defined the string as opaque to allow vendor flexibility and because defining a common format is hard. Do we need to revisit this part of the discussion to define the version string as non-opaque with parsing rules, probably with separate incoming vs outgoing interfaces? Thanks,
..even if the huge amount of flexibility is technically relevant from the POV of the hardware/drivers, we should consider whether management apps actually want, or can use, that level of flexibility.
The task of picking which host to place a VM on has a lot of factors to consider, and when there are a large number of hosts, the total amount of information to check gets correspondingly large. The placement process is also fairly performance critical.
Running complex algorithmic logic to check compatibility of devices based on an arbitrary set of rules is likely to be a performance challenge. A flat list of supported strings is a much simpler thing to check as it reduces down to a simple set membership test.
IOW, even if there's some complex set of device type / vendor specific rules to check for compatibility, I fear apps will ignore them and just define a very simplified list of compatible strings, and ignore all the extra flexibility.
There's always the "try it and see if it works" interface, which is essentially what we have currently. With even a simple version of what we're trying to accomplish here, there's still a risk that a management engine might rather just ignore it and restrict themselves to 1:1 mdev type matches, with or without knowing anything about the vendor driver version, relying on the migration to fail quickly if the devices are incompatible. If the complexity of the interface makes it too complicated or time consuming to provide sufficient value above such an algorithm, there's not much point to implementing it, which is why Yan has included so many people in this discussion.
I'm sure OpenStack maintainers can speak to this more, as they've put a lot of work into their scheduling engine to optimize the way it places VMs largely driven from simple structured data reported from hosts.
I think we've weeded out that our intended approach is not worthwhile: testing a compatibility string at a device is too much overhead, and we need to provide enough information to the management engine to predict the response without interaction beyond the initial capability probing.
As you've identified above, we're really dealing with more than a simple version, we need to construct a compatibility string and we need to start defining what goes into that.
The first item seems to be that we're defining compatibility relative to a vfio migration stream, vfio devices have a device API, such as vfio-pci, so the first attribute might simply define the device API. Once we have a class of devices we might then be able to use bus specific attributes, for example the PCI vendor and device ID (other bus types TBD).
We probably also need driver version numbers, so we need to include both the driver name as well as version major and minor numbers. Rules need to be put in place around what we consider to be viable version matches, potentially as Sean described. For example, does the major version require a match? Do we restrict to only forward, i.e. increasing, minor number matches within that major version?
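One possible answer to the questions above, written out as a sketch: require an exact major match and allow only forward (equal or increasing) minor numbers. This rule is an assumption chosen for illustration, not something the thread has settled on, and the names `DriverVersion` and `version_compatible` are hypothetical.

```python
from typing import NamedTuple


class DriverVersion(NamedTuple):
    major: int
    minor: int


def version_compatible(src: DriverVersion, dst: DriverVersion) -> bool:
    """Assumed rule: same major, and the target minor is not older.

    Migration from minor N to minor M is allowed only when M >= N,
    i.e. "forward" migration within one major version.
    """
    return src.major == dst.major and dst.minor >= src.minor
```

Under this rule, 0.1 to 0.2 is accepted, while 0.2 to 0.1 and 0.1 to 1.1 are rejected.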
Do we then also have a section that includes any required device attributes to result in a compatible device? This would be largely focused on mdev, but I wouldn't rule out others. For example if an aggregation parameter is required to maintain compatibility, we'd want to specify that as a required attribute.
So maybe we end up with something like:
{
  "device_api": "vfio-pci",
  "vendor": "vendor-driver-name",
  "version": {
    "major": 0,
    "minor": 1
  },
  "vfio-pci": { // Based on above device_api
    "vendor": 0x1234, // Values for the exposed device
    "device": 0x5678,
    // Possibly further parameters for a more specific match
  },
  "mdev_attrs": [
    { "attribute0": "VALUE" }
  ]
}
The sysfs interface would return an array containing one or more of these for each device supported. I'm trying to account for things like aggregation via the mdev_attrs section, but I haven't really put it all together yet. I think Intel folks want to be able to say mdev type foo-3 is compatible with mdev type foo-1 so long as foo-1 is created with an aggregation attribute value of 3, but I expect both foo-1 and foo-3 would have the same user visible PCI vendor:device IDs. If we use mdev type rather than the resulting device IDs, then we introduce a barrier to phys<->mdev migration. We could specify the subsystem values though, for example foo-1 might correspond to subsystem IDs 8086:0001 and foo-3 to 8086:0003, then we can specify that creating a foo-1 from this device doesn't require any attributes, but creating a foo-3 does. I'm nervous how that scales though.
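A management tool consuming such an array might do something like the sketch below. It is only an illustration under several assumptions: the descriptors are already parsed into Python dicts with the field names from the example above, versions must match exactly (the real matching rule is still an open question), and the function name `required_attrs` is hypothetical.

```python
from typing import List, Optional


def required_attrs(source: dict, target_entries: List[dict]) -> Optional[list]:
    """Find the first target entry matching the source descriptor.

    Returns the mdev_attrs needed to create a compatible device (possibly
    an empty list), or None if no entry matches.
    """
    for entry in target_entries:
        api = entry["device_api"]
        if (source["device_api"] == api
                and source["vendor"] == entry["vendor"]
                and source["version"] == entry["version"]  # assumed: exact
                and source.get(api) == entry.get(api)):    # bus-specific IDs
            return entry.get("mdev_attrs", [])
    return None


# Hypothetical descriptors following the layout sketched above.
source = {
    "device_api": "vfio-pci",
    "vendor": "vendor-driver-name",
    "version": {"major": 0, "minor": 1},
    "vfio-pci": {"vendor": 0x1234, "device": 0x5678},
}
target = [dict(source, mdev_attrs=[{"attribute0": "VALUE"}])]
attrs = required_attrs(source, target)
```

The returned attribute list is the part that could, in principle, be handed to a tool such as mdevctl to create a matching device.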
NB. I'm also considering how portions of this might be compatible with mdevctl such that we could direct mdevctl to create a compatible device using information from this compatibility interface.
Thanks, Alex
On Tue, Jul 14, 2020 at 02:47:15PM -0600, Alex Williamson wrote:
On Tue, 14 Jul 2020 17:47:22 +0100 Daniel P. Berrangé berrange@redhat.com wrote:
I'm sure OpenStack maintainers can speak to this more, as they've put a lot of work into their scheduling engine to optimize the way it places VMs largely driven from simple structured data reported from hosts.
I think we've weeded out that our intended approach is not worthwhile, testing a compatibility string at a device is too much overhead, we need to provide enough information to the management engine to predict the response without interaction beyond the initial capability probing.
Just to clarify in case people mis-interpreted my POV...
I think that testing a compatibility string at a device *is* useful, as it allows for a final accurate safety check to be performed before the migration stream starts. Libvirt could use that reasonably easily I believe.
It just isn't sufficient for a complete solution.
In parallel with the device level test in sysfs, we need something else to support the host placement selection problems in an efficient way, as you are trying to address in the remainder of your mail.
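The two complementary pieces described above, a cheap placement filter plus a final device-level safety check, could be combined roughly as follows. This is a sketch only: `choose_target`, the shape of the candidate data, and the `final_check` callback (standing in for the sysfs read/write test) are all assumptions for illustration.

```python
from typing import Callable, Iterable, Optional, Set, Tuple


def choose_target(source_string: str,
                  candidates: Iterable[Tuple[str, Set[str]]],
                  final_check: Callable[[str], bool]) -> Optional[str]:
    """Pick the first host that passes both stages.

    Stage 1 (placement): set membership on data reported by the hosts.
    Stage 2 (safety): an accurate per-device test, run only on survivors.
    """
    for host, accepted in candidates:
        if source_string not in accepted:
            continue  # filtered out without touching the device
        if final_check(host):
            return host
    return None


# Hypothetical data: host-b passes the filter but fails the device test.
hosts = [("host-a", {"x"}), ("host-b", {"s"}), ("host-c", {"s"})]
chosen = choose_target("s", hosts, lambda host: host != "host-b")
```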
Regards, Daniel
* Alex Williamson (alex.williamson@redhat.com) wrote:
On Tue, 14 Jul 2020 11:21:29 +0100 Daniel P. Berrangé berrange@redhat.com wrote:
On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:
hi folks, we are defining a device migration compatibility interface that helps upper layer stack like openstack/ovirt/libvirt to check if two devices are live migration compatible. The "devices" here could be MDEVs, physical devices, or hybrid of the two. e.g. we could use it to check whether
- a src MDEV can migrate to a target MDEV,
- a src VF in SRIOV can migrate to a target VF in SRIOV,
- a src MDEV can migrate to a target VF in SRIOV. (e.g. SIOV/SRIOV backward compatibility case)
The upper layer stack could use this interface as the last step to check if one device is able to migrate to another device before triggering a real live migration procedure. we are not sure if this interface is of value or help to you. please don't hesitate to drop your valuable comments.
(1) interface definition
The interface is defined as follows:

             __    userspace
              /\              \
             /                 \write
            / read              \
   ________/__________       ___\|/_____________
  | migration_version |     | migration_version |-->check migration
  ---------------------     ---------------------   compatibility
     device A                    device B

a device attribute named migration_version is defined under each device's sysfs node. e.g. (/sys/bus/pci/devices/0000:00:02.0/$mdev_UUID/migration_version). userspace tools read the migration_version as a string from the source device, and write it to the migration_version sysfs attribute in the target device.
The userspace should treat ANY of below conditions as two devices not compatible:
- any one of the two devices does not have a migration_version attribute
- error when reading from migration_version attribute of one device
- error when writing migration_version string of one device to migration_version attribute of the other device
The string read from migration_version attribute is defined by device vendor driver and is completely opaque to the userspace. for a Intel vGPU, string format can be defined like "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
for an NVMe VF connecting to a remote storage. it could be "PCI ID" + "driver version" + "configured remote storage URL"
for a QAT VF, it may be "PCI ID" + "driver version" + "supported encryption set".
(to avoid namespace conflicts between vendors, we may prefix a driver name to each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)
It's very strange to define it as opaque and then proceed to describe the contents of that opaque string. The point is that its contents are defined by the vendor driver to describe the device, driver version, and possibly metadata about the configuration of the device. One instance of a device might generate a different string from another. The string that a device produces is not necessarily the only string the vendor driver will accept, for example the driver might support backwards compatible migrations.
(As I've said in the previous discussion, off one of the patch series)
My view is it makes sense to have a half-way house on the opaqueness of this string; I'd expect to have an ID and version that are human readable, maybe a device ID/name that's human interpretable and then a bunch of other cruft that may be device/vendor/version specific.
I'm thinking that we want to be able to report problems and include the string, so that the user can easily identify the device that was complaining and notice a difference in versions, and perhaps also use it in compatibility patterns to find compatible hosts; but that does get tricky when it's a 'ask the device if it's compatible'.
Dave
-- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Tue, 14 Jul 2020 18:19:46 +0100 "Dr. David Alan Gilbert" dgilbert@redhat.com wrote:
* Alex Williamson (alex.williamson@redhat.com) wrote:
On Tue, 14 Jul 2020 11:21:29 +0100 Daniel P. Berrangé berrange@redhat.com wrote:
On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:
hi folks, we are defining a device migration compatibility interface that helps upper layer stack like openstack/ovirt/libvirt to check if two devices are live migration compatible. The "devices" here could be MDEVs, physical devices, or hybrid of the two. e.g. we could use it to check whether
- a src MDEV can migrate to a target MDEV,
- a src VF in SRIOV can migrate to a target VF in SRIOV,
- a src MDEV can migrate to a target VF in SRIOV. (e.g. SIOV/SRIOV backward compatibility case)
The upper layer stack could use this interface as the last step to check if one device is able to migrate to another device before triggering a real live migration procedure. we are not sure if this interface is of value or help to you. please don't hesitate to drop your valuable comments.
(1) interface definition
The interface is defined as follows:

             __    userspace
              /\              \
             /                 \write
            / read              \
   ________/__________       ___\|/_____________
  | migration_version |     | migration_version |-->check migration
  ---------------------     ---------------------   compatibility
     device A                    device B

a device attribute named migration_version is defined under each device's sysfs node. e.g. (/sys/bus/pci/devices/0000:00:02.0/$mdev_UUID/migration_version). userspace tools read the migration_version as a string from the source device, and write it to the migration_version sysfs attribute in the target device.
The userspace should treat ANY of below conditions as two devices not compatible:
- any one of the two devices does not have a migration_version attribute
- error when reading from migration_version attribute of one device
- error when writing migration_version string of one device to migration_version attribute of the other device
The string read from migration_version attribute is defined by device vendor driver and is completely opaque to the userspace. for a Intel vGPU, string format can be defined like "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
for an NVMe VF connecting to a remote storage. it could be "PCI ID" + "driver version" + "configured remote storage URL"
for a QAT VF, it may be "PCI ID" + "driver version" + "supported encryption set".
(to avoid namespace confliction from each vendor, we may prefix a driver name to each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)
It's very strange to define it as opaque and then proceed to describe the contents of that opaque string. The point is that its contents are defined by the vendor driver to describe the device, driver version, and possibly metadata about the configuration of the device. One instance of a device might generate a different string from another. The string that a device produces is not necessarily the only string the vendor driver will accept, for example the driver might support backwards compatible migrations.
(As I've said in the previous discussion, off one of the patch series)
My view is it makes sense to have a half-way house on the opaqueness of this string; I'd expect to have an ID and version that are human readable, maybe a device ID/name that's human interpretable and then a bunch of other cruft that maybe device/vendor/version specific.
I'm thinking that we want to be able to report problems and include the string and the user to be able to easily identify the device that was complaining and notice a difference in versions, and perhaps also use it in compatibility patterns to find compatible hosts; but that does get tricky when it's a 'ask the device if it's compatible'.
In the reply I just sent to Dan, I gave this example of what a "compatibility string" might look like represented as json:
{
  "device_api": "vfio-pci",
  "vendor": "vendor-driver-name",
  "version": {
    "major": 0,
    "minor": 1
  },
  "vfio-pci": { // Based on above device_api
    "vendor": 0x1234, // Values for the exposed device
    "device": 0x5678,
    // Possibly further parameters for a more specific match
  },
  "mdev_attrs": [
    { "attribute0": "VALUE" }
  ]
}
Are you thinking that we might allow the vendor to include a vendor specific array where we'd simply require that both sides have matching fields and values? ie.
  "vendor_fields": [
    { "unknown_field0": "unknown_value0" },
    { "unknown_field1": "unknown_value1" },
  ]
We could certainly make that part of the spec, but I can't really figure the value of it other than to severely restrict compatibility, which the vendor could already do via the version.major value. Maybe they'd want to put a build timestamp, random uuid, or source sha1 into such a field to make absolutely certain compatibility is only determined between identical builds? Thanks,
Alex
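As a rough illustration of how a management tool might compare two such json descriptions, here is a minimal Python sketch. The matching rules are assumptions, not something this thread has settled: equal device_api and vendor, equal version.major, a target version.minor no lower than the source's, an identical API-specific section, and vendor_fields that match field for field. The names is_compatible and flatten_fields are hypothetical.

```python
def flatten_fields(fields):
    """Merge a list of single-entry dicts into one dict for comparison."""
    merged = {}
    for entry in fields:
        merged.update(entry)
    return merged


def is_compatible(src, dst):
    """Return True if the device described by src could migrate to dst."""
    # The device API and the vendor driver name must be identical.
    for key in ("device_api", "vendor"):
        if src.get(key) != dst.get(key):
            return False
    # Assumed rule: same major version, target minor >= source minor.
    src_ver = src.get("version", {})
    dst_ver = dst.get("version", {})
    if src_ver.get("major") != dst_ver.get("major"):
        return False
    if dst_ver.get("minor", 0) < src_ver.get("minor", 0):
        return False
    # The section named after device_api (e.g. "vfio-pci") describes the
    # device the guest sees, so it has to match exactly.
    api = src.get("device_api")
    if src.get(api) != dst.get(api):
        return False
    # vendor_fields: both sides must carry the same fields and values.
    return (flatten_fields(src.get("vendor_fields", []))
            == flatten_fields(dst.get("vendor_fields", [])))
```

Under these assumed rules the check is directional: a source at minor 1 may move to a target at minor 2, but not the other way round.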
On Wed, 15 Jul 2020 at 05:00, Alex Williamson alex.williamson@redhat.com wrote:
In the reply I just sent to Dan, I gave this example of what a "compatibility string" might look like represented as json:
{
  "device_api": "vfio-pci",
  "vendor": "vendor-driver-name",
  "version": {
    "major": 0,
    "minor": 1
  },
The OpenStack Placement service doesn't support filtering target hosts by semver syntax, although we could code this filtering logic in a scheduler filter in Python. Basically, Placement only supports filtering hosts by traits (the same thing as labels or tags). The nova scheduler calls the Placement service to filter the hosts first, then goes through all the scheduler filters. It would be better if the Placement service could already filter out more of the incompatible hosts in that first step.
  "vfio-pci": { // Based on above device_api
    "vendor": 0x1234, // Values for the exposed device
    "device": 0x5678,
    // Possibly further parameters for a more specific match
  },
OpenStack already uses the vendor and device ID to separate devices into different resource pools, and the scheduler filters hosts based on that, so I think it needn't be part of this compatibility string.
  "mdev_attrs": [
    { "attribute0": "VALUE" }
  ]
}
Are you thinking that we might allow the vendor to include a vendor specific array where we'd simply require that both sides have matching fields and values? ie.
  "vendor_fields": [
    { "unknown_field0": "unknown_value0" },
    { "unknown_field1": "unknown_value1" },
  ]
Since Placement supports traits (labels, tags), Placement only has to match those fields, so this isn't a problem for OpenStack: OpenStack needn't know the meaning of those fields. But a trait is just a label, not a key-value pair. If we have to, we can code this scheduler filter in Python, but as above, incompatible hosts then can't be filtered out in the first step, the Placement service filtering.
We could certainly make that part of the spec, but I can't really figure the value of it other than to severely restrict compatibility, which the vendor could already do via the version.major value. Maybe they'd want to put a build timestamp, random uuid, or source sha1 into such a field to make absolutely certain compatibility is only determined between identical builds? Thanks,
Alex
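To make the trait limitation above concrete, here is a small Python sketch of how exact-match fields could be folded into label-style traits. It assumes a custom trait name must start with CUSTOM_ and use only upper-case letters, digits and underscores; the CUSTOM_MIG_ prefix and the helpers to_trait, required_traits and host_matches are hypothetical, not part of Placement or nova.

```python
import re


def to_trait(key, value):
    """Fold a key/value pair into one label, since a trait carries no value."""
    raw = f"CUSTOM_MIG_{key}_{value}".upper()
    # Replace anything outside the assumed trait alphabet with '_'.
    return re.sub(r"[^A-Z0-9_]", "_", raw)


def required_traits(desc):
    """Build the set of traits a target host would have to advertise."""
    traits = {
        to_trait("vendor", desc["vendor"]),
        to_trait("major", desc["version"]["major"]),
    }
    for entry in desc.get("vendor_fields", []):
        for key, value in entry.items():
            traits.add(to_trait(key, value))
    return traits


def host_matches(required, host_traits):
    """Trait filtering is plain set inclusion: every required label is present."""
    return required <= host_traits
```

Equality checks map onto set inclusion, but a range rule such as "minor version at least N" does not. That part would either need one trait per accepted minor version or a scheduler filter written in Python, which is the trade-off described above.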
On Wed, 15 Jul 2020 15:37:19 +0800 Alex Xu soulxu@gmail.com wrote:
On Wed, 15 Jul 2020 at 05:00, Alex Williamson alex.williamson@redhat.com wrote:
In the reply I just sent to Dan, I gave this example of what a "compatibility string" might look like represented as json:
{
  "device_api": "vfio-pci",
  "vendor": "vendor-driver-name",
  "version": {
    "major": 0,
    "minor": 1
  },
The OpenStack Placement service doesn't support filtering target hosts by semver syntax, although we could code this filtering logic in a scheduler filter in Python. Basically, Placement only supports filtering hosts by traits (the same thing as labels or tags). The nova scheduler calls the Placement service to filter the hosts first, then goes through all the scheduler filters. It would be better if the Placement service could already filter out more of the incompatible hosts in that first step.
  "vfio-pci": { // Based on above device_api
    "vendor": 0x1234, // Values for the exposed device
    "device": 0x5678,
    // Possibly further parameters for a more specific match
  },
OpenStack already uses the vendor and device ID to separate devices into different resource pools, and the scheduler filters hosts based on that, so I think it needn't be part of this compatibility string.
This is the part of the string that actually says what the resulting device is, so it's a rather fundamental part of the description. This is where we'd determine that a physical to mdev migration is possible or that different mdev types result in the same guest PCI device, possibly with attributes set as specified later in the output.
  "mdev_attrs": [
    { "attribute0": "VALUE" }
  ]
}
Are you thinking that we might allow the vendor to include a vendor specific array where we'd simply require that both sides have matching fields and values? ie.
That's what I'm defining in the below vendor_fields, the above mdev_attrs would be specifying attributes of the device that must be set in order to create the device with the compatibility described. For example if we're describing compatibility for type foo-1, which is a base type that can be equivalent to type foo-3 if type foo-1 is created with aggregation=3, this is where that would be defined. Thanks,
Alex
  "vendor_fields": [
    { "unknown_field0": "unknown_value0" },
    { "unknown_field1": "unknown_value1" },
  ]
Since Placement supports traits (labels, tags), Placement only has to match those fields, so this isn't a problem for OpenStack: OpenStack needn't know the meaning of those fields. But a trait is just a label, not a key-value pair. If we have to, we can code this scheduler filter in Python, but as above, incompatible hosts then can't be filtered out in the first step, the Placement service filtering.
We could certainly make that part of the spec, but I can't really figure the value of it other than to severely restrict compatibility, which the vendor could already do via the version.major value. Maybe they'd want to put a build timestamp, random uuid, or source sha1 into such a field to make absolutely certain compatibility is only determined between identical builds? Thanks,
Alex
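The foo-1/foo-3 aggregation example can be sketched as a small lookup. This is only an illustration: the table layout, the "exposes" key and the function creation_recipes are hypothetical, standing in for whatever a vendor driver would actually report through mdev_attrs.

```python
# Hypothetical description of what a target host offers: creating the given
# mdev type with the given attributes yields the device named in "exposes".
TARGET_OFFERS = [
    {"type": "foo-1", "mdev_attrs": {}, "exposes": "foo-1"},
    {"type": "foo-1", "mdev_attrs": {"aggregation": "3"}, "exposes": "foo-3"},
    {"type": "foo-3", "mdev_attrs": {}, "exposes": "foo-3"},
]


def creation_recipes(wanted, offers):
    """Return (mdev type, attributes) pairs that would yield the wanted device."""
    return [
        (offer["type"], offer["mdev_attrs"])
        for offer in offers
        if offer["exposes"] == wanted
    ]
```

With this table, a source device of type foo-3 has two candidate recipes on the target: a plain foo-3, or a foo-1 created with aggregation=3.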
On Tue, Jul 14, 2020 at 02:59:48PM -0600, Alex Williamson wrote:
In the reply I just sent to Dan, I gave this example of what a "compatibility string" might look like represented as json:
{
  "device_api": "vfio-pci",
  "vendor": "vendor-driver-name",
  "version": {
    "major": 0,
    "minor": 1
  },
  "vfio-pci": { // Based on above device_api
    "vendor": 0x1234, // Values for the exposed device
    "device": 0x5678,
    // Possibly further parameters for a more specific match
  },
  "mdev_attrs": [
    { "attribute0": "VALUE" }
  ]
}
Are you thinking that we might allow the vendor to include a vendor specific array where we'd simply require that both sides have matching fields and values? ie.
  "vendor_fields": [
    { "unknown_field0": "unknown_value0" },
    { "unknown_field1": "unknown_value1" },
  ]
We could certainly make that part of the spec, but I can't really figure the value of it other than to severely restrict compatibility, which the vendor could already do via the version.major value. Maybe they'd want to put a build timestamp, random uuid, or source sha1 into such a field to make absolutely certain compatibility is only determined between identical builds? Thanks,
Yes, I agree the kernel could expose such a sysfs interface to teach OpenStack how to filter out devices. But I still think the proposed migration_version (or, renamed, migration_compatibility) interface is required so that libvirt can double check.
Consider the following scenario:
1. OpenStack chooses the target device by reading the sysfs interface (in json format) of the source device. OpenStack is now pretty sure the two devices are migration compatible.
2. OpenStack asks libvirt to create the target VM with the target device and start live migration.
3. libvirt receives the request, and now has two choices:
   (1) create the target VM and target device and start live migration directly
   (2) double check that the target device is compatible with the source device before doing the remaining tasks.
The factors that determine whether two devices are live migration compatible are complicated and may change dynamically (e.g. a driver upgrade or a configuration change), and libvirt should not rely totally on the input from OpenStack. So I think the cost for libvirt is lower if it chooses (2) rather than (1): if it learns of an incompatibility early, it at least has no need to cancel the migration and destroy the VM.
So the kernel may need to expose two parallel interfaces:
(1) one in json format, enumerating all possible fields and comparison methods, to tell OpenStack how to find a matching target device
(2) an opaque driver-defined string, which requires a write-and-test on the target and is used by libvirt to verify device compatibility, rather than relying on the accuracy of OpenStack's input or on the kernel driver detecting incompatibility immediately after migration starts.
Does it make sense?
Thanks Yan
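The write-and-test double check in (2) could look roughly like the following Python sketch. It follows the rules from the original proposal (a missing attribute, a read error, or a write error all mean "not compatible"); the function name migration_compatible and the way device directories are passed in are assumptions made for illustration.

```python
import os
from pathlib import Path


def migration_compatible(src_dev, dst_dev):
    """Read migration_version from the source and write it to the target.

    src_dev and dst_dev are the devices' sysfs directories. Any failure is
    reported as "not compatible".
    """
    src_attr = Path(src_dev) / "migration_version"
    dst_attr = Path(dst_dev) / "migration_version"
    # A device without the attribute is treated as incompatible.
    if not src_attr.exists() or not dst_attr.exists():
        return False
    try:
        version = src_attr.read_bytes()
        # Open write-only without creating the file; the target's vendor
        # driver is expected to fail the write if it rejects the string.
        fd = os.open(dst_attr, os.O_WRONLY)
        try:
            os.write(fd, version)
        finally:
            os.close(fd)
    except OSError:
        return False
    return True
```

Because the string stays opaque, the caller never parses it; the only signal is whether the write to the target succeeds.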
-----Original Message-----
From: Zhao, Yan Y yan.y.zhao@intel.com
Sent: 15 July 2020 16:21
To: Alex Williamson alex.williamson@redhat.com
Cc: Dr. David Alan Gilbert dgilbert@redhat.com; Daniel P. Berrangé berrange@redhat.com; devel@ovirt.org; openstack-discuss@lists.openstack.org; libvir-list@redhat.com; intel-gvt-dev@lists.freedesktop.org; kvm@vger.kernel.org; qemu-devel@nongnu.org; smooney@redhat.com; eskultet@redhat.com; cohuck@redhat.com; dinechin@redhat.com; corbet@lwn.net; kwankhede@nvidia.com; eauger@redhat.com; Ding, Jian-feng jian-feng.ding@intel.com; Xu, Hejie hejie.xu@intel.com; Tian, Kevin kevin.tian@intel.com; zhenyuw@linux.intel.com; bao.yumeng@zte.com.cn; Wang, Xin-ran xin-ran.wang@intel.com; Feng, Shaohe shaohe.feng@intel.com
Subject: Re: device compatibility interface for live migration with assigned devices
Yes, I agree the kernel could expose such a sysfs interface to teach OpenStack how to filter out devices. But I still think the proposed migration_version (or, renamed, migration_compatibility) interface is required so that libvirt can double check.
Consider the following scenario:
1. OpenStack chooses the target device by reading the sysfs interface (in json format) of the source device. OpenStack is now pretty sure the two devices are migration compatible.
2. OpenStack asks libvirt to create the target VM with the target device and start live migration.
3. libvirt receives the request, and now has two choices:
   (1) create the target VM and target device and start live migration directly
   (2) double check that the target device is compatible with the source device before doing the remaining tasks.
The factors that determine whether two devices are live migration compatible are complicated and may change dynamically (e.g. a driver upgrade or a configuration change), and libvirt should not rely totally on the input from OpenStack. So I think the cost for libvirt is lower if it chooses (2) rather than (1): if it learns of an incompatibility early, it at least has no need to cancel the migration and destroy the VM.
So the kernel may need to expose two parallel interfaces:
(1) one in json format, enumerating all possible fields and comparison methods, to tell OpenStack how to find a matching target device
(2) an opaque driver-defined string, which requires a write-and-test on the target and is used by libvirt to verify device compatibility, rather than relying on the accuracy of OpenStack's input or on the kernel driver detecting incompatibility immediately after migration starts.
Does it make sense?
[Feng, Shaohe] Yes, we had better have two interfaces for the different phases of live migration. For (1), the scheduler can leverage this information to minimize the migration failure rate. The problem is which values should be used to guide the scheduler. The values should be human readable. For (2), yes, we can't assume that migration always succeeds, so a double check is needed.
BR
Shaohe
Thanks Yan
On Wed, 15 Jul 2020 at 16:32, Yan Zhao yan.y.zhao@intel.com wrote:
On Tue, Jul 14, 2020 at 02:59:48PM -0600, Alex Williamson wrote:
On Tue, 14 Jul 2020 18:19:46 +0100 "Dr. David Alan Gilbert" dgilbert@redhat.com wrote:
- Alex Williamson (alex.williamson@redhat.com) wrote:
It's very strange to define it as opaque and then proceed to describe the contents of that opaque string. The point is that its contents are defined by the vendor driver to describe the device, driver version, and possibly metadata about the configuration of the device. One instance of a device might generate a different string from another. The string that a device produces is not necessarily the only string the vendor driver will accept, for example the driver might support backwards compatible migrations.
(As I've said in the previous discussion, off one of the patch series)
My view is it makes sense to have a half-way house on the opaqueness of this string; I'd expect to have an ID and version that are human readable, maybe a device ID/name that's human interpretable and then a bunch of other cruft that may be device/vendor/version specific.
I'm thinking that we want to be able to report problems and include the string and the user to be able to easily identify the device that was complaining and notice a difference in versions, and perhaps also use it in compatibility patterns to find compatible hosts; but that does get tricky when it's a 'ask the device if it's compatible'.
In the reply I just sent to Dan, I gave this example of what a "compatibility string" might look like represented as json:

    {
      "device_api": "vfio-pci",
      "vendor": "vendor-driver-name",
      "version": {
        "major": 0,
        "minor": 1
      },
      "vfio-pci": { // Based on above device_api
        "vendor": 0x1234, // Values for the exposed device
        "device": 0x5678,
        // Possibly further parameters for a more specific match
      },
      "mdev_attrs": [
        { "attribute0": "VALUE" }
      ]
    }
Are you thinking that we might allow the vendor to include a vendor specific array where we'd simply require that both sides have matching fields and values? ie.

    "vendor_fields": [
      { "unknown_field0": "unknown_value0" },
      { "unknown_field1": "unknown_value1" },
    ]
We could certainly make that part of the spec, but I can't really figure the value of it other than to severely restrict compatibility, which the vendor could already do via the version.major value. Maybe they'd want to put a build timestamp, random uuid, or source sha1 into such a field to make absolutely certain compatibility is only determined between identical builds? Thanks,
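As an illustration of how a management tool might compare two such descriptors, here is a rough Python sketch. It is not part of any proposed spec: the function name descriptors_compatible is hypothetical, the rule that a target with an equal or newer minor version accepts the source is an assumption, and vendor_fields is compared for exact equality, following the "matching fields and values" idea above.

```python
def descriptors_compatible(src: dict, dst: dict) -> bool:
    """Compare two parsed compatibility descriptors (hypothetical helper)."""
    # The generic identification fields must be present and identical.
    for key in ("device_api", "vendor"):
        if src.get(key) is None or src.get(key) != dst.get(key):
            return False

    src_ver = src.get("version", {})
    dst_ver = dst.get("version", {})

    # A differing major version rules the migration out entirely.
    if src_ver.get("major") is None or src_ver.get("major") != dst_ver.get("major"):
        return False

    # ASSUMPTION: a target with an equal or newer minor version can
    # accept the source; the thread does not define minor semantics.
    if dst_ver.get("minor", 0) < src_ver.get("minor", 0):
        return False

    # The block named after device_api (e.g. "vfio-pci") must match.
    api = src["device_api"]
    if src.get(api) != dst.get(api):
        return False

    # Vendor-specific fields are not interpreted, only required to be equal.
    return src.get("vendor_fields", []) == dst.get("vendor_fields", [])
```

With this rule, any vendor_fields entry such as a build timestamp or source sha1 restricts compatibility to identical builds, which is the effect questioned above.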
Yes, I agree the kernel could expose such a sysfs interface to teach openstack how to filter out devices. But I still think the proposed migration_version (or renamed migration_compatibility) interface is required for libvirt to do a double check.
In the following scenario:
1. openstack chooses the target device by reading the sysfs interface (of json format) of the source device. Openstack is now pretty sure the two devices are migration compatible.
2. openstack asks libvirt to create the target VM with the target device and start live migration.
3. libvirt now receives the request, so it has two choices:
   (1) create the target VM & target device and start live migration directly;
   (2) double check if the target device is compatible with the source device before doing the remaining tasks.
Because the factors to determine whether two devices are live migration compatible are complicated and may be dynamically changing, (e.g. driver upgrade or configuration changes), and also because libvirt should not totally rely on the input from openstack, I think the cost for libvirt is relatively lower if it chooses to go (2) than (1). At least it has no need to cancel migration and destroy the VM if it knows it earlier.
If the driver is upgraded or the configuration changes, I guess the openstack agent on the host would be restarted, which will update the info in the scheduler, so it should be fine.
For (2), it probably needs to be used as a double check when the orchestration layer doesn't implement the check logic in the scheduler.
So, it means the kernel may need to expose two parallel interfaces:
(1) a json-format interface, enumerating all possible fields and comparison methods, to tell openstack how to find a matching target device;
(2) an opaque driver-defined string, requiring write-and-test on the target, which libvirt uses to verify device compatibility, rather than relying on the accuracy of the input from openstack or on the kernel driver detecting incompatibility immediately after migration starts.
Does it make sense?
Thanks Yan
On Wed, 15 Jul 2020 16:20:41 +0800 Yan Zhao yan.y.zhao@intel.com wrote:
So, it means the kernel may need to expose two parallel interfaces:
(1) a json-format interface, enumerating all possible fields and comparison methods, to tell openstack how to find a matching target device;
(2) an opaque driver-defined string, requiring write-and-test on the target, which libvirt uses to verify device compatibility, rather than relying on the accuracy of the input from openstack or on the kernel driver detecting incompatibility immediately after migration starts.
Does it make sense?
No, libvirt is not responsible for the success or failure of the migration, it's the vendor driver's responsibility to encode compatibility information early in the migration stream and error should the incoming device prove to be incompatible. It's not libvirt's job to second guess the management engine and I would not support a duplicate interface only for that purpose. Thanks,
Alex
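The alternative described in the reply above, where the vendor driver encodes compatibility information early in the migration stream and errors out on the incoming side, could look roughly like the Python sketch below. The header layout, the VDEV magic value, and every name in it are hypothetical; real vendor drivers define their own device state format, and the "same major, minor not newer than the target" rule is an assumption.

```python
import struct

# Hypothetical header: 4-byte magic, 16-bit major, 16-bit minor (little endian).
HEADER = struct.Struct("<4sHH")
MAGIC = b"VDEV"


class IncompatibleDevice(Exception):
    """Raised by the incoming side when it cannot accept the stream."""


def save_header(major: int, minor: int) -> bytes:
    """Source side: emit the compatibility header before any device state."""
    return HEADER.pack(MAGIC, major, minor)


def load_header(stream: bytes, my_major: int, my_minor: int) -> bytes:
    """Target side: validate the header and return the remaining state bytes."""
    if len(stream) < HEADER.size:
        raise IncompatibleDevice("truncated migration stream")
    magic, major, minor = HEADER.unpack_from(stream)
    if magic != MAGIC:
        raise IncompatibleDevice("unknown stream format")
    # ASSUMPTION: same major and a source minor not newer than the target's.
    if major != my_major or minor > my_minor:
        raise IncompatibleDevice(
            f"source v{major}.{minor} not accepted by target v{my_major}.{my_minor}"
        )
    return stream[HEADER.size:]
```

Because the check runs before any device state is loaded, an incompatible target fails at the very start of the stream, but only after the migration has already been set up, which is the cost the earlier messages were trying to avoid.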
* Alex Williamson (alex.williamson@redhat.com) wrote:
No, libvirt is not responsible for the success or failure of the migration, it's the vendor driver's responsibility to encode compatibility information early in the migration stream and error should the incoming device prove to be incompatible. It's not libvirt's job to second guess the management engine and I would not support a duplicate interface only for that purpose. Thanks,
libvirt does try to enforce it for other things; trying to stop a bad migration from starting.
Dave
Alex
-- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Fri, 17 Jul 2020 19:03:44 +0100 "Dr. David Alan Gilbert" dgilbert@redhat.com wrote:
* Alex Williamson (alex.williamson@redhat.com) wrote:
No, libvirt is not responsible for the success or failure of the migration, it's the vendor driver's responsibility to encode compatibility information early in the migration stream and error should the incoming device prove to be incompatible. It's not libvirt's job to second guess the management engine and I would not support a duplicate interface only for that purpose. Thanks,
libvirt does try to enforce it for other things; trying to stop a bad migration from starting.
Even if libvirt did want to verify why would we want to support a separate opaque interface for that purpose versus a parse-able interface? If we get different results, we've failed. Thanks,
Alex
* Alex Williamson (alex.williamson@redhat.com) wrote:
In the reply I just sent to Dan, I gave this example of what a "compatibility string" might look like represented as json:
{
  "device_api": "vfio-pci",
  "vendor": "vendor-driver-name",
  "version": {
    "major": 0,
    "minor": 1
  },
  "vfio-pci": { // Based on above device_api
    "vendor": 0x1234, // Values for the exposed device
    "device": 0x5678,
    // Possibly further parameters for a more specific match
  },
  "mdev_attrs": [
    { "attribute0": "VALUE" }
  ]
}
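To make the idea concrete, here is a minimal sketch of how a management tool might compare two such descriptors once they have been parsed into dictionaries. The matching rule (equal device_api, equal vendor, equal version.major, identical device_api section) is an assumption for illustration; the thread does not define one, and the function name is hypothetical.

```python
def descriptors_match(src, dst):
    """Hypothetical matching rule for two parsed compatibility descriptors."""
    # Top-level identity fields must be present and equal.
    for key in ("device_api", "vendor"):
        if key not in src or src[key] != dst.get(key):
            return False

    # Assumption: a differing major version is incompatible, minor may differ.
    src_major = src.get("version", {}).get("major")
    if src_major is None or src_major != dst.get("version", {}).get("major"):
        return False

    # Compare the section named after device_api (e.g. "vfio-pci") exactly.
    api = src["device_api"]
    return api in src and src[api] == dst.get(api)
```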
Are you thinking that we might allow the vendor to include a vendor specific array where we'd simply require that both sides have matching fields and values? ie.
"vendor_fields": [
  { "unknown_field0": "unknown_value0" },
  { "unknown_field1": "unknown_value1" },
]
We could certainly make that part of the spec, but I can't really figure the value of it other than to severely restrict compatibility, which the vendor could already do via the version.major value. Maybe they'd want to put a build timestamp, random uuid, or source sha1 into such a field to make absolutely certain compatibility is only determined between identical builds? Thanks,
No, I'd mostly anticipated matching on the vendor and device and maybe a version number for the bit the user specifies; I had assumed all that 'vendor cruft' was still mostly opaque; having said that, if it did become a list of attributes like that (some of which were vendor specific) that would make sense to me.
Dave
Alex
-- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Wed, Jul 15, 2020 at 12:16 AM, Alex Williamson alex.williamson@redhat.com wrote:
On Tue, 14 Jul 2020 11:21:29 +0100 Daniel P. Berrangé berrange@redhat.com wrote:
On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:
hi folks, we are defining a device migration compatibility interface that helps upper layer stack like openstack/ovirt/libvirt to check if two devices are live migration compatible. The "devices" here could be MDEVs, physical devices, or hybrid of the two.
e.g. we could use it to check whether
- a src MDEV can migrate to a target MDEV,
- a src VF in SRIOV can migrate to a target VF in SRIOV,
- a src MDEV can migrate to a target VF in SRIOV. (e.g. SIOV/SRIOV backward compatibility case)
The upper layer stack could use this interface as the last step to check if one device is able to migrate to another device before triggering a real live migration procedure. we are not sure if this interface is of value or help to you. please don't hesitate to drop your valuable comments.
(1) interface definition
The interface is defined in below way:
             __
userspace
             /\              \
            /                 \write
           / read              \
  ________/__________       ___\|/_____________
 | migration_version |     | migration_version |-->check migration
 ---------------------     ---------------------   compatibility
       device A                  device B
a device attribute named migration_version is defined under each device's sysfs node. e.g. (/sys/bus/pci/devices/0000:00:02.0/$mdev_UUID/migration_version).
userspace tools read the migration_version as a string from the source device, and write it to the migration_version sysfs attribute in the target device.
The userspace should treat ANY of below conditions as two devices not compatible:
- any one of the two devices does not have a migration_version attribute
- error when reading from migration_version attribute of one device
- error when writing migration_version string of one device to migration_version attribute of the other device
The string read from migration_version attribute is defined by device vendor driver and is completely opaque to the userspace.
for a Intel vGPU, string format can be defined like "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
for an NVMe VF connecting to a remote storage. it could be "PCI ID" + "driver version" + "configured remote storage URL"
If the "configured remote storage URL" is a setting that is configured before the device is used, then it isn't something we need for the migration compatibility check. OpenStack only needs to know that the target device's driver and hardware are compatible for migration. The scheduler then chooses a host with such a device, and OpenStack pre-configures the target host and target device before the migration, including setting the correct remote storage URL on the device. If we want, we can do a sanity check after the live migration with the os.
for a QAT VF, it may be "PCI ID" + "driver version" + "supported encryption set".
(to avoid namespace confliction from each vendor, we may prefix a driver name to each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)
It's very strange to define it as opaque and then proceed to describe the contents of that opaque string. The point is that its contents are defined by the vendor driver to describe the device, driver version, and possibly metadata about the configuration of the device. One instance of a device might generate a different string from another. The string that a device produces is not necessarily the only string the vendor driver will accept, for example the driver might support backwards compatible migrations.
(2) backgrounds
The reason we hope the migration_version string is opaque to the userspace is that it is hard to generalize standard comparing fields and comparing methods for different devices from different vendors. Though userspace now could still do a simple string compare to check if two devices are compatible, and result should also be right, it's still too limited as it excludes the possible candidate whose migration_version string fails to be equal.
e.g. an MDEV with mdev_type_1, aggregator count 3 is probably compatible with another MDEV with mdev_type_3, aggregator count 1, even their migration_version strings are not equal. (assumed mdev_type_3 is of 3 times equal resources of mdev_type_1).
besides that, driver version + configured resources are all elements demanding to take into account.
So, we hope leaving the freedom to vendor driver and let it make the final decision in a simple reading from source side and writing for test in the target side way.
we then think the device compatibility issues for live migration with assigned devices can be divided into two steps:
a. management tools filter out possible migration target devices. Tags could be created according to info from product specification. we think openstack/ovirt may have vendor proprietary components to create those customized tags for each product from each vendor.
for Intel vGPU, with a vGPU(a MDEV device) in source side, the tags to search target vGPU are like:
a tag for compatible parent PCI IDs,
a tag for a range of gvt driver versions,
a tag for a range of mdev type + aggregator count
for NVMe VF, the tags to search target VF may be like:
a tag for compatible PCI IDs,
a tag for a range of driver versions,
a tag for URL of configured remote storage.
I interpret this as hand waving, ie. the first step is for management tools to make a good guess :-\ We don't seem to be willing to say that a given mdev type can only migrate to a device with that same type. There's this aggregation discussion happening separately where a base mdev type might be created or later configured to be equivalent to a different type. The vfio migration API we've defined is also not limited to mdev devices, for example we could create vendor specific quirks or hooks to provide migration support for a physical PF/VF device. Within the realm of possibility then is that we could migrate between a physical device and an mdev device, which are simply different degrees of creating a virtualization layer in front of the device.
Requiring management application developers to figure out this possible compatibility based on prod specs is really unrealistic. Product specs are typically as clear as mud, and with the suggestion we consider different rules for different types of devices, add up to a huge amount of complexity. This isn't something app developers should have to spend their time figuring out.
Agreed.
The suggestion that we make use of vendor proprietary helper components is totally unacceptable. We need to be able to build a solution that works with exclusively an open source software stack.
I'm surprised to see this as well, but I'm not sure if Yan was really suggesting proprietary software so much as just vendor specific knowledge.
IMHO there needs to be a mechanism for the kernel to report via sysfs what versions are supported on a given device. This puts the job of reporting compatible versions directly under the responsibility of the vendor who writes the kernel driver for it. They are the ones with the best knowledge of the hardware they've built and the rules around its compatibility.
The version string discussed previously is the version string that represents a given device, possibly including driver information, configuration, etc. I think what you're asking for here is an enumeration of every possible version string that a given device could accept as an incoming migration stream. If we consider the string as opaque, that means the vendor driver needs to generate a separate string for every possible version it could accept, for every possible configuration option. That potentially becomes an excessive amount of data to either generate or manage.
As for the configuration options, there are two kinds that do not need to be part of the migration check.
* Configuration options that make the device different. For example (this could be a wrong example that does not match any real hardware), if a GPU supports vGPUs with 1024 * 768 resolution and vGPUs with 800 * 600 resolution, OpenStack will separate these two kinds of vGPUs into two separate resource pools, so the scheduler already ensures we get a host with such vGPU support. This does not need to be encoded into the 'version string' discussed here.
* Configuration options that are set before usage, like the 'configured remote storage URL' above. These do not need to be encoded into the 'version string' either, since OpenStack will configure the correct value before the migration.
Am I overestimating how vendors intend to use the version string?
We'd also need to consider devices that we could create, for instance providing the same interface enumeration prior to creating an mdev device to have a confidence level that the new device would be a valid target.
We defined the string as opaque to allow vendor flexibility and because defining a common format is hard. Do we need to revisit this part of the discussion to define the version string as non-opaque with parsing rules, probably with separate incoming vs outgoing interfaces? Thanks,
Alex
On 2020/7/14 7:29 AM, Yan Zhao wrote:
hi folks, we are defining a device migration compatibility interface that helps upper layer stack like openstack/ovirt/libvirt to check if two devices are live migration compatible. The "devices" here could be MDEVs, physical devices, or hybrid of the two. e.g. we could use it to check whether
- a src MDEV can migrate to a target MDEV,
- a src VF in SRIOV can migrate to a target VF in SRIOV,
- a src MDEV can migrate to a target VF in SRIOV. (e.g. SIOV/SRIOV backward compatibility case)
The upper layer stack could use this interface as the last step to check if one device is able to migrate to another device before triggering a real live migration procedure. we are not sure if this interface is of value or help to you. please don't hesitate to drop your valuable comments.
(1) interface definition The interface is defined in below way:
             __
userspace
             /\              \
            /                 \write
           / read              \
  ________/__________       ___\|/_____________
 | migration_version |     | migration_version |-->check migration
 ---------------------     ---------------------   compatibility
       device A                  device B
a device attribute named migration_version is defined under each device's sysfs node. e.g. (/sys/bus/pci/devices/0000:00:02.0/$mdev_UUID/migration_version).
Are you aware of the devlink-based device management interface that is proposed upstream? I think it has many advantages over sysfs. Have you considered switching to that?
userspace tools read the migration_version as a string from the source device, and write it to the migration_version sysfs attribute in the target device.
The userspace should treat ANY of below conditions as two devices not compatible:
- any one of the two devices does not have a migration_version attribute
- error when reading from migration_version attribute of one device
- error when writing migration_version string of one device to migration_version attribute of the other device
The string read from migration_version attribute is defined by device vendor driver and is completely opaque to the userspace.
My understanding is that making something opaque to userspace is not the philosophy of Linux. Instead of having a generic API with an opaque value, why not do it in a vendor-specific way, like:
1) expose the device capability in a vendor-specific way via sysfs/devlink or another API
2) management reads the capability on both src and dst and determines whether we can do the migration
This is the way we plan to do with vDPA.
Thanks
for a Intel vGPU, string format can be defined like "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
for an NVMe VF connecting to a remote storage. it could be "PCI ID" + "driver version" + "configured remote storage URL"
for a QAT VF, it may be "PCI ID" + "driver version" + "supported encryption set".
(to avoid namespace confliction from each vendor, we may prefix a driver name to each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)
(2) backgrounds
The reason we hope the migration_version string is opaque to the userspace is that it is hard to generalize standard comparing fields and comparing methods for different devices from different vendors. Though userspace now could still do a simple string compare to check if two devices are compatible, and result should also be right, it's still too limited as it excludes the possible candidate whose migration_version string fails to be equal. e.g. an MDEV with mdev_type_1, aggregator count 3 is probably compatible with another MDEV with mdev_type_3, aggregator count 1, even their migration_version strings are not equal. (assumed mdev_type_3 is of 3 times equal resources of mdev_type_1).
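The mdev_type_1 x 3 versus mdev_type_3 x 1 case above is the kind of decision a plain string compare cannot make but a vendor driver can. A minimal sketch of that reasoning follows; the table of resource units per type and both function names are hypothetical and only encode the assumption stated above, that mdev_type_3 has three times the resources of mdev_type_1.

```python
# Hypothetical table: resource units provided by one instance of each mdev type.
UNITS_PER_TYPE = {"mdev_type_1": 1, "mdev_type_3": 3}


def total_units(mdev_type, aggregator):
    """Total resources of a device: units of its type times aggregator count."""
    units = UNITS_PER_TYPE.get(mdev_type)
    return None if units is None else units * aggregator


def same_resources(src, dst):
    """src and dst are (mdev_type, aggregator_count) pairs."""
    src_units = total_units(*src)
    dst_units = total_units(*dst)
    # Unknown types are never considered equivalent.
    return src_units is not None and src_units == dst_units
```

Here ("mdev_type_1", 3) and ("mdev_type_3", 1) compare as equivalent even though strings built from those fields would differ.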
besides that, driver version + configured resources are all elements demanding to take into account.
So, we hope leaving the freedom to vendor driver and let it make the final decision in a simple reading from source side and writing for test in the target side way.
we then think the device compatibility issues for live migration with assigned devices can be divided into two steps:
a. management tools filter out possible migration target devices. Tags could be created according to info from product specification. we think openstack/ovirt may have vendor proprietary components to create those customized tags for each product from each vendor.
e.g. for Intel vGPU, with a vGPU(a MDEV device) in source side, the tags to search target vGPU are like:
a tag for compatible parent PCI IDs,
a tag for a range of gvt driver versions,
a tag for a range of mdev type + aggregator count
for NVMe VF, the tags to search target VF may be like:
a tag for compatible PCI IDs,
a tag for a range of driver versions,
a tag for URL of configured remote storage.
b. with the output from step a, openstack/ovirt/libvirt could use our proposed device migration compatibility interface to make sure the two devices are indeed live migration compatible before launching the real live migration process to start stream copying, src device stopping and target device resuming. It is supposed that this step would not bring any performance penalty as
-in kernel it's just a simple string decoding and comparing
-in openstack/ovirt, it could be done by extending current function check_can_live_migrate_destination, along side claiming target resources.[1]
[1] https://specs.openstack.org/openstack/nova-specs/specs/stein/approved/libvir...
Thanks Yan
On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote:
On 2020/7/14 7:29 AM, Yan Zhao wrote:
hi folks, we are defining a device migration compatibility interface that helps upper layer stack like openstack/ovirt/libvirt to check if two devices are live migration compatible. The "devices" here could be MDEVs, physical devices, or hybrid of the two. e.g. we could use it to check whether
- a src MDEV can migrate to a target MDEV,
- a src VF in SRIOV can migrate to a target VF in SRIOV,
- a src MDEV can migrate to a target VF in SRIOV. (e.g. SIOV/SRIOV backward compatibility case)
The upper layer stack could use this interface as the last step to check if one device is able to migrate to another device before triggering a real live migration procedure. we are not sure if this interface is of value or help to you. please don't hesitate to drop your valuable comments.
(1) interface definition The interface is defined in below way:
             __
userspace
             /\              \
            /                 \write
           / read              \
  ________/__________       ___\|/_____________
 | migration_version |     | migration_version |-->check migration
 ---------------------     ---------------------   compatibility
       device A                  device B
a device attribute named migration_version is defined under each device's sysfs node. e.g. (/sys/bus/pci/devices/0000:00:02.0/$mdev_UUID/migration_version).
Are you aware of the devlink-based device management interface that is proposed upstream? I think it has many advantages over sysfs. Have you considered switching to that?
not familiar with the devlink. will do some research of it.
userspace tools read the migration_version as a string from the source device, and write it to the migration_version sysfs attribute in the target device.
The userspace should treat ANY of below conditions as two devices not compatible:
- any one of the two devices does not have a migration_version attribute
- error when reading from migration_version attribute of one device
- error when writing migration_version string of one device to migration_version attribute of the other device
The string read from migration_version attribute is defined by device vendor driver and is completely opaque to the userspace.
My understanding is that something opaque to userspace is not the philosophy of Linux.
but the VFIO live migration in itself is essentially a big opaque stream to userspace.
Instead of having a generic API but opaque value, why not do in a vendor specific way like:
1) exposing the device capability in a vendor specific way via sysfs/devlink or other API
2) management read capability in both src and dst and determine whether we can do the migration
This is the way we plan to do with vDPA.
yes, in another reply, Alex proposed to use an interface in json format. I guess we can define something like
{
  "self" : [
    {
      "pciid" : "8086591d",
      "driver" : "i915",
      "gvt-version" : "v1",
      "mdev_type" : "i915-GVTg_V5_2",
      "aggregator" : "1",
      "pv-mode" : "none"
    }
  ],
  "compatible" : [
    {
      "pciid" : "8086591d",
      "driver" : "i915",
      "gvt-version" : "v1",
      "mdev_type" : "i915-GVTg_V5_2",
      "aggregator" : "1",
      "pv-mode" : "none"
    },
    {
      "pciid" : "8086591d",
      "driver" : "i915",
      "gvt-version" : "v1",
      "mdev_type" : "i915-GVTg_V5_4",
      "aggregator" : "2",
      "pv-mode" : "none"
    },
    {
      "pciid" : "8086591d",
      "driver" : "i915",
      "gvt-version" : "v2",
      "mdev_type" : "i915-GVTg_V5_4",
      "aggregator" : "2",
      "pv-mode" : "none, ppgtt, context"
    }
    ...
  ]
}
But as those fields are mostly vendor specific, userspace can only do simple string comparison. I guess the list would be very long, as it needs to enumerate all possible targets. Also, for some fields like "gvt-version", is there a simple way to express things like v2+?
If the userspace can read this interface both in src and target and check whether both src and target are in corresponding compatible list, I think it will work for us.
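As a rough sketch of that userspace check, assuming each side's descriptor has already been parsed into a dictionary with "self" and "compatible" lists shaped like the example above (the function name is hypothetical, and entries are compared by exact equality since the fields are vendor specific):

```python
def mutually_compatible(src_desc, dst_desc):
    """True if each side's "self" entries appear in the other's "compatible" list."""
    src_self = src_desc.get("self", [])
    dst_self = dst_desc.get("self", [])
    # A descriptor that does not describe itself cannot be matched.
    if not src_self or not dst_self:
        return False
    # Entries are dicts, so "in" performs a field-by-field equality test.
    src_ok = all(entry in dst_desc.get("compatible", []) for entry in src_self)
    dst_ok = all(entry in src_desc.get("compatible", []) for entry in dst_self)
    return src_ok and dst_ok
```

Because the comparison is exact, every acceptable combination has to be listed explicitly, which is the list-length concern raised above.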
But still, the kernel should not rely on userspace's choice; the opaque compatibility string is still required in the kernel. No matter whether it is exposed to userspace as a compatibility checking interface, the vendor driver would keep this part of the code and embed the string into the migration stream. So exposing it as an interface for libvirt to do a safety check before a real live migration is only about enabling the kernel part of the check to happen ahead of time.
Thanks Yan
On 2020/7/16 4:32 PM, Yan Zhao wrote:
On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote:
On 2020/7/14 7:29 AM, Yan Zhao wrote:
hi folks, we are defining a device migration compatibility interface that helps upper layer stack like openstack/ovirt/libvirt to check if two devices are live migration compatible. The "devices" here could be MDEVs, physical devices, or hybrid of the two. e.g. we could use it to check whether
- a src MDEV can migrate to a target MDEV,
- a src VF in SRIOV can migrate to a target VF in SRIOV,
- a src MDEV can migrate to a target VF in SRIOV. (e.g. SIOV/SRIOV backward compatibility case)
The upper layer stack could use this interface as the last step to check if one device is able to migrate to another device before triggering a real live migration procedure. we are not sure if this interface is of value or help to you. please don't hesitate to drop your valuable comments.
(1) interface definition The interface is defined in below way:
             __
userspace
             /\              \
            /                 \write
           / read              \
  ________/__________       ___\|/_____________
 | migration_version |     | migration_version |-->check migration
 ---------------------     ---------------------   compatibility
       device A                  device B
a device attribute named migration_version is defined under each device's sysfs node. e.g. (/sys/bus/pci/devices/0000:00:02.0/$mdev_UUID/migration_version).
Are you aware of the devlink-based device management interface that is proposed upstream? I think it has many advantages over sysfs. Have you considered switching to that?
not familiar with the devlink. will do some research of it.
userspace tools read the migration_version as a string from the source device, and write it to the migration_version sysfs attribute in the target device.
The userspace should treat ANY of below conditions as two devices not compatible:
- any one of the two devices does not have a migration_version attribute
- error when reading from migration_version attribute of one device
- error when writing migration_version string of one device to migration_version attribute of the other device
The string read from migration_version attribute is defined by device vendor driver and is completely opaque to the userspace.
My understanding is that something opaque to userspace is not the philosophy of Linux.
but the VFIO live migration in itself is essentially a big opaque stream to userspace.
I think it's better not to limit the kernel interface to a specific use case. This is basically device introspection.
Instead of having a generic API but opaque value, why not do in a vendor specific way like:
1) exposing the device capability in a vendor specific way via sysfs/devlink or other API
2) management read capability in both src and dst and determine whether we can do the migration
This is the way we plan to do with vDPA.
yes, in another reply, Alex proposed to use an interface in json format. I guess we can define something like
{
  "self" : [
    {
      "pciid" : "8086591d",
      "driver" : "i915",
      "gvt-version" : "v1",
      "mdev_type" : "i915-GVTg_V5_2",
      "aggregator" : "1",
      "pv-mode" : "none"
    }
  ],
  "compatible" : [
    {
      "pciid" : "8086591d",
      "driver" : "i915",
      "gvt-version" : "v1",
      "mdev_type" : "i915-GVTg_V5_2",
      "aggregator" : "1",
      "pv-mode" : "none"
    },
    {
      "pciid" : "8086591d",
      "driver" : "i915",
      "gvt-version" : "v1",
      "mdev_type" : "i915-GVTg_V5_4",
      "aggregator" : "2",
      "pv-mode" : "none"
    },
    {
      "pciid" : "8086591d",
      "driver" : "i915",
      "gvt-version" : "v2",
      "mdev_type" : "i915-GVTg_V5_4",
      "aggregator" : "2",
      "pv-mode" : "none, ppgtt, context"
    }
    ...
  ]
}
This is probably another call for a devlink-based interface.
But as those fields are mostly vendor specific, userspace can only do simple string comparison. I guess the list would be very long, as it needs to enumerate all possible targets. Also, for some fields like "gvt-version", is there a simple way to express things like v2+?
That's totally vendor specific, I think. If "v2+" means it only supports version 2 and later, we can introduce fields like min_version and max_version. But again, the point is to keep such interfaces vendor specific instead of trying to have a generic format.
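A min_version/max_version pair could be checked along these lines. This is a sketch only: the "vN" string format follows the "gvt-version" values in the example above, the function names are hypothetical, and an absent max_version is assumed to mean "no upper bound", which is how "v2+" would be expressed.

```python
def parse_version(text):
    """Turn a string such as "v2" into the integer 2."""
    return int(text.lstrip("v"))


def version_in_range(version, min_version, max_version=None):
    """Check min_version <= version <= max_version (no upper bound if None)."""
    value = parse_version(version)
    if value < parse_version(min_version):
        return False
    if max_version is not None and value > parse_version(max_version):
        return False
    return True
```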
If the userspace can read this interface both in src and target and check whether both src and target are in corresponding compatible list, I think it will work for us.
But still, the kernel should not rely on userspace's choice; the opaque compatibility string is still required in the kernel. No matter whether it would be exposed to userspace as a compatibility checking interface, the vendor driver would keep this part of the code and embed the string into the migration stream.
Why? Can we simply do:
1) Src supports features A, B, C (version 1.0)
2) Dst supports features A, B, C, D (version 2.0)
3) only enable features A, B, C in the destination in a version specific way (set version to 1.0)
4) migrate metadata A, B, C
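The four steps above can be sketched as a small selection routine. Everything here is illustrative: the function name is hypothetical, and the destination is assumed to publish a table mapping each version it can run as to the feature set available in that mode.

```python
def select_destination_mode(src_version, src_features, dst_modes):
    """Pick the destination mode that matches the source, or None.

    dst_modes maps a version string to the set of features the destination
    offers when running as that version (a hypothetical capability table).
    """
    mode_features = dst_modes.get(src_version)
    if mode_features is None:
        # The destination cannot run as the source's version.
        return None
    if not set(src_features) <= set(mode_features):
        # The source uses a feature the destination lacks in this mode.
        return None
    # Run the destination as src_version and migrate only these features.
    return src_version, sorted(src_features)
```

With the capabilities exposed like this, the decision is made from the published tables alone, which is the point being argued against a separate compatibility string.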
so exposing it as an interface to be used by libvirt to do a safety check before a real live migration is only about enabling the kernel part of check to happen ahead.
If we've already exposed the capability, there's no need for an extra check like compatibility string.
Thanks
Thanks Yan
On Thu, 16 Jul 2020 16:32:30 +0800 Yan Zhao yan.y.zhao@intel.com wrote:
On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote:
On 2020/7/14 7:29 AM, Yan Zhao wrote:
hi folks, we are defining a device migration compatibility interface that helps upper layer stack like openstack/ovirt/libvirt to check if two devices are live migration compatible. The "devices" here could be MDEVs, physical devices, or hybrid of the two. e.g. we could use it to check whether
- a src MDEV can migrate to a target MDEV,
- a src VF in SRIOV can migrate to a target VF in SRIOV,
- a src MDEV can migrate to a target VF in SRIOV. (e.g. SIOV/SRIOV backward compatibility case)
The upper layer stack could use this interface as the last step to check if one device is able to migrate to another device before triggering a real live migration procedure. we are not sure if this interface is of value or help to you. please don't hesitate to drop your valuable comments.
(1) interface definition The interface is defined in below way:
  __    userspace
   /\              \
  /                 \write
 / read              \
________/__________       ___\|/_____________
| migration_version |     | migration_version |-->check migration
---------------------     ---------------------   compatibility
   device A                    device B
a device attribute named migration_version is defined under each device's sysfs node. e.g. (/sys/bus/pci/devices/0000:00:02.0/$mdev_UUID/migration_version).
Are you aware of the devlink based device management interface that is proposed upstream? I think it has many advantages over sysfs, do you consider to switch to that?
Advantages, such as?
not familiar with the devlink. will do some research of it.
userspace tools read the migration_version as a string from the source device, and write it to the migration_version sysfs attribute in the target device.
The userspace should treat ANY of below conditions as two devices not compatible:
- any one of the two devices does not have a migration_version attribute
- error when reading from migration_version attribute of one device
- error when writing migration_version string of one device to migration_version attribute of the other device
The string read from migration_version attribute is defined by device vendor driver and is completely opaque to the userspace.
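A minimal sketch of the read-and-test flow described above, written from the point of view of a userspace tool. Only the attribute name migration_version comes from the proposal; the function name is_migration_compatible and the two directory arguments are hypothetical, and the string is passed through without being interpreted.

```python
import os


def is_migration_compatible(src_dev_dir: str, dst_dev_dir: str) -> bool:
    """Return True only if the target accepts the source's version string.

    Hypothetical helper: both arguments are device directories that are
    expected to contain a migration_version attribute.
    """
    src_attr = os.path.join(src_dev_dir, "migration_version")
    dst_attr = os.path.join(dst_dev_dir, "migration_version")

    # Condition 1: either device lacks the attribute -> not compatible.
    if not (os.path.exists(src_attr) and os.path.exists(dst_attr)):
        return False

    try:
        # Condition 2: a read error on the source -> not compatible.
        with open(src_attr, "rb") as f:
            version = f.read()
        # Condition 3: a write error on the target -> not compatible.
        # Unbuffered, so a rejected write raises here rather than at close.
        with open(dst_attr, "wb", buffering=0) as f:
            f.write(version)
    except OSError:
        return False
    return True
```

The string is treated as opaque bytes on purpose: the caller never parses it, which is the property being debated in the replies.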
My understanding is that something opaque to userspace is not the philosophy
but the VFIO live migration in itself is essentially a big opaque stream to userspace.
of Linux. Instead of having a generic API but opaque value, why not do in a vendor specific way like:
1) exposing the device capability in a vendor specific way via sysfs/devlink or other API
2) management reads the capability in both src and dst and determines whether we can do the migration
This is the way we plan to do with vDPA.
yes, in another reply, Alex proposed to use an interface in json format. I guess we can define something like
{
  "self" : [
    { "pciid" : "8086591d", "driver" : "i915", "gvt-version" : "v1",
      "mdev_type" : "i915-GVTg_V5_2", "aggregator" : "1", "pv-mode" : "none" }
  ],
  "compatible" : [
    { "pciid" : "8086591d", "driver" : "i915", "gvt-version" : "v1",
      "mdev_type" : "i915-GVTg_V5_2", "aggregator" : "1", "pv-mode" : "none" },
    { "pciid" : "8086591d", "driver" : "i915", "gvt-version" : "v1",
      "mdev_type" : "i915-GVTg_V5_4", "aggregator" : "2", "pv-mode" : "none" },
    { "pciid" : "8086591d", "driver" : "i915", "gvt-version" : "v2",
      "mdev_type" : "i915-GVTg_V5_4", "aggregator" : "2", "pv-mode" : "none, ppgtt, context" }
    ...
  ]
}
But as those fields are mostly vendor specific, the userspace can only do simple string comparing, I guess the list would be very long as it needs to enumerate all possible targets.
This ignores so much of what I tried to achieve in my example :(
also, in some fields like "gvt-version", is there a simple way to express things like v2+?
That's not a reasonable thing to express anyway, how can you be certain that v3 won't break compatibility with v2? Sean proposed a versioning scheme that accounts for this, using an x.y.z version expressing the major, minor, and bugfix versions, where there is no compatibility across major versions, minor versions have forward compatibility (ex. 1 -> 2 is ok, 2 -> 1 is not) and bugfix version number indicates some degree of internal improvement that is not visible to the user in terms of features or compatibility, but provides a basis for preferring equally compatible candidates.
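The x.y.z rule described above can be sketched as follows. This is only an illustration of the stated rule (no compatibility across major versions, minor versions forward compatible, bugfix used for preference); the function names are hypothetical, and ranking candidates by the full (major, minor, bugfix) tuple is an assumption, since the thread only says the bugfix number gives a basis for preferring equally compatible candidates.

```python
def parse_version(text):
    """Split an 'x.y.z' string into a (major, minor, bugfix) tuple of ints."""
    major, minor, bugfix = (int(part) for part in text.split("."))
    return major, minor, bugfix


def can_migrate(src, dst):
    """Same major version, and the target minor is not older than the source."""
    s_major, s_minor, _ = parse_version(src)
    d_major, d_minor, _ = parse_version(dst)
    return s_major == d_major and s_minor <= d_minor


def pick_target(src, candidates):
    """Among compatible candidates, prefer the highest version (assumption)."""
    compatible = [c for c in candidates if can_migrate(src, c)]
    return max(compatible, key=parse_version, default=None)
```

With this rule, a 1.1.z source may go to a 1.2.z target but not the other way round, and nothing crosses from 1.y.z to 2.y.z.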
If the userspace can read this interface both in src and target and check whether both src and target are in corresponding compatible list, I think it will work for us.
But still, the kernel should not rely on userspace's choice; the opaque compatibility string is still required in the kernel. No matter whether it is exposed to userspace as a compatibility checking interface, the vendor driver would keep this part of the code and embed the string into the migration stream. So exposing it as an interface to be used by libvirt for a safety check before a real live migration is only about enabling the kernel part of the check to happen ahead of time.
As you indicate, the vendor driver is responsible for checking version information embedded within the migration stream. Therefore a migration should fail early if the devices are incompatible. Is it really libvirt's place to second guess what it has been directed to do? Why would we even proceed to design a user parse-able version interface if we still have a dependency on an opaque interface? Thanks,
Alex
On 2020/7/18 12:12 AM, Alex Williamson wrote:
On Thu, 16 Jul 2020 16:32:30 +0800 Yan Zhao yan.y.zhao@intel.com wrote:
On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote:
On 2020/7/14 7:29 AM, Yan Zhao wrote:
Are you aware of the devlink based device management interface that is proposed upstream? I think it has many advantages over sysfs, do you consider to switch to that?
Advantages, such as?
My understanding of the advantages of devlink (netlink) over sysfs (some were mentioned at the time of the vDPA sysfs mgmt API discussion) is:
- existing users (NIC, crypto, SCSI, ib), mature and stable
- much better error reporting (ext_ack other than string or errno)
- namespace aware
- do not couple with kobject
Thanks
On Mon, 2020-07-20 at 11:41 +0800, Jason Wang wrote:
On 2020/7/18 12:12 AM, Alex Williamson wrote:
On Thu, 16 Jul 2020 16:32:30 +0800 Yan Zhao yan.y.zhao@intel.com wrote:
On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote:
On 2020/7/14 7:29 AM, Yan Zhao wrote:
Are you aware of the devlink based device management interface that is proposed upstream? I think it has many advantages over sysfs, do you consider to switch to that?
Advantages, such as?
My understanding for devlink(netlink) over sysfs (some are mentioned at the time of vDPA sysfs mgmt API discussion) are:
I thought netlink was used more as a configuration protocol to query and configure NICs, and I guess other devices in its devlink form, requiring a tool to be written that can speak the protocol to interact with it. The primary advantage of sysfs is that everything is just a file. There are no additional dependencies needed, and unlike netlink there are no interoperability issues in a containerised environment. If you are using a different version of libc and gcc in the container vs the host, my understanding is that tools like ethtool from Ubuntu deployed in a container on a CentOS host can have issues communicating with the host kernel. If it's just a file, then unless the format of the returned data changes or the layout of sysfs changes, it's compatible regardless of what you use to read it.
- existing users (NIC, crypto, SCSI, ib), mature and stable
- much better error reporting (ext_ack other than string or errno)
- namespace aware
- do not couple with kobject
Thanks
On 2020/7/20 6:39 PM, Sean Mooney wrote:
On Mon, 2020-07-20 at 11:41 +0800, Jason Wang wrote:
On 2020/7/18 12:12 AM, Alex Williamson wrote:
On Thu, 16 Jul 2020 16:32:30 +0800 Yan Zhao yan.y.zhao@intel.com wrote:
On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote:
On 2020/7/14 7:29 AM, Yan Zhao wrote:
Are you aware of the devlink based device management interface that is proposed upstream? I think it has many advantages over sysfs, do you consider to switch to that?
Advantages, such as?
My understanding for devlink(netlink) over sysfs (some are mentioned at the time of vDPA sysfs mgmt API discussion) are:
I thought netlink was used more as a configuration protocol to query and configure NICs, and I guess other devices in its devlink form, requiring a tool to be written that can speak the protocol to interact with it. The primary advantage of sysfs is that everything is just a file. There are no additional dependencies needed
Well, if you try to build logic like introspection on top for sophisticated hardware, you probably need to have a library on top. And its one-attribute-per-file model is pretty inefficient.
and unlike netlink there are no interoperability issues in a containerised environment. If you are using a different version of libc and gcc in the container vs the host, my understanding is that tools like ethtool from Ubuntu deployed in a container on a CentOS host can have issues communicating with the host kernel.
The kernel provides a stable ABI for userspace, so it's not something that we can't fix.
If it's just a file, then unless the format of the returned data changes or the layout of sysfs changes, it's compatible regardless of what you use to read it.
I believe you can't change the sysfs layout, which is part of the uABI. But as I mentioned below, sysfs has several drawbacks. There is no harm in comparing different approaches when you start a new device management API.
Thanks
- existing users (NIC, crypto, SCSI, ib), mature and stable
- much better error reporting (ext_ack other than string or errno)
- namespace aware
- do not couple with kobject
Thanks
On Fri, Jul 17, 2020 at 10:12:58AM -0600, Alex Williamson wrote: <...>
yes, in another reply, Alex proposed to use an interface in json format. I guess we can define something like
{
  "self" : [
    { "pciid" : "8086591d", "driver" : "i915", "gvt-version" : "v1",
      "mdev_type" : "i915-GVTg_V5_2", "aggregator" : "1", "pv-mode" : "none" }
  ],
  "compatible" : [
    { "pciid" : "8086591d", "driver" : "i915", "gvt-version" : "v1",
      "mdev_type" : "i915-GVTg_V5_2", "aggregator" : "1", "pv-mode" : "none" },
    { "pciid" : "8086591d", "driver" : "i915", "gvt-version" : "v1",
      "mdev_type" : "i915-GVTg_V5_4", "aggregator" : "2", "pv-mode" : "none" },
    { "pciid" : "8086591d", "driver" : "i915", "gvt-version" : "v2",
      "mdev_type" : "i915-GVTg_V5_4", "aggregator" : "2", "pv-mode" : "none, ppgtt, context" }
    ...
  ]
}
But as those fields are mostly vendor specific, the userspace can only do simple string comparing, I guess the list would be very long as it needs to enumerate all possible targets.
This ignores so much of what I tried to achieve in my example :(
sorry, I just was eager to show and confirm the way to list all compatible combination of mdev_type and mdev attributes.
also, in some fields like "gvt-version", is there a simple way to express things like v2+?
That's not a reasonable thing to express anyway, how can you be certain that v3 won't break compatibility with v2? Sean proposed a versioning scheme that accounts for this, using an x.y.z version expressing the major, minor, and bugfix versions, where there is no compatibility across major versions, minor versions have forward compatibility (ex. 1 -> 2 is ok, 2 -> 1 is not) and bugfix version number indicates some degree of internal improvement that is not visible to the user in terms of features or compatibility, but provides a basis for preferring equally compatible candidates.
right. if self version is v1, it can't know its compatible version is v2. it can only be done in reverse. i.e. when self version is v2, it can list its compatible version is v1 and v2. and maybe later when self version is v3, there's no v1 in its compatible list.
In this way, do you think we still need the complex x.y.z versioning scheme?
If the userspace can read this interface both in src and target and check whether both src and target are in corresponding compatible list, I think it will work for us.
But still, the kernel should not rely on userspace's choice; the opaque compatibility string is still required in the kernel. No matter whether it is exposed to userspace as a compatibility checking interface, the vendor driver would keep this part of the code and embed the string into the migration stream. So exposing it as an interface to be used by libvirt for a safety check before a real live migration is only about enabling the kernel part of the check to happen ahead of time.
As you indicate, the vendor driver is responsible for checking version information embedded within the migration stream. Therefore a migration should fail early if the devices are incompatible. Is it
But as far as I know, in the current VFIO migration protocol we have no way to get the vendor specific compatibility checking string in the migration setup stage (i.e. the .save_setup stage) before the device is set to _SAVING state. As a result, for devices that do not save device data in the precopy stage, the migration compatibility check happens as late as the stop-and-copy stage, which is too late. Do you think we need to add getting/checking of the vendor specific compatibility string early in the save_setup stage?
really libvirt's place to second guess what it has been directed to do?
if libvirt uses the scheme of reading compatibility string at source and writing for checking at the target, it can not be called "a second guess". It's not a guess, but a confirmation.
Why would we even proceed to design a user parse-able version interface if we still have a dependency on an opaque interface? Thanks,
One reason is that libvirt can't trust the parsing result from openstack. Another reason is that it is easier for libvirt to use this opaque interface than to do another round of parsing by itself, and it would not put more burden on the kernel, which has to implement this part of the code anyway, whether libvirt uses it or not.
Thanks Yan
As you indicate, the vendor driver is responsible for checking version information embedded within the migration stream. Therefore a migration should fail early if the devices are incompatible. Is it
but as I know, currently in VFIO migration protocol, we have no way to get vendor specific compatibility checking string in migration setup stage (i.e. .save_setup stage) before the device is set to _SAVING state. In this way, for devices who does not save device data in precopy stage, the migration compatibility checking is as late as in stop-and-copy stage, which is too late. do you think we need to add the getting/checking of vendor specific compatibility string early in save_setup stage?
hi Alex, after an offline discussion with Kevin, I realized that it may not be a problem if the migration compatibility check in the vendor driver occurs as late as the stop-and-copy phase for some devices, because if we report device compatibility attributes clearly in an interface, the chance of libvirt/openstack making a wrong decision is small. So, do you think we are now arriving at an agreement that we'll give up the read-and-test scheme and start defining one interface (perhaps in json format), from which libvirt/openstack is able to parse and find out the compatibility list of a source mdev/physical device?
Thanks Yan
On Mon, 27 Jul 2020 15:24:40 +0800 Yan Zhao yan.y.zhao@intel.com wrote:
As you indicate, the vendor driver is responsible for checking version information embedded within the migration stream. Therefore a migration should fail early if the devices are incompatible. Is it
but as I know, currently in VFIO migration protocol, we have no way to get vendor specific compatibility checking string in migration setup stage (i.e. .save_setup stage) before the device is set to _SAVING state. In this way, for devices who does not save device data in precopy stage, the migration compatibility checking is as late as in stop-and-copy stage, which is too late. do you think we need to add the getting/checking of vendor specific compatibility string early in save_setup stage?
hi Alex, after an offline discussion with Kevin, I realized that it may not be a problem if migration compatibility check in vendor driver occurs late in stop-and-copy phase for some devices, because if we report device compatibility attributes clearly in an interface, the chances for libvirt/openstack to make a wrong decision is little.
I think it would be wise for a vendor driver to implement a pre-copy phase, even if only to send version information and verify it at the target. Deciding you have no device state to send during pre-copy does not mean your vendor driver needs to opt-out of the pre-copy phase entirely. Please also note that pre-copy is at the user's discretion, we've defined that we can enter stop-and-copy at any point, including without a pre-copy phase, so I would recommend that vendor drivers validate compatibility at the start of both the pre-copy and the stop-and-copy phases.
so, do you think we are now arriving at an agreement that we'll give up the read-and-test scheme and start to defining one interface (perhaps in json format), from which libvirt/openstack is able to parse and find out compatibility list of a source mdev/physical device?
Based on the feedback we've received, the previously proposed interface is not viable. I think there's agreement that the user needs to be able to parse and interpret the version information. Using json seems viable, but I don't know if it's the best option. Is there any precedent of markup strings returned via sysfs we could follow?
Your idea of having both a "self" object and an array of "compatible" objects is perhaps something we can build on, but we must not assume PCI devices at the root level of the object. Providing both the mdev-type and the driver is a bit redundant, since the former includes the latter. We can't have vendor specific versioning schemes though, ie. gvt-version. We need to agree on a common scheme and decide which fields the version is relative to, ex. just the mdev type?
I had also proposed fields that provide information to create a compatible type, for example to create a type_x2 device from a type_x1 mdev type, they need to know to apply an aggregation attribute. If we need to explicitly list every aggregation value and the resulting type, I think we run aground of what aggregation was trying to avoid anyway, so we might need to pick a language that defines variable substitution or some kind of tagging. For example if we could define ${aggr} as an integer within a specified range, then we might be able to define a type relative to that value (type_x${aggr}) which requires an aggregation attribute using the same value. I dunno, just spit balling. Thanks,
Alex
On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
On Mon, 27 Jul 2020 15:24:40 +0800 Yan Zhao yan.y.zhao@intel.com wrote:
As you indicate, the vendor driver is responsible for checking version information embedded within the migration stream. Therefore a migration should fail early if the devices are incompatible. Is it
but as I know, currently in VFIO migration protocol, we have no way to get vendor specific compatibility checking string in migration setup stage (i.e. .save_setup stage) before the device is set to _SAVING state. In this way, for devices who does not save device data in precopy stage, the migration compatibility checking is as late as in stop-and-copy stage, which is too late. do you think we need to add the getting/checking of vendor specific compatibility string early in save_setup stage?
hi Alex, after an offline discussion with Kevin, I realized that it may not be a problem if migration compatibility check in vendor driver occurs late in stop-and-copy phase for some devices, because if we report device compatibility attributes clearly in an interface, the chances for libvirt/openstack to make a wrong decision is little.
I think it would be wise for a vendor driver to implement a pre-copy phase, even if only to send version information and verify it at the target. Deciding you have no device state to send during pre-copy does not mean your vendor driver needs to opt-out of the pre-copy phase entirely. Please also note that pre-copy is at the user's discretion, we've defined that we can enter stop-and-copy at any point, including without a pre-copy phase, so I would recommend that vendor drivers validate compatibility at the start of both the pre-copy and the stop-and-copy phases.
ok. got it!
so, do you think we are now arriving at an agreement that we'll give up the read-and-test scheme and start to defining one interface (perhaps in json format), from which libvirt/openstack is able to parse and find out compatibility list of a source mdev/physical device?
Based on the feedback we've received, the previously proposed interface is not viable. I think there's agreement that the user needs to be able to parse and interpret the version information. Using json seems viable, but I don't know if it's the best option. Is there any precedent of markup strings returned via sysfs we could follow?
I found some examples of using formatted string under /sys, mostly under tracing. maybe we can do a similar implementation.
#cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format
name: kvm_mmio
ID: 32
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:u32 type; offset:8;       size:4; signed:0;
        field:u32 len;  offset:12;      size:4; signed:0;
        field:u64 gpa;  offset:16;      size:8; signed:0;
        field:u64 val;  offset:24;      size:8; signed:0;

print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read" }, { 2, "write" }), REC->len, REC->gpa, REC->val

#cat /sys/devices/pci0000:00/0000:00:02.0/uevent
DRIVER=vfio-pci
PCI_CLASS=30000
PCI_ID=8086:591D
PCI_SUBSYS_ID=8086:2212
PCI_SLOT_NAME=0000:00:02.0
MODALIAS=pci:v00008086d0000591Dsv00008086sd00002212bc03sc00i00
Your idea of having both a "self" object and an array of "compatible" objects is perhaps something we can build on, but we must not assume PCI devices at the root level of the object. Providing both the mdev-type and the driver is a bit redundant, since the former includes the latter. We can't have vendor specific versioning schemes though, ie. gvt-version. We need to agree on a common scheme and decide which fields the version is relative to, ex. just the mdev type?
what about making all comparing fields vendor specific? userspace like openstack only needs to parse and compare if target device is within source compatible list without understanding the meaning of each field.
I had also proposed fields that provide information to create a compatible type, for example to create a type_x2 device from a type_x1 mdev type, they need to know to apply an aggregation attribute. If we need to explicitly list every aggregation value and the resulting type, I think we run aground of what aggregation was trying to avoid anyway, so we might need to pick a language that defines variable substitution or some kind of tagging. For example if we could define ${aggr} as an integer within a specified range, then we might be able to define a type relative to that value (type_x${aggr}) which requires an aggregation attribute using the same value. I dunno, just spit balling. Thanks,
what about a migration_compatible attribute under device node like below?
#cat /sys/bus/pci/devices/0000:00:02.0/UUID1/migration_compatible
SELF:
        device_type=pci
        device_id=8086591d
        mdev_type=i915-GVTg_V5_2
        aggregator=1
        pv_mode="none+ppgtt+context"
        interface_version=3
COMPATIBLE:
        device_type=pci
        device_id=8086591d
        mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
        aggregator={val1}/2
        pv_mode={val2:string:"none+ppgtt","none+context","none+ppgtt+context"}
        interface_version={val3:int:2,3}
COMPATIBLE:
        device_type=pci
        device_id=8086591d
        mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
        aggregator={val1}/2
        pv_mode=""  #"" meaning empty, could be absent in a compatible device
        interface_version=1

#cat /sys/bus/pci/devices/0000:00:02.0/UUID2/migration_compatible
SELF:
        device_type=pci
        device_id=8086591d
        mdev_type=i915-GVTg_V5_4
        aggregator=2
        interface_version=1
COMPATIBLE:
        device_type=pci
        device_id=8086591d
        mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
        aggregator={val1}/2
        interface_version=1

Notes:
- A COMPATIBLE object is a line starting with COMPATIBLE. It specifies a list of compatible devices that are allowed to migrate in. The reason to allow multiple COMPATIBLE objects is that when it is hard to express complex compatibility logic in one COMPATIBLE object, a simple enumeration is still available as a fallback. In the above example, device UUID2 is in the compatible list of device UUID1, but device UUID1 is not in the compatible list of device UUID2, so device UUID2 is able to migrate to device UUID1, but device UUID1 is not able to migrate to device UUID2.
- fields under each object are in an "and" relationship to each other, meaning all fields of the SELF object of a target device must be equal to the corresponding fields of a COMPATIBLE object of the source device, otherwise it is regarded as not compatible.
- each field, however, is able to specify multiple allowed values, using variables as explained below.
- variables are represented with {}. The first appearance of a variable specifies its type and allowed list. e.g. {val1:int:1,2,4,8} represents val1, whose type is integer and whose allowed values are 1, 2, 4, 8.
- vendors are able to specify which fields are within the comparing list and which fields are not. e.g. for physical VF migration, a vendor may not choose mdev_type as a comparing field, and may use the driver name instead.
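To make the variable notation in the notes above concrete, here is a rough sketch of how userspace might test whether one device's SELF fields fall inside a COMPATIBLE object. It is not part of the proposal: the function names are hypothetical, each object is assumed to be already parsed into a dict of strings, "{val1}/2" is assumed to mean integer division, and an empty value is assumed to match an absent field (following the pv_mode="" comment in the example).

```python
import itertools
import re

DECL = re.compile(r"\{(\w+):(?:int|string):([^}]*)\}")  # {val1:int:1,2,4,8}
REF = re.compile(r"\{(\w+)\}")                           # {val1}
DIV = re.compile(r"^(\d+)/(\d+)$")                       # e.g. "4/2"


def expand(fields):
    """Yield every concrete field dict that one COMPATIBLE object stands for."""
    decls = {}
    for value in fields.values():
        for name, allowed in DECL.findall(value):
            decls[name] = [v.strip().strip('"') for v in allowed.split(",")]
    names = list(decls)
    # Enumerate every combination of allowed variable values.
    for combo in itertools.product(*(decls[n] for n in names)):
        env = dict(zip(names, combo))
        concrete = {}
        for key, value in fields.items():
            value = DECL.sub(lambda m: env[m.group(1)], value)
            value = REF.sub(lambda m: env[m.group(1)], value)
            div = DIV.match(value)
            if div:  # assumption: "a/b" is integer division
                value = str(int(div.group(1)) // int(div.group(2)))
            concrete[key] = value.strip('"')
        yield concrete


def same_fields(a, b):
    """All fields must match; a missing field is treated as an empty value."""
    return all(a.get(k, "") == b.get(k, "") for k in a.keys() | b.keys())


def in_compatible_list(self_fields, compatible_objects):
    """True if self_fields matches any expansion of any COMPATIBLE object."""
    return any(
        same_fields(self_fields, concrete)
        for obj in compatible_objects
        for concrete in expand(obj)
    )
```

Brute-force enumeration keeps the sketch short; a real parser would more likely match fields against the pattern directly instead of expanding every combination.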
Thanks Yan
On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote:
On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
On Mon, 27 Jul 2020 15:24:40 +0800 Yan Zhao yan.y.zhao@intel.com wrote:
Based on the feedback we've received, the previously proposed interface is not viable. I think there's agreement that the user needs to be able to parse and interpret the version information. Using json seems viable, but I don't know if it's the best option. Is there any precedent of markup strings returned via sysfs we could follow?
I found some examples of using formatted string under /sys, mostly under tracing. maybe we can do a similar implementation.
#cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format
name: kvm_mmio
ID: 32
format:
    field:unsigned short common_type; offset:0; size:2; signed:0;
    field:unsigned char common_flags; offset:2; size:1; signed:0;
    field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
    field:int common_pid; offset:4; size:4; signed:1;

    field:u32 type; offset:8; size:4; signed:0;
    field:u32 len; offset:12; size:4; signed:0;
    field:u64 gpa; offset:16; size:8; signed:0;
    field:u64 val; offset:24; size:8; signed:0;

print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read" }, { 2, "write" }), REC->len, REC->gpa, REC->val
this is not json format and it's not super friendly to parse.
#cat /sys/devices/pci0000:00/0000:00:02.0/uevent
DRIVER=vfio-pci
PCI_CLASS=30000
PCI_ID=8086:591D
PCI_SUBSYS_ID=8086:2212
PCI_SLOT_NAME=0000:00:02.0
MODALIAS=pci:v00008086d0000591Dsv00008086sd00002212bc03sc00i00
this is ini format or conf format. this is pretty simple to parse, which would be fine. that said, you could also have a version or capability directory with a file for each key and a single value.
personally I would prefer to only have to do one read rather than list the files in a directory and then read them all to build the data structure myself, but that is doable. the simple ini format used for uevent seems the best of the 3 options provided above.
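To make the "one read, simple to parse" point concrete, here is a minimal Python sketch of consuming a uevent-style KEY=VALUE attribute. The function name parse_key_value is made up for illustration, and the sample text is just the uevent output quoted above; nothing here is a proposed interface.

```python
def parse_key_value(text):
    """Parse uevent-style KEY=VALUE lines into a dict (one pair per line)."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=VALUE line: {line!r}")
        result[key.strip()] = value.strip()
    return result


# Sample copied from the uevent output quoted in this thread.
sample = """DRIVER=vfio-pci
PCI_CLASS=30000
PCI_ID=8086:591D
PCI_SUBSYS_ID=8086:2212
PCI_SLOT_NAME=0000:00:02.0
MODALIAS=pci:v00008086d0000591Dsv00008086sd00002212bc03sc00i00
"""

info = parse_key_value(sample)
print(info["PCI_ID"])  # 8086:591D
```

A directory with one file per key would give the same dict, at the cost of one read per key instead of a single read.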
Your idea of having both a "self" object and an array of "compatible" objects is perhaps something we can build on, but we must not assume PCI devices at the root level of the object. Providing both the mdev-type and the driver is a bit redundant, since the former includes the latter. We can't have vendor specific versioning schemes though, ie. gvt-version. We need to agree on a common scheme and decide which fields the version is relative to, ex. just the mdev type?
what about making all compared fields vendor specific? userspace like openstack only needs to parse and compare whether the target device is within the source's compatible list, without understanding the meaning of each field.
that kind of defeats the reason for having them be parsable. the reason openstack wants to be able to understand the capabilities is so we can statically declare the capabilities of devices ahead of time, so our scheduler can select hosts based on that. if the keys and data are opaque to userspace because they are just random vendor specific blobs, we can't do that.
I had also proposed fields that provide information to create a compatible type, for example to create a type_x2 device from a type_x1 mdev type, they need to know to apply an aggregation attribute. If we need to explicitly list every aggregation value and the resulting type, I think we run aground of what aggregation was trying to avoid anyway, so we might need to pick a language that defines variable substitution or some kind of tagging. For example if we could define ${aggr} as an integer within a specified range, then we might be able to define a type relative to that value (type_x${aggr}) which requires an aggregation attribute using the same value. I dunno, just spit balling. Thanks,
what about a migration_compatible attribute under device node like below?
rather than listing compatible devices, it would be better if you could declaratively list the features supported, and we could compare those along with a simple semver version string.
#cat /sys/bus/pci/devices/0000:00:02.0/UUID1/migration_compatible
SELF:
    device_type=pci
    device_id=8086591d
    mdev_type=i915-GVTg_V5_2
    aggregator=1
    pv_mode="none+ppgtt+context"
    interface_version=3
COMPATIBLE:
    device_type=pci
    device_id=8086591d
    mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}

this mixed notation will be hard to parse so I would avoid that.

    aggregator={val1}/2
    pv_mode={val2:string:"none+ppgtt","none+context","none+ppgtt+context"}
    interface_version={val3:int:2,3}
COMPATIBLE:
    device_type=pci
    device_id=8086591d
    mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
    aggregator={val1}/2
    pv_mode="" #"" meaning empty, could be absent in a compatible device
    interface_version=1
if you presented this information, the only way I could see to use it would be to extract the mdev_type name and interface_version and build a database table as follows

source_mdev_type | source_version | target_mdev_type                | target_version
i915-GVTg_V5_2   | 3              | i915-GVTg_V5_{val1:int:1,2,4,8} | {val3:int:2,3}
i915-GVTg_V5_2   | 3              | i915-GVTg_V5_{val1:int:1,2,4,8} | 1
this would either require us to use a post-placement scheduler filter to introspect this database, or to transform the target_mdev_type and target_version column data into CUSTOM_* traits that we apply to our placement resource providers, and we would have to perform multiple requests, one for each possible compatible alternative. if the vm has multiple mdevs this is combinatorially problematic, as it is 1 query for each device * the number of possible compatible devices for that device.
in other words, if this is just opaque data we can't ever represent it efficiently in our placement service, and we have to fall back to an expensive post-placement scheduler filter based on the db table approach.
this also ignores the fact that at present the mdev_type cannot change during a migration, so a compatible device with a different mdev type would not be considered an acceptable choice in openstack. the way you select a host with a specific vgpu mdev type today is to apply a custom trait, CUSTOM_<mdev_type_goes_here>, to the vGPU resource provider, and then in the flavor you request 1 allocation of vGPU and require the CUSTOM_<mdev_type_goes_here> trait. so going from i915-GVTg_V5_2 to i915-GVTg_V5_{val1:int:1,2,4,8} would not currently be compatible with that workflow.
#cat /sys/bus/pci/devices/0000:00:02.0/UUID2/migration_compatible
SELF:
    device_type=pci
    device_id=8086591d
    mdev_type=i915-GVTg_V5_4
    aggregator=2
    interface_version=1
COMPATIBLE:
    device_type=pci
    device_id=8086591d
    mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
    aggregator={val1}/2
    interface_version=1
by the way, this is closer to yaml format than it is to json, but it does not align with any existing format I know of, so that just makes the representation needlessly hard to consume. if we are going to use a markup language, let's use a standard one like yaml, json or toml and not invent a new one.
Notes:
A COMPATIBLE object is a line starting with COMPATIBLE. It specifies a list of compatible devices that are allowed to migrate in. The reason to allow multiple COMPATIBLE objects is that when it is hard to express complex compatibility logic in one COMPATIBLE object, a simple enumeration is still available as a fallback. In the above example, device UUID2 is in the compatible list of device UUID1, but device UUID1 is not in the compatible list of device UUID2, so device UUID2 is able to migrate to device UUID1, but device UUID1 is not able to migrate to device UUID2.
fields under each object are in an "and" relationship with each other, meaning all fields of the SELF object of a target device must be equal to the corresponding fields of a COMPATIBLE object of the source device; otherwise the devices are regarded as not compatible.
each field, however, is able to specify multiple allowed values, using variables as explained below.
variables are represented with {}, and the first appearance of a variable specifies its type and allowed list. e.g. {val1:int:1,2,4,8} represents val1, whose type is integer and whose allowed values are 1, 2, 4, 8.
vendors are able to specify which fields are in the comparison list and which are not. e.g. for physical VF migration, a vendor may not choose mdev_type as a compared field, and may use the driver name instead.
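As a rough illustration of these notes, the Python sketch below checks a SELF object against a list of COMPATIBLE objects. It follows the UUID1/UUID2 example, where a device's COMPATIBLE entries describe the devices allowed to migrate in. Several things are assumptions of the sketch, not part of the proposal: the helper names (render, matches, can_migrate), reading "{val1}/2" as an integer division, ignoring the declared variable type and comparing everything as strings, and treating an empty value as "may be absent".

```python
import itertools
import re

# {name:type:v1,v2,...} defines a variable; {name} refers back to it.
VAR_DEF = re.compile(r"\{(\w+):(\w+):([^}]*)\}")
VAR_REF = re.compile(r"\{(\w+)\}")
DIVISION = re.compile(r"(\d+)/(\d+)")


def render(template, binding):
    """Substitute variable values, then evaluate a plain 'a/b' quotient."""
    text = VAR_DEF.sub(lambda m: binding[m.group(1)], template)
    text = VAR_REF.sub(lambda m: binding[m.group(1)], text)
    m = DIVISION.fullmatch(text)
    if m:  # assumption: "{val1}/2" means integer division
        text = str(int(m.group(1)) // int(m.group(2)))
    return text


def matches(self_fields, compatible):
    """True if some variable assignment makes every COMPATIBLE field equal."""
    variables = {}
    for template in compatible.values():
        for name, _type, values in VAR_DEF.findall(template):
            variables[name] = [v.strip().strip('"') for v in values.split(",")]
    names = list(variables)
    for combo in itertools.product(*(variables[n] for n in names)):
        binding = dict(zip(names, combo))
        # All fields are ANDed; a missing SELF field only matches "".
        if all(render(t, binding) == self_fields.get(k, "")
               for k, t in compatible.items()):
            return True
    return False


def can_migrate(source_self, target_compatible):
    """Source may migrate in if it matches any COMPATIBLE object of the target."""
    return any(matches(source_self, c) for c in target_compatible)


common = {"device_type": "pci", "device_id": "8086591d"}
tmpl = {**common, "mdev_type": "i915-GVTg_V5_{val1:int:1,2,4,8}",
        "aggregator": "{val1}/2"}

uuid1_self = {**common, "mdev_type": "i915-GVTg_V5_2", "aggregator": "1",
              "pv_mode": "none+ppgtt+context", "interface_version": "3"}
uuid1_compatible = [
    {**tmpl,
     "pv_mode": '{val2:string:"none+ppgtt","none+context","none+ppgtt+context"}',
     "interface_version": "{val3:int:2,3}"},
    {**tmpl, "pv_mode": "", "interface_version": "1"},
]
uuid2_self = {**common, "mdev_type": "i915-GVTg_V5_4", "aggregator": "2",
              "interface_version": "1"}
uuid2_compatible = [{**tmpl, "interface_version": "1"}]

print(can_migrate(uuid2_self, uuid1_compatible))  # True
print(can_migrate(uuid1_self, uuid2_compatible))  # False
```

Even this toy version has to enumerate every variable assignment, which hints at the consumption cost that the replies in this thread are worried about.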
this format might be useful to vendors, but from an orchestrator perspective I don't think this has value to us. likely we would not use this api if it was added, as it does not help us with scheduling. ideally, instead of declaring which other mdev types a device is compatible with (which could presumably change over time as new devices and firmware are released), I would prefer to see a declarative, non vendor specific api that declares the feature set provided by each mdev_type, from which we can infer compatibility, similar to cpu feature flags. for devices of the same mdev_type name, a declarative version string could additionally be used if required for additional compatibility checks.
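A minimal sketch of what such a feature-flag plus version comparison could look like from the orchestrator side, in Python. The dict layout, the feature names and the exact rule (same mdev_type, target features are a superset, same major, target minor not lower) are assumptions made up for illustration; the thread has not agreed on any of them.

```python
def parse_version(text):
    """Split '<major>.<minor>[.bugfix]' into a 3-tuple of ints."""
    parts = [int(p) for p in text.split(".")]
    if not 2 <= len(parts) <= 3:
        raise ValueError(f"expected <major>.<minor>[.bugfix]: {text!r}")
    if len(parts) == 2:
        parts.append(0)  # a missing bugfix component counts as 0
    return tuple(parts)


def can_migrate(source, target):
    """Hypothetical rule, analogous to comparing cpu feature flags."""
    if source["mdev_type"] != target["mdev_type"]:
        return False
    # Every feature the source exposes must exist on the target.
    if not set(source["features"]) <= set(target["features"]):
        return False
    s_major, s_minor, _ = parse_version(source["version"])
    t_major, t_minor, _ = parse_version(target["version"])
    return s_major == t_major and s_minor <= t_minor


# Hypothetical device descriptions; the feature names are placeholders.
src = {"mdev_type": "i915-GVTg_V5_2", "version": "1.2",
       "features": {"ppgtt"}}
dst = {"mdev_type": "i915-GVTg_V5_2", "version": "1.3.1",
       "features": {"ppgtt", "context"}}

print(can_migrate(src, dst))  # True
print(can_migrate(dst, src))  # False
```

The point of this shape is that each host can publish its own description ahead of time, and the comparison needs no knowledge of which other types exist.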
Thanks Yan
On Wed, 29 Jul 2020 12:28:46 +0100 Sean Mooney smooney@redhat.com wrote:
On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote:
On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
On Mon, 27 Jul 2020 15:24:40 +0800 Yan Zhao yan.y.zhao@intel.com wrote:
As you indicate, the vendor driver is responsible for checking version information embedded within the migration stream. Therefore a migration should fail early if the devices are incompatible. Is it
but as far as I know, currently in the VFIO migration protocol we have no way to get a vendor specific compatibility checking string in the migration setup stage (i.e. the .save_setup stage) before the device is set to the _SAVING state. In this way, for devices that do not save device data in the precopy stage, the migration compatibility check happens as late as the stop-and-copy stage, which is too late. do you think we need to add the getting/checking of the vendor specific compatibility string early in the save_setup stage?
that kind of defeats the reason for having them be parsable. the reason openstack wants to be able to understand the capabilities is so we can statically declare the capabilities of devices ahead of time, so our scheduler can select hosts based on that. if the keys and data are opaque to userspace because they are just random vendor specific blobs, we can't do that.
Agreed, I'm not sure I'm willing to rule out that there could be vendor specific direct match fields, as I included in my example earlier in the thread, but entirely vendor specific defeats much of the purpose here.
I had also proposed fields that provide information to create a compatible type, for example to create a type_x2 device from a type_x1 mdev type, they need to know to apply an aggregation attribute. If we need to explicitly list every aggregation value and the resulting type, I think we run aground of what aggregation was trying to avoid anyway, so we might need to pick a language that defines variable substitution or some kind of tagging. For example if we could define ${aggr} as an integer within a specified range, then we might be able to define a type relative to that value (type_x${aggr}) which requires an aggregation attribute using the same value. I dunno, just spit balling. Thanks,
what about a migration_compatible attribute under device node like below?
rather than listing compatible devices, it would be better if you could declaratively list the features supported, and we could compare those along with a simple semver version string.
#cat /sys/bus/pci/devices/0000:00:02.0/UUID1/migration_compatible
Note that we're defining compatibility relative to a vfio migration interface, so we should include that in the name, we don't know what other migration interfaces might exist.
SELF: device_type=pci
Why not the device_api here, ie. vfio-pci. The device doesn't provide a pci interface directly, it's wrapped in a vfio API.
device_id=8086591d
Is device_id interpreted relative to device_type? How does this relate to mdev_type? If we have an mdev_type, doesn't that fully define the software API?
mdev_type=i915-GVTg_V5_2
And how are non-mdev devices represented?
aggregator=1 pv_mode="none+ppgtt+context"
These are meaningless vendor specific matches afaict.
interface_version=3
Not much granularity here, I prefer Sean's previous <major>.<minor>[.bugfix] scheme.
COMPATIBLE: device_type=pci device_id=8086591d mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
this mixed notation will be hard to parse so I would avoid that.
Some background, Intel has been proposing aggregation as a solution to how we scale mdev devices when hardware exposes large numbers of assignable objects that can be composed in essentially arbitrary ways. So for instance, if we have a workqueue (wq), we might have an mdev type for 1wq, 2wq, 3wq,... Nwq. It's not really practical to expose a discrete mdev type for each of those, so they want to define a base type which is composable to other types via this aggregation. This is what this substitution and tagging is attempting to accomplish. So imagine this set of values for cases where it's not practical to unroll the values for N discrete types.
aggregator={val1}/2
So the {val1} above would be substituted here, though an aggregation factor of 1/2 is a head scratcher...
pv_mode={val2:string:"none+ppgtt","none+context","none+ppgtt+context"}
I'm lost on this one though. I think maybe it's indicating that it's compatible with any of these, so do we need to list it? Couldn't this be handled by Sean's version proposal where the minor version represents feature compatibility?
interface_version={val3:int:2,3}
What does this turn into in a few years, 2,7,12,23,75,96,...
COMPATIBLE:
    device_type=pci
    device_id=8086591d
    mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
    aggregator={val1}/2
    pv_mode="" #"" meaning empty, could be absent in a compatible device
    interface_version=1
Why can't this be represented within the previous compatible description?
this also ignores the fact that at present the mdev_type cannot change during a migration, so a compatible device with a different mdev type would not be considered an acceptable choice in openstack. the way you select a host with a specific vgpu mdev type today is to apply a custom trait, CUSTOM_<mdev_type_goes_here>, to the vGPU resource provider, and then in the flavor you request 1 allocation of vGPU and require the CUSTOM_<mdev_type_goes_here> trait. so going from i915-GVTg_V5_2 to i915-GVTg_V5_{val1:int:1,2,4,8} would not currently be compatible with that workflow.
The latter would need to be parsed into:
i915-GVTg_V5_1
i915-GVTg_V5_2
i915-GVTg_V5_4
i915-GVTg_V5_8
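For reference, that unrolling is mechanical. A small Python sketch, assuming the {name:type:v1,v2,...} syntax from the proposal and a made-up helper name expand; it only handles variable definitions, not back-references such as {val1}:

```python
import re

# {name:type:v1,v2,...} as used in the proposed COMPATIBLE fields.
VAR_DEF = re.compile(r"\{(\w+):(\w+):([^}]*)\}")


def expand(template):
    """Unroll every variable definition into the concrete strings it allows."""
    m = VAR_DEF.search(template)
    if m is None:
        return [template]  # nothing left to substitute
    _name, _type, values = m.groups()
    results = []
    for value in values.split(","):
        substituted = template[:m.start()] + value.strip() + template[m.end():]
        results.extend(expand(substituted))  # handle any further variables
    return results


print(expand("i915-GVTg_V5_{val1:int:1,2,4,8}"))
# ['i915-GVTg_V5_1', 'i915-GVTg_V5_2', 'i915-GVTg_V5_4', 'i915-GVTg_V5_8']
```

With several variables in one field the result is the cross product of their allowed values, which is where the unrolled list stops being practical.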
There is also on the table, migration from physical devices to mdev devices (or vice versa), which is not represented in these examples, nor do I see how we'd represent it. This is where I started exposing the resulting PCI device from the mdev in my example so we could have some commonality between devices, but the migration stream provider is just as important as the type of device, we could have different host drivers providing the same device with incompatible migration streams. The mdev_type encompasses both the driver and device, but we wouldn't have mdev_types for physical devices, per our current thinking.
this format might be useful to vendors, but from an orchestrator perspective I don't think this has value to us. likely we would not use this api if it was added, as it does not help us with scheduling. ideally, instead of declaring which other mdev types a device is compatible with (which could presumably change over time as new devices and firmware are released), I would prefer to see a declarative, non vendor specific api that declares the feature set provided by each mdev_type, from which we can infer compatibility, similar to cpu feature flags. for devices of the same mdev_type name, a declarative version string could additionally be used if required for additional compatibility checks.
"non vendor specific api that declares the feature set", aren't features generally vendor specific? What we're trying to describe is, by it's very nature, vendor specific. We don't have an ISO body defining a graphics adapter and enumerating features for that adapter. I think what we have is mdev_types. Each type is supposed to define a specific software interface, perhaps even more so than is done by a PCI vendor:device ID. Maybe that mdev_type needs to be abstracted as something more like a vendor signature, such that a physical device could provide or accept a vendor signature that's compatible with an mdev device. For example, a physically assigned Intel GPU might expose a migration signature of i915-GVTg_v5_8 if it were designed to be compatible with that mdev_type. Thanks,
Alex
On Wed, Jul 29, 2020 at 01:12:55PM -0600, Alex Williamson wrote:
#cat /sys/bus/pci/devices/0000:00:02.0/UUID1/migration_compatible
Note that we're defining compatibility relative to a vfio migration interface, so we should include that in the name, we don't know what other migration interfaces might exist.
do you mean we need to name it as vfio_migration, e.g. /sys/bus/pci/devices/0000:00:02.0/UUID1/vfio_migration ?
SELF: device_type=pci
Why not the device_api here, ie. vfio-pci. The device doesn't provide a pci interface directly, it's wrapped in a vfio API.
the device_type is to indicate that the device_id below is a pci id.
yes, including a device_api field is better. for mdev, "device_type=vfio-mdev", is it right?
device_id=8086591d
Is device_id interpreted relative to device_type? How does this relate to mdev_type? If we have an mdev_type, doesn't that fully define the software API?
it's the parent pci id for mdev, actually.
mdev_type=i915-GVTg_V5_2
And how are non-mdev devices represented?
non-mdev can opt to not include this field, or as you said below, a vendor signature.
aggregator=1 pv_mode="none+ppgtt+context"
These are meaningless vendor specific matches afaict.
yes, pv_mode and aggregator are vendor specific fields, but they are important for deciding whether two devices are compatible. pv_mode indicates whether a vGPU supports the guest paravirtualized api. "none+ppgtt+context" means the guest may use no pv, ppgtt mode pv, or context mode pv.
interface_version=3
Not much granularity here, I prefer Sean's previous <major>.<minor>[.bugfix] scheme.
yes, the <major>.<minor>[.bugfix] scheme may be better, but I'm not sure if it works for a complicated scenario. e.g. for pv_mode,
(1) initially, pv_mode is not supported, so it's pv_mode=none, and the version is 0.0.0,
(2) then pv_mode=ppgtt is supported, pv_mode="none+ppgtt", and the version is 0.1.0, indicating pv_mode=none can migrate to pv_mode="none+ppgtt", but not vice versa.
(3) later, pv_mode=context is also supported, pv_mode="none+ppgtt+context", so it's 0.2.0.
But if pv_mode=ppgtt is later removed, giving pv_mode="none+context", how should its version be named? "none+ppgtt" (0.1.0) is not compatible to "none+context", but "none+ppgtt+context" (0.2.0) is compatible to "none+context".
Maintaining such a scheme is painful for the vendor driver.
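The difficulty can be shown with a few lines of Python if each pv_mode string is treated as a set of modes. The helper names are made up, and the rule used here (a source can migrate to a target only if every mode the source offers is also offered by the target) is my reading of the example above, so treat the direction as an assumption.

```python
def modes(pv_mode):
    """Turn 'none+ppgtt' into the set {'none', 'ppgtt'}."""
    return frozenset(pv_mode.split("+"))


def pv_compatible(source, target):
    # Assumption: the guest may be using any mode the source offers,
    # so the target must offer all of them.
    return modes(source) <= modes(target)


v0_1_0 = "none+ppgtt"
v0_2_0 = "none+ppgtt+context"
new = "none+context"  # ppgtt support removed; which version is this?

print(pv_compatible("none", v0_1_0))  # True
print(pv_compatible(new, v0_2_0))     # True
print(pv_compatible(v0_1_0, new))     # False
print(pv_compatible(new, v0_1_0))     # False
```

Under this reading "none+ppgtt" and "none+context" are incomparable in both directions, while both can move to "none+ppgtt+context". A single increasing minor number always orders any two versions, so it cannot describe that relationship by itself.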
COMPATIBLE: device_type=pci device_id=8086591d mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
this mixed notation will be hard to parse so I would avoid that.
Some background, Intel has been proposing aggregation as a solution to how we scale mdev devices when hardware exposes large numbers of assignable objects that can be composed in essentially arbitrary ways. So for instance, if we have a workqueue (wq), we might have an mdev type for 1wq, 2wq, 3wq,... Nwq. It's not really practical to expose a discrete mdev type for each of those, so they want to define a base type which is composable to other types via this aggregation. This is what this substitution and tagging is attempting to accomplish. So imagine this set of values for cases where it's not practical to unroll the values for N discrete types.
aggregator={val1}/2
So the {val1} above would be substituted here, though an aggregation factor of 1/2 is a head scratcher...
pv_mode={val2:string:"none+ppgtt","none+context","none+ppgtt+context"}
I'm lost on this one though. I think maybe it's indicating that it's compatible with any of these, so do we need to list it? Couldn't this be handled by Sean's version proposal where the minor version represents feature compatibility?
yes, it's indicating that it's compatible with any of these. Sean's version proposal may also work, but it would be painful for vendor driver to maintain the versions when multiple similar features are involved.
interface_version={val3:int:2,3}
What does this turn into in a few years, 2,7,12,23,75,96,...
is a range better?
COMPATIBLE:
  device_type=pci
  device_id=8086591d
  mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
  aggregator={val1}/2
  pv_mode="" #"" meaning empty, could be absent in a compatible device
  interface_version=1
Why can't this be represented within the previous compatible description?
actually it can be merged with the previous one :) But I guess there must be one that cannot merge, so put it as an example to demo multiple COMPATIBLE objects.
Thanks Yan
if you presented this information, the only way i could see to use it would be to extract the mdev_type name and interface_version and build a database table as follows:
source_mdev_type | source_version | target_mdev_type                | target_version
i915-GVTg_V5_2   | 3              | i915-GVTg_V5_{val1:int:1,2,4,8} | {val3:int:2,3}
i915-GVTg_V5_2   | 3              | i915-GVTg_V5_{val1:int:1,2,4,8} | 1
this would either require us to use a post-placement scheduler filter to introspect this database, or to transform the target_mdev_type and target_version column data into CUSTOM_* traits that we apply to our placement resource providers, and we would have to perform multiple requests, one for each possible compatible alternative. if the vm has multiple mdevs this is combinatorially problematic, as it is 1 query for each device * the number of possible compatible devices for that device.
in other words, if this is just opaque data we can't ever represent it efficiently in our placement service and have to fall back to an expensive post-placement scheduler filter based on the db table approach.
this also ignores the fact that at present the mdev_type cannot change during a migration, so a compatible device with a different mdev type would not be considered an acceptable choice in openstack. the way you select a host with a specific vgpu mdev type today is to apply a custom trait, CUSTOM_<mdev_type_goes_here>, to the vGPU resource provider, and then in the flavor you request 1 allocation of vGPU and require the CUSTOM_<mdev_type_goes_here> trait. so going from i915-GVTg_V5_2 to i915-GVTg_V5_{val1:int:1,2,4,8} would not currently be compatible with that workflow.
The latter would need to be parsed into:
i915-GVTg_V5_1
i915-GVTg_V5_2
i915-GVTg_V5_4
i915-GVTg_V5_8
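As a rough illustration of that unrolling step, here is a minimal Python sketch. It assumes the variable syntax from the proposal ({name:int:v1,v2,...}) and handles a single integer variable only; the function name expand_types is hypothetical and not part of any proposed interface.

```python
import re

# Matches the first-appearance form of an integer variable, e.g. {val1:int:1,2,4,8}.
VAR_DECL = re.compile(r"\{(\w+):int:([\d,]+)\}")

def expand_types(template):
    """Unroll one integer variable declaration into discrete type names."""
    match = VAR_DECL.search(template)
    if match is None:
        return [template]  # no variable present, nothing to expand
    values = match.group(2).split(",")
    # Replace the whole {...} declaration with each allowed value in turn.
    return [template[:match.start()] + v + template[match.end():] for v in values]

for name in expand_types("i915-GVTg_V5_{val1:int:1,2,4,8}"):
    print(name)
```

Each unrolled name would then need its own trait or query on the orchestrator side, which is the cost being discussed here.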
There is also on the table, migration from physical devices to mdev devices (or vice versa), which is not represented in these examples, nor do I see how we'd represent it. This is where I started exposing the resulting PCI device from the mdev in my example so we could have some commonality between devices, but the migration stream provider is just as important as the type of device, we could have different host drivers providing the same device with incompatible migration streams. The mdev_type encompasses both the driver and device, but we wouldn't have mdev_types for physical devices, per our current thinking.
#cat /sys/bus/pci/devices/0000:00:02.0/UUID2/migration_compatible
SELF:
  device_type=pci
  device_id=8086591d
  mdev_type=i915-GVTg_V5_4
  aggregator=2
  interface_version=1
COMPATIBLE:
  device_type=pci
  device_id=8086591d
  mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
  aggregator={val1}/2
  interface_version=1
by the way, this is closer to yaml format than it is to json, but it does not align with any existing format i know of, so that just makes the representation needlessly hard to consume. if we are going to use a markup language, let's use a standard one like yaml, json or toml and not invent a new one.
Notes:
A COMPATIBLE object is a line starting with COMPATIBLE. It specifies a list of compatible devices that are allowed to migrate in. The reason to allow multiple COMPATIBLE objects is that when it is hard to express a complex compatible logic in one COMPATIBLE object, a simple enumeration is still a fallback. in the above example, device UUID2 is in the compatible list of device UUID1, but device UUID1 is not in the compatible list of device UUID2, so device UUID2 is able to migrate to device UUID1, but device UUID1 is not able to migrate to device UUID2.
fields under each object are of "and" relationship to each other, meaning all fields of SELF object of a target device must be equal to corresponding fields of a COMPATIBLE object of source device, otherwise it is regarded as not compatible.
each field, however, is able to specify multiple allowed values, using variables as explained below.
variables are represented with {}, the first appearance of one variable specifies its type and allowed list. e.g. {val1:int:1,2,4,8} represents val1 whose type is integer and allowed values are 1, 2, 4, 8.
vendors are able to specify which fields are within the comparing list and which fields are not. e.g. for physical VF migration, it may not choose mdev_type as a comparing field, and maybe use driver name instead.
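To make the matching rules in these notes concrete, here is a minimal Python sketch of how userspace might test a target SELF object against one source COMPATIBLE object. It is only an illustration under stated assumptions: the objects are taken as already parsed into dicts, the function names are hypothetical, and an expression such as {val1}/2 is assumed to mean exact integer division, which the thread itself leaves open.

```python
import itertools
import re

DECL = re.compile(r"\{(\w+):(?:int|string):([^}]*)\}")  # {val1:int:1,2,4,8}
REF = re.compile(r"\{(\w+)\}")                          # {val1}
DIVISION = re.compile(r"(\d+)/(\d+)")

def _evaluate(text):
    """Reduce a plain 'N/M' expression to its quotient when the division is exact."""
    m = DIVISION.fullmatch(text)
    if m and int(m.group(2)) != 0 and int(m.group(1)) % int(m.group(2)) == 0:
        return str(int(m.group(1)) // int(m.group(2)))
    return text

def matches(target_self, source_compatible):
    """True if every COMPATIBLE field equals the corresponding SELF field
    for at least one consistent assignment of the declared variables."""
    variables = {}
    for pattern in source_compatible.values():
        for name, values in DECL.findall(pattern):
            variables[name] = [v.strip().strip('"') for v in values.split(",")]
    names = list(variables)
    # Try every combination of allowed values; fields are ANDed together.
    for combo in itertools.product(*(variables[n] for n in names)):
        binding = dict(zip(names, combo))

        def bind(pattern):
            pattern = DECL.sub(lambda m: binding[m.group(1)], pattern)
            pattern = REF.sub(lambda m: binding.get(m.group(1), m.group(0)), pattern)
            return _evaluate(pattern)

        if all(target_self.get(k) == bind(p) for k, p in source_compatible.items()):
            return True
    return False

source_compatible = {
    "device_type": "pci",
    "device_id": "8086591d",
    "mdev_type": "i915-GVTg_V5_{val1:int:1,2,4,8}",
    "aggregator": "{val1}/2",
    "interface_version": "1",
}
target_self = {
    "device_type": "pci",
    "device_id": "8086591d",
    "mdev_type": "i915-GVTg_V5_4",
    "aggregator": "2",
    "interface_version": "1",
}
print(matches(target_self, source_compatible))
```

With several COMPATIBLE objects, the caller would simply accept the target if any one of them matches.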
this format might be useful to vendors, but from an orchestrator perspective i don't think this has value to us. likely we would not use this api if it was added, as it does not help us with scheduling. ideally, instead of declaring which other mdev types a device is compatible with (which could presumably change over time as new devices and firmware are released), i would prefer to see a declarative non vendor specific api that declares the feature set provided by each mdev_type, from which we can infer compatibility, similar to cpu feature flags. for devices of the same mdev_type name, a declarative version string could additionally be used if required for additional compatibility checks.
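A minimal Python sketch of the feature-flag style check suggested above, in the spirit of cpu feature flags: a target is acceptable if it offers every feature the source instance relies on. The feature names reuse the pv_mode values from this thread purely as examples, and the function name is hypothetical.

```python
def features_allow_migration(source_in_use, target_supported):
    """True if every feature the source instance uses is offered by the target."""
    return set(source_in_use) <= set(target_supported)

# Example feature sets, borrowed from the pv_mode discussion for illustration only.
source = {"ppgtt"}
target_a = {"ppgtt", "context"}
target_b = {"context"}

print(features_allow_migration(source, target_a))
print(features_allow_migration(source, target_b))
```

The appeal for a scheduler is that such sets can be declared ahead of time and compared without understanding what each feature means.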
"non vendor specific api that declares the feature set", aren't features generally vendor specific? What we're trying to describe is, by its very nature, vendor specific. We don't have an ISO body defining a graphics adapter and enumerating features for that adapter. I think what we have is mdev_types. Each type is supposed to define a specific software interface, perhaps even more so than is done by a PCI vendor:device ID. Maybe that mdev_type needs to be abstracted as something more like a vendor signature, such that a physical device could provide or accept a vendor signature that's compatible with an mdev device. For example, a physically assigned Intel GPU might expose a migration signature of i915-GVTg_v5_8 if it were designed to be compatible with that mdev_type. Thanks,
Alex
On Thu, 2020-07-30 at 11:41 +0800, Yan Zhao wrote:
interface_version=3
Not much granularity here, I prefer Sean's previous <major>.<minor>[.bugfix] scheme.
yes, <major>.<minor>[.bugfix] scheme may be better, but I'm not sure if it works for a complicated scenario. e.g for pv_mode, (1) initially, pv_mode is not supported, so it's pv_mode=none, it's 0.0.0, (2) then, pv_mode=ppgtt is supported, pv_mode="none+ppgtt", it's 0.1.0, indicating pv_mode=none can migrate to pv_mode="none+ppgtt", but not vice versa. (3) later, pv_mode=context is also supported, pv_mode="none+ppgtt+context", so it's 0.2.0.
But if later, pv_mode=ppgtt is removed. pv_mode="none+context", how to name its version?
it would become 1.0.0. addition of a feature is a minor version bump as it's backwards compatible: if you don't request the new feature you don't need to use it, and it can continue to behave like a 0.0.0 device even if it's capable of acting as a 0.1.0 device. when you remove a feature, that is backward incompatible, as any instance that was previously using it would no longer work, so you have to bump the major version.
"none+ppgtt" (0.1.0) is not compatible to "none+context", but "none+ppgtt+context" (0.2.0) is compatible to "none+context".
Maintain such scheme is painful to vendor driver.
not really, it's how most software libs are versioned today. some use other schemes, but semantic versioning done right is a concise and easy to consume set of rules: https://semver.org/ however, you are right that it forces vendors to think about backwards and forwards compatibility with each change, which for the most part is a good thing. it goes hand in hand with having stable abi and api definitions to ensure firmware updates and driver changes don't break userspace that depends on the kernel interfaces they expose.
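For illustration, a minimal Python sketch of how a management tool could apply these rules to a <major>.<minor>[.bugfix] string. It assumes the simplest reading: a source can migrate to a target with the same major version and an equal or higher minor version, and the bugfix part is ignored. The function names are hypothetical and nothing here is a defined interface.

```python
def parse_version(text):
    """Parse '<major>.<minor>[.bugfix]' into a 3-tuple of ints (bugfix defaults to 0)."""
    parts = [int(p) for p in text.split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])

def version_allows_migration(source, target):
    """Same major version, and the target is at least as new in minor version."""
    s_major, s_minor, _ = parse_version(source)
    t_major, t_minor, _ = parse_version(target)
    return s_major == t_major and t_minor >= s_minor

print(version_allows_migration("0.0.0", "0.1.0"))  # feature added on the target
print(version_allows_migration("0.1.0", "0.0.0"))  # target lacks the feature
print(version_allows_migration("0.2.0", "1.0.0"))  # feature removed, major bumped
```

Under this reading the pv_mode example works out as described: 0.0.0 can move to 0.1.0 but not back, and removing ppgtt forces a new major version.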
On Thu, 30 Jul 2020 11:41:04 +0800 Yan Zhao yan.y.zhao@intel.com wrote:
On Wed, Jul 29, 2020 at 01:12:55PM -0600, Alex Williamson wrote:
On Wed, 29 Jul 2020 12:28:46 +0100 Sean Mooney smooney@redhat.com wrote:
On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote:
On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
On Mon, 27 Jul 2020 15:24:40 +0800 Yan Zhao yan.y.zhao@intel.com wrote:
As you indicate, the vendor driver is responsible for checking version information embedded within the migration stream. Therefore a migration should fail early if the devices are incompatible. Is it
but as I know, currently in VFIO migration protocol, we have no way to get vendor specific compatibility checking string in migration setup stage (i.e. .save_setup stage) before the device is set to _SAVING state. In this way, for devices who does not save device data in precopy stage, the migration compatibility checking is as late as in stop-and-copy stage, which is too late. do you think we need to add the getting/checking of vendor specific compatibility string early in save_setup stage?
hi Alex, after an offline discussion with Kevin, I realized that it may not be a problem if migration compatibility check in vendor driver occurs late in stop-and-copy phase for some devices, because if we report device compatibility attributes clearly in an interface, the chances for libvirt/openstack to make a wrong decision are small.
I think it would be wise for a vendor driver to implement a pre-copy phase, even if only to send version information and verify it at the target. Deciding you have no device state to send during pre-copy does not mean your vendor driver needs to opt-out of the pre-copy phase entirely. Please also note that pre-copy is at the user's discretion, we've defined that we can enter stop-and-copy at any point, including without a pre-copy phase, so I would recommend that vendor drivers validate compatibility at the start of both the pre-copy and the stop-and-copy phases.
ok. got it!
so, do you think we are now arriving at an agreement that we'll give up the read-and-test scheme and start defining one interface (perhaps in json format), from which libvirt/openstack is able to parse and find out the compatibility list of a source mdev/physical device?
Based on the feedback we've received, the previously proposed interface is not viable. I think there's agreement that the user needs to be able to parse and interpret the version information. Using json seems viable, but I don't know if it's the best option. Is there any precedent of markup strings returned via sysfs we could follow?
I found some examples of using formatted string under /sys, mostly under tracing. maybe we can do a similar implementation.
#cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format
name: kvm_mmio
ID: 32
format:
	field:unsigned short common_type; offset:0; size:2; signed:0;
	field:unsigned char common_flags; offset:2; size:1; signed:0;
	field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
	field:int common_pid; offset:4; size:4; signed:1;

	field:u32 type; offset:8; size:4; signed:0;
	field:u32 len; offset:12; size:4; signed:0;
	field:u64 gpa; offset:16; size:8; signed:0;
	field:u64 val; offset:24; size:8; signed:0;

print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read" }, { 2, "write" }), REC->len, REC->gpa, REC->val
this is not json format and it's not super friendly to parse.
#cat /sys/devices/pci0000:00/0000:00:02.0/uevent
DRIVER=vfio-pci
PCI_CLASS=30000
PCI_ID=8086:591D
PCI_SUBSYS_ID=8086:2212
PCI_SLOT_NAME=0000:00:02.0
MODALIAS=pci:v00008086d0000591Dsv00008086sd00002212bc03sc00i00
this is ini format or conf format. this is pretty simple to parse, which would be fine. that said, you could also have a version or capability directory with a file for each key and a single value.
i would prefer to only have to do one read personally, rather than list the files in a directory and then read them all to build the data structure myself, but that is doable. the simple ini format used for uevent seems the best of the 3 options provided above.
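To show how little code the uevent-style KEY=VALUE format needs on the consumer side, here is a minimal Python sketch. The sample text is the uevent output quoted above; the function name is hypothetical, and in practice the text would come from a single read of the sysfs attribute.

```python
def parse_key_values(text):
    """Parse uevent-style 'KEY=VALUE' lines into a dict, skipping malformed lines."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, value = line.split("=", 1)
        fields[key] = value
    return fields

sample = """DRIVER=vfio-pci
PCI_CLASS=30000
PCI_ID=8086:591D
PCI_SUBSYS_ID=8086:2212
PCI_SLOT_NAME=0000:00:02.0
"""
info = parse_key_values(sample)
print(info["DRIVER"], info["PCI_ID"])
```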
Your idea of having both a "self" object and an array of "compatible" objects is perhaps something we can build on, but we must not assume PCI devices at the root level of the object. Providing both the mdev-type and the driver is a bit redundant, since the former includes the latter. We can't have vendor specific versioning schemes though, ie. gvt-version. We need to agree on a common scheme and decide which fields the version is relative to, ex. just the mdev type?
what about making all comparing fields vendor specific? userspace like openstack only needs to parse and compare if target device is within source compatible list without understanding the meaning of each field.
that kind of defeats the reason for having them be parsable. the reason openstack wants to be able to understand the capabilities is so we can statically declare the capabilities of devices ahead of time, so our scheduler can select hosts based on that. if the keys and data are opaque to userspace because they are just random vendor specific blobs, we can't do that.
Agreed, I'm not sure I'm willing to rule out that there could be vendor specific direct match fields, as I included in my example earlier in the thread, but entirely vendor specific defeats much of the purpose here.
I had also proposed fields that provide information to create a compatible type, for example to create a type_x2 device from a type_x1 mdev type, they need to know to apply an aggregation attribute. If we need to explicitly list every aggregation value and the resulting type, I think we run aground of what aggregation was trying to avoid anyway, so we might need to pick a language that defines variable substitution or some kind of tagging. For example if we could define ${aggr} as an integer within a specified range, then we might be able to define a type relative to that value (type_x${aggr}) which requires an aggregation attribute using the same value. I dunno, just spit balling. Thanks,
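The substitution idea above could look roughly like the following Python sketch, which uses the standard library's string.Template to derive a type name and the matching aggregation attribute from the same ${aggr} value. The attribute name "aggregation", the function name and the range bounds are all assumptions made for illustration.

```python
from string import Template

def derived_types(type_template, max_aggr):
    """Pair each derived type name with the aggregation value needed to create it."""
    tmpl = Template(type_template)
    return [
        (tmpl.substitute(aggr=n), {"aggregation": n})
        for n in range(1, max_aggr + 1)
    ]

for name, attrs in derived_types("type_x${aggr}", 4):
    print(name, attrs)
```

The point is that one templated entry stands in for N discrete types, while still telling userspace which attribute value produces each of them.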
what about a migration_compatible attribute under device node like below?
rather than listing compatible devices it would be better if you could declaratively list the features supported, and we could compare those along with a simple semver version string.
#cat /sys/bus/pci/devices/0000:00:02.0/UUID1/migration_compatible
Note that we're defining compatibility relative to a vfio migration interface, so we should include that in the name, we don't know what other migration interfaces might exist.
do you mean we need to name it as vfio_migration, e.g. /sys/bus/pci/devices/0000:00:02.0/UUID1/vfio_migration ?
SELF: device_type=pci
Why not the device_api here, ie. vfio-pci. The device doesn't provide a pci interface directly, it's wrapped in a vfio API.
the device_type is to indicate below device_id is a pci id.
yes, include a device_api field is better. for mdev, "device_type=vfio-mdev", is it right?
No, vfio-mdev is not a device API, it's the driver that attaches to the mdev bus device to expose it through vfio. The device_api exposes the actual interface of the vfio device, it's also vfio-pci for typical mdev devices found on x86, but may be vfio-ccw, vfio-ap, etc... See VFIO_DEVICE_API_PCI_STRING and friends.
device_id=8086591d
Is device_id interpreted relative to device_type? How does this relate to mdev_type? If we have an mdev_type, doesn't that fully define the software API?
it's parent pci id for mdev actually.
If we need to specify the parent PCI ID then something is fundamentally wrong with the mdev_type. The mdev_type should define a unique, software compatible interface, regardless of the parent device IDs. If a i915-GVTg_V5_2 means different things based on the parent device IDs, then different mdev_types should be reported for those parent devices.
mdev_type=i915-GVTg_V5_2
And how are non-mdev devices represented?
non-mdev can opt to not include this field, or as you said below, a vendor signature.
aggregator=1 pv_mode="none+ppgtt+context"
These are meaningless vendor specific matches afaict.
yes, pv_mode and aggregator are vendor specific fields. but they are important to decide whether two devices are compatible. pv_mode means whether a vGPU supports guest paravirtualized api. "none+ppgtt+context" means guest can not use pv, or use ppgtt mode pv or use context mode pv.
interface_version=3
Not much granularity here, I prefer Sean's previous <major>.<minor>[.bugfix] scheme.
yes, <major>.<minor>[.bugfix] scheme may be better, but I'm not sure if it works for a complicated scenario. e.g for pv_mode, (1) initially, pv_mode is not supported, so it's pv_mode=none, it's 0.0.0, (2) then, pv_mode=ppgtt is supported, pv_mode="none+ppgtt", it's 0.1.0, indicating pv_mode=none can migrate to pv_mode="none+ppgtt", but not vice versa. (3) later, pv_mode=context is also supported, pv_mode="none+ppgtt+context", so it's 0.2.0.
But if later, pv_mode=ppgtt is removed. pv_mode="none+context", how to name its version? "none+ppgtt" (0.1.0) is not compatible to "none+context", but "none+ppgtt+context" (0.2.0) is compatible to "none+context".
If pv_mode=ppgtt is removed, then the compatible versions would be 0.0.0 or 1.0.0, ie. the major version would be incremented due to feature removal.
Maintain such scheme is painful to vendor driver.
Migration compatibility is painful, there's no way around that. I think the version scheme is an attempt to push some of that low level burden on the vendor driver, otherwise the management tools need to work on an ever growing matrix of vendor specific features which is going to become unwieldy and is largely meaningless outside of the vendor driver. Instead, the vendor driver can make strategic decisions about where to continue to maintain a support burden and make explicit decisions to maintain or break compatibility. The version scheme is a simplification and abstraction of vendor driver features in order to create a small, logical compatibility matrix. Compromises necessarily need to be made for that to occur.
COMPATIBLE: device_type=pci device_id=8086591d mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
this mixed notation will be hard to parse so i would avoid that.
Some background, Intel has been proposing aggregation as a solution to how we scale mdev devices when hardware exposes large numbers of assignable objects that can be composed in essentially arbitrary ways. So for instance, if we have a workqueue (wq), we might have an mdev type for 1wq, 2wq, 3wq,... Nwq. It's not really practical to expose a discrete mdev type for each of those, so they want to define a base type which is composable to other types via this aggregation. This is what this substitution and tagging is attempting to accomplish. So imagine this set of values for cases where it's not practical to unroll the values for N discrete types.
aggregator={val1}/2
So the {val1} above would be substituted here, though an aggregation factor of 1/2 is a head scratcher...
pv_mode={val2:string:"none+ppgtt","none+context","none+ppgtt+context"}
I'm lost on this one though. I think maybe it's indicating that it's compatible with any of these, so do we need to list it? Couldn't this be handled by Sean's version proposal where the minor version represents feature compatibility?
yes, it's indicating that it's compatible with any of these. Sean's version proposal may also work, but it would be painful for vendor driver to maintain the versions when multiple similar features are involved.
This is something vendor drivers need to consider when adding and removing features.
interface_version={val3:int:2,3}
What does this turn into in a few years, 2,7,12,23,75,96,...
is a range better?
I was really trying to point out that sparseness becomes an issue if the vendor driver is largely disconnected from how their feature addition and deprecation affects migration support. Thanks,
Alex
yes, include a device_api field is better. for mdev, "device_type=vfio-mdev", is it right?
No, vfio-mdev is not a device API, it's the driver that attaches to the mdev bus device to expose it through vfio. The device_api exposes the actual interface of the vfio device, it's also vfio-pci for typical mdev devices found on x86, but may be vfio-ccw, vfio-ap, etc... See VFIO_DEVICE_API_PCI_STRING and friends.
ok. got it.
device_id=8086591d
Is device_id interpreted relative to device_type? How does this relate to mdev_type? If we have an mdev_type, doesn't that fully define the software API?
it's parent pci id for mdev actually.
If we need to specify the parent PCI ID then something is fundamentally wrong with the mdev_type. The mdev_type should define a unique, software compatible interface, regardless of the parent device IDs. If a i915-GVTg_V5_2 means different things based on the parent device IDs, then different mdev_types should be reported for those parent devices.
hmm, then do we allow vendor specific fields? or is it a must that a vendor specific field should have corresponding vendor attribute?
another thing is that the definition of mdev_type in GVT only corresponds to vGPU computing ability currently, e.g. i915-GVTg_V5_2, is 1/2 of a gen9 IGD, i915-GVTg_V4_2 is 1/2 of a gen8 IGD. It is too coarse-grained to live migration compatibility.
Do you think we need to update GVT's definition of mdev_type? And is there any guide in mdev_type definition?
mdev_type=i915-GVTg_V5_2
And how are non-mdev devices represented?
non-mdev can opt to not include this field, or as you said below, a vendor signature.
aggregator=1 pv_mode="none+ppgtt+context"
These are meaningless vendor specific matches afaict.
yes, pv_mode and aggregator are vendor specific fields. but they are important to decide whether two devices are compatible. pv_mode means whether a vGPU supports guest paravirtualized api. "none+ppgtt+context" means guest can not use pv, or use ppgtt mode pv or use context mode pv.
interface_version=3
Not much granularity here, I prefer Sean's previous <major>.<minor>[.bugfix] scheme.
yes, <major>.<minor>[.bugfix] scheme may be better, but I'm not sure if it works for a complicated scenario. e.g for pv_mode, (1) initially, pv_mode is not supported, so it's pv_mode=none, it's 0.0.0, (2) then, pv_mode=ppgtt is supported, pv_mode="none+ppgtt", it's 0.1.0, indicating pv_mode=none can migrate to pv_mode="none+ppgtt", but not vice versa. (3) later, pv_mode=context is also supported, pv_mode="none+ppgtt+context", so it's 0.2.0.
But if later, pv_mode=ppgtt is removed. pv_mode="none+context", how to name its version? "none+ppgtt" (0.1.0) is not compatible to "none+context", but "none+ppgtt+context" (0.2.0) is compatible to "none+context".
If pv_mode=ppgtt is removed, then the compatible versions would be 0.0.0 or 1.0.0, ie. the major version would be incremented due to feature removal.
Maintain such scheme is painful to vendor driver.
Migration compatibility is painful, there's no way around that. I think the version scheme is an attempt to push some of that low level burden on the vendor driver, otherwise the management tools need to work on an ever growing matrix of vendor specific features which is going to become unwieldy and is largely meaningless outside of the vendor driver. Instead, the vendor driver can make strategic decisions about where to continue to maintain a support burden and make explicit decisions to maintain or break compatibility. The version scheme is a simplification and abstraction of vendor driver features in order to create a small, logical compatibility matrix. Compromises necessarily need to be made for that to occur.
ok. got it.
COMPATIBLE: device_type=pci device_id=8086591d mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
this mixed notation will be hard to parse so i would avoid that.
Some background, Intel has been proposing aggregation as a solution to how we scale mdev devices when hardware exposes large numbers of assignable objects that can be composed in essentially arbitrary ways. So for instance, if we have a workqueue (wq), we might have an mdev type for 1wq, 2wq, 3wq,... Nwq. It's not really practical to expose a discrete mdev type for each of those, so they want to define a base type which is composable to other types via this aggregation. This is what this substitution and tagging is attempting to accomplish. So imagine this set of values for cases where it's not practical to unroll the values for N discrete types.
aggregator={val1}/2
So the {val1} above would be substituted here, though an aggregation factor of 1/2 is a head scratcher...
pv_mode={val2:string:"none+ppgtt","none+context","none+ppgtt+context"}
I'm lost on this one though. I think maybe it's indicating that it's compatible with any of these, so do we need to list it? Couldn't this be handled by Sean's version proposal where the minor version represents feature compatibility?
yes, it's indicating that it's compatible with any of these. Sean's version proposal may also work, but it would be painful for vendor driver to maintain the versions when multiple similar features are involved.
This is something vendor drivers need to consider when adding and removing features.
interface_version={val3:int:2,3}
What does this turn into in a few years, 2,7,12,23,75,96,...
is a range better?
I was really trying to point out that sparseness becomes an issue if the vendor driver is largely disconnected from how their feature addition and deprecation affects migration support. Thanks,
ok. we'll use the x.y.z scheme then.
Thanks Yan
* Yan Zhao (yan.y.zhao@intel.com) wrote:
Yes, including a device_api field is better. For mdev, is "device_type=vfio-mdev" right?
No, vfio-mdev is not a device API, it's the driver that attaches to the mdev bus device to expose it through vfio. The device_api exposes the actual interface of the vfio device, it's also vfio-pci for typical mdev devices found on x86, but may be vfio-ccw, vfio-ap, etc... See VFIO_DEVICE_API_PCI_STRING and friends.
ok. got it.
device_id=8086591d
Is device_id interpreted relative to device_type? How does this relate to mdev_type? If we have an mdev_type, doesn't that fully define the software API?
it's parent pci id for mdev actually.
If we need to specify the parent PCI ID then something is fundamentally wrong with the mdev_type. The mdev_type should define a unique, software compatible interface, regardless of the parent device IDs. If an i915-GVTg_V5_2 means different things based on the parent device IDs, then different mdev_types should be reported for those parent devices.
hmm, then do we allow vendor specific fields? or is it a must that a vendor specific field should have corresponding vendor attribute?
Another thing is that the definition of mdev_type in GVT currently only corresponds to vGPU computing ability, e.g. i915-GVTg_V5_2 is 1/2 of a gen9 IGD, i915-GVTg_V4_2 is 1/2 of a gen8 IGD. It is too coarse-grained for live migration compatibility.
Can you explain why that's too coarse?
Is this because it's too specific (i.e. that a i915-GVTg_V4_2 could be migrated to a newer device?), or that it's too specific on the exact sizings (i.e. that there may be multiple different sizes of a gen9)?
Dave
Do you think we need to update GVT's definition of mdev_type? And is there any guide in mdev_type definition?
mdev_type=i915-GVTg_V5_2
And how are non-mdev devices represented?
non-mdev can opt to not include this field, or as you said below, a vendor signature.
aggregator=1 pv_mode="none+ppgtt+context"
These are meaningless vendor specific matches afaict.
yes, pv_mode and aggregator are vendor specific fields. but they are important to decide whether two devices are compatible. pv_mode means whether a vGPU supports guest paravirtualized api. "none+ppgtt+context" means guest can not use pv, or use ppgtt mode pv or use context mode pv.
interface_version=3
Not much granularity here, I prefer Sean's previous <major>.<minor>[.bugfix] scheme.
yes, <major>.<minor>[.bugfix] scheme may be better, but I'm not sure if it works for a complicated scenario. e.g for pv_mode, (1) initially, pv_mode is not supported, so it's pv_mode=none, it's 0.0.0, (2) then, pv_mode=ppgtt is supported, pv_mode="none+ppgtt", it's 0.1.0, indicating pv_mode=none can migrate to pv_mode="none+ppgtt", but not vice versa. (3) later, pv_mode=context is also supported, pv_mode="none+ppgtt+context", so it's 0.2.0.
But if later, pv_mode=ppgtt is removed. pv_mode="none+context", how to name its version? "none+ppgtt" (0.1.0) is not compatible to "none+context", but "none+ppgtt+context" (0.2.0) is compatible to "none+context".
If pv_mode=ppgtt is removed, then the compatible versions would be 0.0.0 or 1.0.0, ie. the major version would be incremented due to feature removal.
Maintaining such a scheme is painful for the vendor driver.
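As a rough illustration of the <major>.<minor>[.bugfix] discussion above, here is a minimal sketch of how a management tool might compare two version strings. The rule used (same major, target minor not lower than source minor, bugfix ignored) is only an assumption inferred from the pv_mode example, and the `extra_accepted` list is a hypothetical way for a device to name older versions it still accepts after a major bump (the "0.0.0 or 1.0.0" case). `parse_version` and `can_migrate` are made-up names, not part of any proposed interface.

```python
def parse_version(text):
    """Split '<major>.<minor>[.bugfix]' into a tuple of three ints."""
    parts = [int(p) for p in text.split(".")]
    if len(parts) == 2:
        parts.append(0)  # bugfix is optional
    if len(parts) != 3:
        raise ValueError("expected <major>.<minor>[.bugfix]: %r" % text)
    return tuple(parts)


def can_migrate(source, target, extra_accepted=()):
    """Assumed rule: same major and target minor >= source minor.

    extra_accepted is a hypothetical explicit list of source versions the
    target still accepts across a major bump.
    """
    if parse_version(source) in {parse_version(v) for v in extra_accepted}:
        return True
    s_major, s_minor, _ = parse_version(source)
    t_major, t_minor, _ = parse_version(target)
    return s_major == t_major and s_minor <= t_minor


# pv_mode=none (0.0.0) -> pv_mode="none+ppgtt" (0.1.0), and the reverse
print(can_migrate("0.0.0", "0.1.0"))
print(can_migrate("0.1.0", "0.0.0"))
# after ppgtt is removed (1.0.0), only 0.0.0 is explicitly still accepted
print(can_migrate("0.0.0", "1.0.0", extra_accepted=("0.0.0",)))
print(can_migrate("0.1.0", "1.0.0", extra_accepted=("0.0.0",)))
```

Under these assumptions a feature removal cannot be expressed by the minor number alone, which is why the sketch needs the explicit list.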
-- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Wed, Jul 29, 2020 at 12:28:46PM +0100, Sean Mooney wrote:
On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote:
On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
On Mon, 27 Jul 2020 15:24:40 +0800 Yan Zhao <yan.y.zhao@intel.com> wrote:
As you indicate, the vendor driver is responsible for checking version information embedded within the migration stream. Therefore a migration should fail early if the devices are incompatible. Is it
But as far as I know, the VFIO migration protocol currently gives us no way to get a vendor specific compatibility checking string in the migration setup stage (i.e. the .save_setup stage) before the device is set to the _SAVING state. As a result, for devices that do not save device data in the precopy stage, the migration compatibility check happens as late as the stop-and-copy stage, which is too late. Do you think we need to add getting/checking of the vendor specific compatibility string early in the save_setup stage?
Hi Alex, after an offline discussion with Kevin, I realized that it may not be a problem if the migration compatibility check in the vendor driver occurs late, in the stop-and-copy phase, for some devices, because if we report device compatibility attributes clearly in an interface, the chance of libvirt/openstack making a wrong decision is small.
I think it would be wise for a vendor driver to implement a pre-copy phase, even if only to send version information and verify it at the target. Deciding you have no device state to send during pre-copy does not mean your vendor driver needs to opt-out of the pre-copy phase entirely. Please also note that pre-copy is at the user's discretion, we've defined that we can enter stop-and-copy at any point, including without a pre-copy phase, so I would recommend that vendor drivers validate compatibility at the start of both the pre-copy and the stop-and-copy phases.
ok. got it!
So, do you think we are now arriving at an agreement that we'll give up the read-and-test scheme and start defining one interface (perhaps in json format), from which libvirt/openstack is able to parse and find out the compatibility list of a source mdev/physical device?
Based on the feedback we've received, the previously proposed interface is not viable. I think there's agreement that the user needs to be able to parse and interpret the version information. Using json seems viable, but I don't know if it's the best option. Is there any precedent of markup strings returned via sysfs we could follow?
I found some examples of using formatted string under /sys, mostly under tracing. maybe we can do a similar implementation.
#cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format
name: kvm_mmio
ID: 32
format:
        field:unsigned short common_type; offset:0; size:2; signed:0;
        field:unsigned char common_flags; offset:2; size:1; signed:0;
        field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
        field:int common_pid; offset:4; size:4; signed:1;

        field:u32 type; offset:8; size:4; signed:0;
        field:u32 len; offset:12; size:4; signed:0;
        field:u64 gpa; offset:16; size:8; signed:0;
        field:u64 val; offset:24; size:8; signed:0;

print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read" }, { 2, "write" }), REC->len, REC->gpa, REC->val
This is not json format and it's not super friendly to parse.
yes, it's just an example. It's exported to be used by userspace perf & trace_cmd.
#cat /sys/devices/pci0000:00/0000:00:02.0/uevent
DRIVER=vfio-pci
PCI_CLASS=30000
PCI_ID=8086:591D
PCI_SUBSYS_ID=8086:2212
PCI_SLOT_NAME=0000:00:02.0
MODALIAS=pci:v00008086d0000591Dsv00008086sd00002212bc03sc00i00
This is ini format or conf format. This is pretty simple to parse, which would be fine. That said, you could also have a version or capability directory with a file for each key and a single value.
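To show how little code the uevent-style layout needs on the consumer side, here is a minimal sketch of a parser for `KEY=value` lines. The function name `parse_key_value` is hypothetical, and the sample text is copied from the uevent output quoted in this thread rather than read from a live system.

```python
def parse_key_value(text):
    """Parse uevent-style 'KEY=value' lines into a dict."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, _, value = line.partition("=")  # split on the first '=' only
        result[key] = value
    return result


sample = """DRIVER=vfio-pci
PCI_CLASS=30000
PCI_ID=8086:591D
PCI_SUBSYS_ID=8086:2212
PCI_SLOT_NAME=0000:00:02.0
"""
info = parse_key_value(sample)
print(info["DRIVER"], info["PCI_ID"])
```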
If this is easy for openstack, maybe we can organize the data as below?

|- [device]
   |- migration
      |- self
      |- compatible1
      |- compatible2

e.g.
#cat /sys/bus/pci/devices/0000:00:02.0/UUID1/migration/self
field1=xxx
field2=xxx
field3=xxx
field3=xxx
#cat /sys/bus/pci/devices/0000:00:02.0/UUID1/migration/compatible
field1=xxx
field2=xxx
field3=xxx
field3=xxx

or in a flat layer

|- [device]
   |- migration-self-traits
   |- migration-compatible-traits
I'm not sure whether json format in a single file is better, as I didn't find any precedent.
Personally I would prefer to only have to do one read, rather than list the files in a directory and then read them all to build the data structure myself, but that is doable. The simple ini format used for uevent seems the best of the 3 options provided above.
Your idea of having both a "self" object and an array of "compatible" objects is perhaps something we can build on, but we must not assume PCI devices at the root level of the object. Providing both the mdev-type and the driver is a bit redundant, since the former includes the latter. We can't have vendor specific versioning schemes though, ie. gvt-version. We need to agree on a common scheme and decide which fields the version is relative to, ex. just the mdev type?
what about making all comparing fields vendor specific? userspace like openstack only needs to parse and compare if target device is within source compatible list without understanding the meaning of each field.
That kind of defeats the reason for having them be parsable. The reason openstack wants to be able to understand the capabilities is so we can statically declare the capabilities of devices ahead of time, so our scheduler can select a host based on that. If the keys and data are opaque to userspace because they are just random vendor specific blobs, we can't do that.
I heard that cyborg can parse the kernel interface and generate several traits without understanding the meaning of each trait. Then it reports those traits to placement for scheduling.
but I agree if mdev creation is involved, those traits need to match to mdev attributes and mdev_type.
could you explain a little how you plan to create a target mdev device? is it dynamically created during searching of compatible mdevs or just statically created before migration?
I had also proposed fields that provide information to create a compatible type, for example to create a type_x2 device from a type_x1 mdev type, they need to know to apply an aggregation attribute. If we need to explicitly list every aggregation value and the resulting type, I think we run aground of what aggregation was trying to avoid anyway, so we might need to pick a language that defines variable substitution or some kind of tagging. For example if we could define ${aggr} as an integer within a specified range, then we might be able to define a type relative to that value (type_x${aggr}) which requires an aggregation attribute using the same value. I dunno, just spit balling. Thanks,
what about a migration_compatible attribute under device node like below?
Rather than listing compatible devices, it would be better if you could declaratively list the features supported, and we could compare those along with a simple semver version string.
I think below is already in a way of listing feature supported. The reason I also want to declare compatible lists of features is that sometimes it's not a simple 1:1 matching of source list and target list. as I demonstrated below, source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible to target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2), (mdev_type i915-GVTg_V5_8 + aggregator 4)
And aggregator may be just one such example where 1:1 matching does not fit. So I guess it's best not to leave the hard decision to openstack.
Thanks Yan
#cat /sys/bus/pci/devices/0000:00:02.0/UUID1/migration_compatible
SELF:
  device_type=pci
  device_id=8086591d
  mdev_type=i915-GVTg_V5_2
  aggregator=1
  pv_mode="none+ppgtt+context"
  interface_version=3
COMPATIBLE:
  device_type=pci
  device_id=8086591d
  mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
this mixed notation will be hard to parse so i would avoid that.
  aggregator={val1}/2
  pv_mode={val2:string:"none+ppgtt","none+context","none+ppgtt+context"}
  interface_version={val3:int:2,3}
COMPATIBLE:
  device_type=pci
  device_id=8086591d
  mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
  aggregator={val1}/2
  pv_mode="" #"" meaning empty, could be absent in a compatible device
  interface_version=1
If you presented this information, the only way I could see to use it would be to extract the mdev_type name and interface_version and build a database table as follows:

source_mdev_type | source_version | target_mdev_type                | target_version
i915-GVTg_V5_2   | 3              | i915-GVTg_V5_{val1:int:1,2,4,8} | {val3:int:2,3}
i915-GVTg_V5_2   | 3              | i915-GVTg_V5_{val1:int:1,2,4,8} | 1

This would either require us to use a post-placement scheduler filter to introspect this database, or to transform the target_mdev_type and target_version column data into CUSTOM_* traits that we apply to our placement resource providers, and we would have to perform multiple requests, one for each possible compatible alternative. If the VM has multiple mdevs this is combinatorially problematic, as it is 1 query for each device * the number of possible compatible devices for that device.
In other words, if this is just opaque data we can't ever represent it efficiently in our placement service, and we have to fall back to an expensive post-placement scheduler filter based on the db table approach.
This also ignores the fact that at present the mdev_type cannot change during a migration, so a compatible device with a different mdev type would not be considered an acceptable choice in openstack. The way you select a host with a specific vGPU mdev type today is to apply a custom trait, CUSTOM_<mdev_type_goes_here>, to the vGPU resource provider, and then in the flavor you request 1 allocation of vGPU and require the CUSTOM_<mdev_type_goes_here> trait. So going from i915-GVTg_V5_2 to i915-GVTg_V5_{val1:int:1,2,4,8} would not currently be compatible with that workflow.
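To put a number on the concern about multiple requests, the following sketch expands the two table rows above into concrete (mdev_type, version) alternatives. It is only one reading of the "1 query for each device * the number of possible compatible devices" estimate; the `rows` structure and the `num_devices` value are assumptions made for illustration.

```python
import itertools

# The two table rows, with the variable notation already split into
# the allowed mdev_type suffixes and the allowed interface versions.
rows = [
    {"suffixes": [1, 2, 4, 8], "versions": [2, 3]},
    {"suffixes": [1, 2, 4, 8], "versions": [1]},
]

# Every concrete (target_mdev_type, target_version) pair a scheduler
# would have to ask about for one source device.
alternatives = [
    ("i915-GVTg_V5_%d" % suffix, version)
    for row in rows
    for suffix, version in itertools.product(row["suffixes"], row["versions"])
]
print(len(alternatives))  # 12

# Assumed: a VM with two such mdevs, one request per device per alternative.
num_devices = 2
queries = num_devices * len(alternatives)
print(queries)  # 24
```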
#cat /sys/bus/pci/devices/0000:00:02.0/UUID2/migration_compatible
SELF:
  device_type=pci
  device_id=8086591d
  mdev_type=i915-GVTg_V5_4
  aggregator=2
  interface_version=1
COMPATIBLE:
  device_type=pci
  device_id=8086591d
  mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
  aggregator={val1}/2
  interface_version=1
By the way, this is closer to yaml format than it is to json, but it does not align with any existing format I know of, so that just makes the representation needlessly hard to consume. If we are going to use a markup language, let's use a standard one like yaml, json or toml and not invent a new one.
Notes:
A COMPATIBLE object is a line starting with COMPATIBLE. It specifies a list of compatible devices that are allowed to migrate in. The reason to allow multiple COMPATIBLE objects is that when it is hard to express a complex compatible logic in one COMPATIBLE object, a simple enumeration is still a fallback. in the above example, device UUID2 is in the compatible list of device UUID1, but device UUID1 is not in the compatible list of device UUID2, so device UUID2 is able to migrate to device UUID1, but device UUID1 is not able to migrate to device UUID2.
fields under each object are of "and" relationship to each other, meaning all fields of SELF object of a target device must be equal to corresponding fields of a COMPATIBLE object of source device, otherwise it is regarded as not compatible.
each field, however, is able to specify multiple allowed values, using variables as explained below.
variables are represented with {}, the first appearance of one variable specifies its type and allowed list. e.g. {val1:int:1,2,4,8} represents val1 whose type is integer and allowed values are 1, 2, 4, 8.
vendors are able to specify which fields are within the comparing list and which fields are not. e.g. for physical VF migration, it may not choose mdev_type as a comparing field, and maybe use driver name instead.
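The matching rules in these notes can be sketched as follows. This is only an illustration under stated assumptions: objects are held as plain dicts, only fields present in the COMPATIBLE object are compared, an empty pattern matches an absent field, and a resolved value of the form N/M is treated as an integer division that must be exact. The names `declared_variables`, `resolve` and `matches` are hypothetical.

```python
import itertools
import re

DECL = re.compile(r"\{(\w+):(int|string):([^}]*)\}")   # {val1:int:1,2,4,8}
REF = re.compile(r"\{(\w+)(?::[^}]*)?\}")              # declaration or {val1}
DIVISION = re.compile(r"^(\d+)/(\d+)$")


def declared_variables(compatible):
    """Collect {name:type:v1,v2,...} declarations into name -> list of values."""
    variables = {}
    for pattern in compatible.values():
        for name, _type, values in DECL.findall(pattern):
            variables[name] = [v.strip().strip('"') for v in values.split(",")]
    return variables


def resolve(pattern, assignment):
    """Substitute variables, then evaluate 'N/M' if it divides evenly."""
    text = REF.sub(lambda m: assignment[m.group(1)], pattern).strip('"')
    division = DIVISION.match(text)
    if division:
        numerator, denominator = int(division.group(1)), int(division.group(2))
        if denominator == 0 or numerator % denominator:
            return None  # e.g. 1/2 has no integer result, so nothing matches
        return str(numerator // denominator)
    return text


def matches(self_obj, compatible):
    """True if some variable assignment makes every COMPATIBLE field equal SELF."""
    variables = declared_variables(compatible)
    names = list(variables)
    for values in itertools.product(*(variables[n] for n in names)):
        assignment = dict(zip(names, values))
        if all(resolve(pattern, assignment) == self_obj.get(field, "").strip('"')
               for field, pattern in compatible.items()):
            return True
    return False


uuid1_self = {"device_type": "pci", "device_id": "8086591d",
              "mdev_type": "i915-GVTg_V5_2", "aggregator": "1",
              "pv_mode": '"none+ppgtt+context"', "interface_version": "3"}
uuid2_self = {"device_type": "pci", "device_id": "8086591d",
              "mdev_type": "i915-GVTg_V5_4", "aggregator": "2",
              "interface_version": "1"}
uuid1_second_compatible = {"device_type": "pci", "device_id": "8086591d",
                           "mdev_type": "i915-GVTg_V5_{val1:int:1,2,4,8}",
                           "aggregator": "{val1}/2", "pv_mode": '""',
                           "interface_version": "1"}
uuid2_compatible = {"device_type": "pci", "device_id": "8086591d",
                    "mdev_type": "i915-GVTg_V5_{val1:int:1,2,4,8}",
                    "aggregator": "{val1}/2", "interface_version": "1"}

print(matches(uuid2_self, uuid1_second_compatible))  # UUID2 is in UUID1's list
print(matches(uuid1_self, uuid2_compatible))         # UUID1 is not in UUID2's list
```

Trying every assignment is simple but grows with the product of the allowed lists, which is acceptable only for small lists like the ones in the example.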
This format might be useful to vendors, but from an orchestrator perspective I don't think it has value to us. Likely we would not use this api if it was added, as it does not help us with scheduling. Ideally, instead of declaring which other mdev types a device is compatible with (which could presumably change over time as new devices and firmware are released), I would prefer to see a declarative, non vendor specific api that declares the feature set provided by each mdev_type, from which we can infer compatibility, similar to cpu feature flags. For devices of the same mdev_type name, a declarative version string could additionally be used if required for additional compatibility checks.
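A minimal sketch of the cpu-feature-flag style comparison suggested here, assuming the rule is that the target must offer every feature the source exposes. The flag names are borrowed from the pv_mode example purely for illustration, and `feature_compatible` is a hypothetical name.

```python
def feature_compatible(source_features, target_features):
    """Assumed rule: the target must provide a superset of the source features."""
    return set(source_features) <= set(target_features)


source = {"ppgtt"}
target = {"ppgtt", "context"}
print(feature_compatible(source, target))  # target has everything the source uses
print(feature_compatible(target, source))  # source lacks "context"
```

With flags like these a scheduler can store each one as a trait and never needs an explicit list of compatible devices.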
Thanks Yan
On Thu, 2020-07-30 at 09:56 +0800, Yan Zhao wrote:
If this is easy for openstack, maybe we can organize the data as below?

|- [device]
   |- migration
      |- self
      |- compatible1
      |- compatible2

e.g.
#cat /sys/bus/pci/devices/0000:00:02.0/UUID1/migration/self
field1=xxx
field2=xxx
field3=xxx
field3=xxx
#cat /sys/bus/pci/devices/0000:00:02.0/UUID1/migration/compatible
field1=xxx
field2=xxx
field3=xxx
field3=xxx
Ya, this would work. Nova, specifically the libvirt driver, tries to avoid reading sysfs directly if libvirt has an api that provides the information, but where it does not, it can read it, and that structure would be easy for us to consume.
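As a sketch of how a consumer might read the nested layout, the code below loads a `self` file and every file whose name starts with `compatible` from one directory. The directory layout is the one proposed in this thread and does not exist in any kernel; the demo runs against a temporary directory standing in for sysfs, and `read_traits` and `read_migration_dir` are hypothetical names.

```python
import os
import tempfile


def read_traits(path):
    """Read a file with one 'key=value' per line into a dict."""
    traits = {}
    with open(path) as f:
        for line in f:
            key, sep, value = line.strip().partition("=")
            if sep:  # ignore lines without '='
                traits[key] = value
    return traits


def read_migration_dir(migration_dir):
    """Return (self_traits, [compatible_traits, ...]) for the proposed layout."""
    self_traits = read_traits(os.path.join(migration_dir, "self"))
    compatible = [
        read_traits(os.path.join(migration_dir, name))
        for name in sorted(os.listdir(migration_dir))
        if name.startswith("compatible")
    ]
    return self_traits, compatible


# Demo against a temporary directory standing in for sysfs.
with tempfile.TemporaryDirectory() as d:
    for name, body in [("self", "field1=a\nfield2=b\n"),
                       ("compatible1", "field1=a\nfield2=c\n")]:
        with open(os.path.join(d, name), "w") as f:
            f.write(body)
    demo_self, demo_compatible = read_migration_dir(d)
    print(demo_self, demo_compatible)
```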
Libs like os-vif, which can't depend on libvirt, use it a little more, for example to look up a PF from one of its VFs: https://github.com/openstack/os-vif/blob/master/vif_plug_ovs/linux_net.py#L3...
We are careful not to overuse sysfs, as it can change over time based on kernel version in some cases, but it is generally seen as preferable to calling an ever growing list of command line clients to retrieve the same info.
or in a flat layer

|- [device]
   |- migration-self-traits
   |- migration-compatible-traits
I'm not sure whether json format in a single file is better, as I didn't find any precedent.
I think I prefer the nested directories to this flattened style, but there isn't really any significant increase in complexity. From a bash scripting point of view, if I was manually debugging something, the multi-layer representation is slightly simpler to work with, but not enough that it really matters.
Personally I would prefer to only have to do one read, rather than list the files in a directory and then read them all to build the data structure myself, but that is doable. The simple ini format used for uevent seems the best of the 3 options provided above.
Your idea of having both a "self" object and an array of "compatible" objects is perhaps something we can build on, but we must not assume PCI devices at the root level of the object. Providing both the mdev-type and the driver is a bit redundant, since the former includes the latter. We can't have vendor specific versioning schemes though, ie. gvt-version. We need to agree on a common scheme and decide which fields the version is relative to, ex. just the mdev type?
what about making all comparing fields vendor specific? userspace like openstack only needs to parse and compare if target device is within source compatible list without understanding the meaning of each field.
That kind of defeats the reason for having them be parsable. The reason openstack wants to be able to understand the capabilities is so we can statically declare the capabilities of devices ahead of time, so our scheduler can select a host based on that. If the keys and data are opaque to userspace because they are just random vendor specific blobs, we can't do that.
I heard that cyborg can parse the kernel interface and generate several traits without understanding the meaning of each trait. Then it reports those traits to placement for scheduling.
If it is doing a raw passthrough like that, first, it will break users if a vendor ever removes a trait or renames it as part of a firmware update, and second, it will require them to use CUSTOM_ traits instead of standardised traits. In other words, it is an interoperability problem between clouds.
At present cyborg does not support mdevs. There is a proposal for adding a generic mdev driver for generic stateless devices. It could report arbitrary capabilities to placement, although it does not exist yet, so it's kind of premature to point to it as an example.
but I agree if mdev creation is involved, those traits need to match to mdev attributes and mdev_type.
currently the only use of mdevs in openstack is for vGPU with nvidia devices. in theory intel gpus can work with the existing code but it has not been tested.
could you explain a little how you plan to create a target mdev device? is it dynamically created during searching of compatible mdevs or just statically created before migration?
The mdevs are currently created dynamically when a VM is created, based on a set of predefined flavors which have static metadata in the form of flavor extra_specs. Those extra specs request a vGPU by specifying resources:VGPU=1, e.g.
openstack flavor set vgpu_1 --property "resources:VGPU=1"
If you want a specific vGPU type then you must request a custom trait in addition to the resource class:
openstack --os-placement-api-version 1.6 trait create CUSTOM_NVIDIA_11
openstack flavor set --property trait:CUSTOM_NVIDIA_11=required vgpu_1
when configuring the host for vGPUs you list the enabled vgpu mdev types and the device that can use them
[devices]
enabled_vgpu_types = nvidia-35, nvidia-36

[vgpu_nvidia-35]
device_addresses = 0000:84:00.0,0000:85:00.0

[vgpu_nvidia-36]
device_addresses = 0000:86:00.0
Each device that is listed will be created as a resource provider in the placement service, so to associate the custom trait with the physical gpu and mdev type you manually tag the RP with the trait:
openstack --os-placement-api-version 1.6 resource provider trait set \
    --trait CUSTOM_NVIDIA_11 e2f8607b-0683-4141-a8af-f5e20682e28c
This decouples the name of the CUSTOM_ trait from the underlying mdev type, so the operator is free to use small|medium|large or bronze|silver|gold if they want to, or they could choose to use the mdev_type name if they want to.
Currently we don't support live migration with vGPU because the required code has not been upstreamed to qemu/kvm yet? I believe it just missed the kernel 5.7 merge window? I know it's in flight but have not been following too closely.
If you do a cold/offline migration currently and you had multiple mdev types, then technically the mdev type could change. We had planned for operators to ensure that whatever trait they choose would map to the same mdev type on all hosts. If we were to support live migration in the future without this new api we are discussing, we would make the trait to mdev type relationship required to be 1:1 for live migration.
We have talked about auto-creating traits for vGPUs, which would be in the form of CUSTOM_<mdev type>, but shied away from it as we are worried vendors will break us and our users by changing mdev_types in firmware updates or driver updates. We kind of need to rely on them being stable, but we are hesitant to encode them in our public api in this manner.
I had also proposed fields that provide information to create a compatible type, for example to create a type_x2 device from a type_x1 mdev type, they need to know to apply an aggregation attribute.
Honestly, from an openstack point of view I would prefer if each consumable resource was exposed as a different mdev_type and we could just create multiple mdevs and attach them to a VM. That would allow us to do the aggregation ourselves. Parsing mdev attributes and dynamically creating 1 mdev type from an aggregation of others requires detailed knowledge of the vendor device.
The cyborg (accelerator management) project might be open to this because they have pluggable vendor specific drivers and could write a driver that only works with a specific sku of a vendor device or a device family, e.g. a qat driver that could have the required knowledge to do the composition.
That type of low-level device management is out of scope of the nova (compute) project. We would be far more likely to require operators to statically partition the device up front into mdevs and pass us a list of them, which we could then provide to VMs.
We more or less already do this for vGPU today, as the physical gpus need to be declared to support exactly 1 mdev_type each, and the same is true for persistent memory: you need to pre-create the persistent memory namespaces and then provide the list of namespaces to nova.
So aggregation is something I suspect will only be supported in cyborg, if it eventually supports mdevs. It has not been requested or assessed for nova yet, but it seems unlikely. In a migration workflow I would expect the nova conductor or source host to make an rpc call to the destination host in pre live migration to create the mdev. This is before the call to libvirt to migrate the VM and before it would do any validation, but after scheduling. So ideally we should know at this point that the destination host has a compatible device.
If we need to explicitly list every aggregation value and the resulting type, I think we run aground of what aggregation was trying to avoid anyway, so we might need to pick a language that defines variable substitution or some kind of tagging. For example if we could define ${aggr} as an integer within a specified range, then we might be able to define a type relative to that value (type_x${aggr}) which requires an aggregation attribute using the same value. I dunno, just spit balling. Thanks,
what about a migration_compatible attribute under device node like below?
Rather than listing compatible devices, it would be better if you could declaratively list the features supported, and we could compare those along with a simple semver version string.
I think below is already in a way of listing feature supported. The reason I also want to declare compatible lists of features is that sometimes it's not a simple 1:1 matching of source list and target list. as I demonstrated below, source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible to target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2), (mdev_type i915-GVTg_V5_8 + aggregator 4)
And aggregator may be just one such example where 1:1 matching does not fit.
So far I am not convinced that aggregators are a good concept to model at this level. Is there some document that explains why they are needed and why we can't just have multiple mdev_types per consumable resource and attach multiple mdevs to a single VM?
I suspect this is due to limitations in composability in hardware, such as nvidia multi-instance gpu tech. However, (mdev_type i915-GVTg_V5_8 + aggregator 4) seems unfriendly to work with from an orchestrator perspective.
One of our current complaints with the mdev api today is that, depending on the device, consuming an instance of 1 mdev type may impact the availability of others or change the available capacity of others. That makes it very hard to reason about capacity availability, and aggregator sounds like it will make that problem worse, not better.
so I guess it's best not to leave the hard decision to openstack.
Thanks Yan
#cat /sys/bus/pci/devices/0000:00:02.0/UUID1/migration_compatible SELF: device_type=pci device_id=8086591d mdev_type=i915-GVTg_V5_2 aggregator=1 pv_mode="none+ppgtt+context" interface_version=3 COMPATIBLE: device_type=pci device_id=8086591d mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
This mixed notation will be hard to parse, so I would avoid that.
aggregator={val1}/2 pv_mode={val2:string:"none+ppgtt","none+context","none+ppgtt+context"}
interface_version={val3:int:2,3} COMPATIBLE: device_type=pci device_id=8086591d mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8} aggregator={val1}/2 pv_mode="" #"" meaning empty, could be absent in a compatible device interface_version=1
If you presented this information, the only way I could see to use it would be to extract the mdev_type name and interface_version and build a database table as follows:

source_mdev_type | source_version | target_mdev_type                 | target_version
i915-GVTg_V5_2   | 3              | i915-GVTg_V5_{val1:int:1,2,4,8}  | {val3:int:2,3}
i915-GVTg_V5_2   | 3              | i915-GVTg_V5_{val1:int:1,2,4,8}  | 1

This would either require us to use a post-placement scheduler filter to introspect this database, or to transform the target_mdev_type and target_version column data into CUSTOM_* traits that we apply to our placement resource providers, and we would have to perform multiple requests, one for each possible compatible alternative. If the VM has multiple mdevs this is combinatorially problematic, as it is 1 query for each device * the number of possible compatible devices for that device.
In other words, if this is just opaque data, we can't ever represent it efficiently in our placement service and have to fall back to an expensive post-placement scheduler filter based on the DB table approach.
This also ignores the fact that at present the mdev_type cannot change during a migration, so a compatible device with a different mdev type would not be considered an acceptable choice in OpenStack. The way you select a host with a specific vGPU mdev type today is to apply a custom trait, CUSTOM_<mdev_type_goes_here>, to the vGPU resource provider, and then in the flavor you request 1 allocation of vGPU and require the CUSTOM_<mdev_type_goes_here> trait. So going from i915-GVTg_V5_2 to i915-GVTg_V5_{val1:int:1,2,4,8} would not currently be compatible with that workflow.
#cat /sys/bus/pci/devices/0000:00:02.0/UUID2/migration_compatible
SELF:
        device_type=pci
        device_id=8086591d
        mdev_type=i915-GVTg_V5_4
        aggregator=2
        interface_version=1
COMPATIBLE:
        device_type=pci
        device_id=8086591d
        mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
        aggregator={val1}/2
        interface_version=1
By the way, this is closer to YAML format than it is to JSON, but it does not align with any existing format I know of, so that just makes the representation needlessly hard to consume. If we are going to use a markup language, let's use a standard one like YAML, JSON or TOML and not invent a new one.
Notes:
A COMPATIBLE object is a line starting with COMPATIBLE. It specifies a list of compatible devices that are allowed to migrate in. The reason to allow multiple COMPATIBLE objects is that when it is hard to express a complex compatible logic in one COMPATIBLE object, a simple enumeration is still a fallback. in the above example, device UUID2 is in the compatible list of device UUID1, but device UUID1 is not in the compatible list of device UUID2, so device UUID2 is able to migrate to device UUID1, but device UUID1 is not able to migrate to device UUID2.
fields under each object are of "and" relationship to each other, meaning all fields of SELF object of a target device must be equal to corresponding fields of a COMPATIBLE object of source device, otherwise it is regarded as not compatible.
each field, however, is able to specify multiple allowed values, using variables as explained below.
variables are represented with {}, the first appearance of one variable specifies its type and allowed list. e.g. {val1:int:1,2,4,8} represents val1 whose type is integer and allowed values are 1, 2, 4, 8.
vendors are able to specify which fields are within the comparing list and which fields are not. e.g. for physical VF migration, it may not choose mdev_type as a comparing field, and maybe use driver name instead.
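To make the notes above concrete, here is a minimal Python sketch of how user space might evaluate one COMPATIBLE object against a SELF object. It only illustrates the notation and is not part of the proposal: the function names are made up, and the handling of {val1}/2 (exact integer division, with a non-integer result treated as "no match") is an assumption, since the notes do not define it.

```python
import itertools
import re

DECL = re.compile(r"\{(\w+):(int|string):([^}]*)\}")  # e.g. {val1:int:1,2,4,8}
REF = re.compile(r"\{(\w+)\}")                        # e.g. {val1}
DIV = re.compile(r"(\d+)/(\d+)")                      # e.g. 4/2


def declared_variables(compatible):
    """Collect each variable's allowed values from its typed declaration."""
    variables = {}
    for pattern in compatible.values():
        for name, _type, values in DECL.findall(pattern):
            variables.setdefault(name, [v.strip('"') for v in values.split(",")])
    return variables


def substitute(pattern, binding):
    """Fill in variables for one binding, then fold a plain 'a/b' expression."""
    text = DECL.sub(lambda m: binding[m.group(1)], pattern)
    text = REF.sub(lambda m: binding[m.group(1)], text)
    m = DIV.fullmatch(text)
    if m:
        num, den = int(m.group(1)), int(m.group(2))
        if num % den:
            return None  # assumption: a non-integer result never matches
        text = str(num // den)
    return text.strip('"')


def is_compatible(compatible, self_fields):
    """True if some variable binding makes every COMPATIBLE field equal SELF."""
    variables = declared_variables(compatible)
    names = list(variables)
    for combo in itertools.product(*(variables[n] for n in names)):
        binding = dict(zip(names, combo))
        if all(
            substitute(pattern, binding) == self_fields.get(field, "").strip('"')
            for field, pattern in compatible.items()
        ):
            return True
    return False


# Data copied from the UUID1 / UUID2 examples in this thread.
uuid1_self = {"device_type": "pci", "device_id": "8086591d",
              "mdev_type": "i915-GVTg_V5_2", "aggregator": "1",
              "pv_mode": '"none+ppgtt+context"', "interface_version": "3"}
uuid1_compatible = {"device_type": "pci", "device_id": "8086591d",
                    "mdev_type": "i915-GVTg_V5_{val1:int:1,2,4,8}",
                    "aggregator": "{val1}/2", "pv_mode": '""',
                    "interface_version": "1"}
uuid2_self = {"device_type": "pci", "device_id": "8086591d",
              "mdev_type": "i915-GVTg_V5_4", "aggregator": "2",
              "interface_version": "1"}
uuid2_compatible = {"device_type": "pci", "device_id": "8086591d",
                    "mdev_type": "i915-GVTg_V5_{val1:int:1,2,4,8}",
                    "aggregator": "{val1}/2", "interface_version": "1"}

print(is_compatible(uuid1_compatible, uuid2_self))  # UUID2 is in UUID1's list
print(is_compatible(uuid2_compatible, uuid1_self))  # UUID1 is not in UUID2's list
```

The sketch only answers "is this SELF object in that COMPATIBLE object"; which side is the migration source is left to the caller.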
This format might be useful to vendors, but from an orchestrator perspective I don't think it has value to us. Likely we would not use this API if it was added, as it does not help us with scheduling. Ideally, instead of declaring which other mdev types a device is compatible with (which could presumably change over time as new devices and firmware are released), I would prefer to see a declarative, non-vendor-specific API that declares the feature set provided by each mdev_type, from which we can infer compatibility, similar to CPU feature flags. For devices of the same mdev_type name, a declarative version string could additionally be used if required for additional compatibility checks.
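As a rough illustration of the CPU-feature-flag style of check suggested above, a sketch in Python could look like the following. No such API exists in this thread: the function name is hypothetical, the feature names are borrowed from the pv_mode strings in the earlier example, and the rule "the target must offer every feature the source exposes" is an assumption carried over from how CPU flags are usually compared.

```python
def features_compatible(source_features, target_features):
    """Return (ok, missing): ok is True when the target offers every
    feature the source device exposes; missing lists the gaps."""
    missing = set(source_features) - set(target_features)
    return not missing, sorted(missing)


# Hypothetical feature sets for two devices of the same mdev_type.
print(features_compatible({"ppgtt"}, {"ppgtt", "context"}))
print(features_compatible({"ppgtt", "context"}, {"ppgtt"}))
```

A check of this shape maps onto plain set membership, which is the property that makes it easier to represent in a scheduler than an opaque compatibility string.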
Thanks Yan
[sorry about not chiming in earlier]
On Wed, 29 Jul 2020 16:05:03 +0800 Yan Zhao yan.y.zhao@intel.com wrote:
On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
(...)
Based on the feedback we've received, the previously proposed interface is not viable. I think there's agreement that the user needs to be able to parse and interpret the version information. Using json seems viable, but I don't know if it's the best option. Is there any precedent of markup strings returned via sysfs we could follow?
I don't think encoding complex information in a sysfs file is a viable approach. Quoting Documentation/filesystems/sysfs.rst:
"Attributes should be ASCII text files, preferably with only one value per file. It is noted that it may not be efficient to contain only one value per file, so it is socially acceptable to express an array of values of the same type.
Mixing types, expressing multiple lines of data, and doing fancy formatting of data is heavily frowned upon."
Even though this is an older file, I think these restrictions still apply.
I found some examples of using formatted string under /sys, mostly under tracing. maybe we can do a similar implementation.
#cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format
Note that this is *not* sysfs (anything under debug/ follows different rules anyway!)
name: kvm_mmio
ID: 32
format:
        field:unsigned short common_type; offset:0; size:2; signed:0;
        field:unsigned char common_flags; offset:2; size:1; signed:0;
        field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
        field:int common_pid; offset:4; size:4; signed:1;

        field:u32 type; offset:8; size:4; signed:0;
        field:u32 len; offset:12; size:4; signed:0;
        field:u64 gpa; offset:16; size:8; signed:0;
        field:u64 val; offset:24; size:8; signed:0;

print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read" }, { 2, "write" }), REC->len, REC->gpa, REC->val
#cat /sys/devices/pci0000:00/0000:00:02.0/uevent
'uevent' can probably be considered a special case, I would not really want to copy it.
DRIVER=vfio-pci
PCI_CLASS=30000
PCI_ID=8086:591D
PCI_SUBSYS_ID=8086:2212
PCI_SLOT_NAME=0000:00:02.0
MODALIAS=pci:v00008086d0000591Dsv00008086sd00002212bc03sc00i00
(...)
what about a migration_compatible attribute under device node like below?
#cat /sys/bus/pci/devices/0000:00:02.0/UUID1/migration_compatible
SELF:
        device_type=pci
        device_id=8086591d
        mdev_type=i915-GVTg_V5_2
        aggregator=1
        pv_mode="none+ppgtt+context"
        interface_version=3
COMPATIBLE:
        device_type=pci
        device_id=8086591d
        mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
        aggregator={val1}/2
        pv_mode={val2:string:"none+ppgtt","none+context","none+ppgtt+context"}
        interface_version={val3:int:2,3}
COMPATIBLE:
        device_type=pci
        device_id=8086591d
        mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
        aggregator={val1}/2
        pv_mode="" #"" meaning empty, could be absent in a compatible device
        interface_version=1
I'd consider anything of a comparable complexity to be a big no-no. If anything, this needs to be split into individual files (with many of them being vendor driver specific anyway.)
I think we can list compatible versions in a range/list format, though. Something like
cat interface_version
2.1.3

cat interface_version_compatible
2.0.2-2.0.4,2.1.0-
(indicating that versions 2.0.{2,3,4} and all versions after 2.1.0 are compatible, considering versions <2 and >2 incompatible by default)
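A minimal Python sketch of how user space could evaluate such a range/list string is shown here. It is only one possible reading of the format above: the function names are made up, ranges are taken as inclusive, an entry without a dash is treated as an exact match, and an open-ended range such as "2.1.0-" is assumed to stop at the end of the same major version, following the "versions <2 and >2 incompatible by default" remark.

```python
def parse_version(text):
    """Turn '2.1.3' into (2, 1, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def version_in_list(version, compatible):
    """Check a version against a list like '2.0.2-2.0.4,2.1.0-'."""
    v = parse_version(version)
    for entry in compatible.split(","):
        low, dash, high = entry.partition("-")
        lo = parse_version(low)
        if not dash:                      # single version: exact match
            if v == lo:
                return True
        elif high:                        # closed range, both ends inclusive
            if lo <= v <= parse_version(high):
                return True
        elif v >= lo and v[0] == lo[0]:   # open range, capped at same major
            return True
    return False


compatible = "2.0.2-2.0.4,2.1.0-"
for candidate in ("2.0.1", "2.0.3", "2.0.5", "2.1.3", "3.0.0"):
    print(candidate, version_in_list(candidate, compatible))
```

Keeping the attribute to a flat list of ranges of one type is what lets it stay within the sysfs guidance quoted earlier.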
Possible compatibility between different mdev types feels a bit odd to me, and should not be included by default (only if it makes sense for a particular vendor driver.)
On 2020/8/5 12:35 AM, Cornelia Huck wrote:
I don't think encoding complex information in a sysfs file is a viable approach. Quoting Documentation/filesystems/sysfs.rst:
"Attributes should be ASCII text files, preferably with only one value per file. It is noted that it may not be efficient to contain only one value per file, so it is socially acceptable to express an array of values of the same type.
Mixing types, expressing multiple lines of data, and doing fancy formatting of data is heavily frowned upon."
Even though this is an older file, I think these restrictions still apply.
+1, that's another reason why devlink(netlink) is better.
Thanks
On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote:
+1, that's another reason why devlink(netlink) is better.
hi Jason, do you have any materials or sample code about devlink, so we can have a good study of it? I found some kernel docs about it but my preliminary study didn't show me the advantage of devlink.
Thanks Yan
On 2020/8/5 10:16 AM, Yan Zhao wrote:
hi Jason, do you have any materials or sample code about devlink, so we can have a good study of it? I found some kernel docs about it but my preliminary study didn't show me the advantage of devlink.
CC Jiri and Parav for a better answer for this.
My understanding is that the following advantages are obvious (as I replied in another thread):
- existing users (NIC, crypto, SCSI, ib), mature and stable
- much better error reporting (ext_ack other than string or errno)
- namespace aware
- do not couple with kobject
Thanks
Thanks Yan
Wed, Aug 05, 2020 at 04:41:54AM CEST, jasowang@redhat.com wrote:
CC Jiri and Parav for a better answer for this.
My understanding is that the following advantages are obvious (as I replied in another thread):
- existing users (NIC, crypto, SCSI, ib), mature and stable
- much better error reporting (ext_ack other than string or errno)
- namespace aware
- do not couple with kobject
Jason, what is your use case?
Thanks
Thanks Yan
On 2020/8/5 3:56 PM, Jiri Pirko wrote:
Jason, what is your use case?
I think the use case is to report device compatibility for live migration. Yan first proposed a simple sysfs-based migration version, but it does not look sufficient, and something based on JSON is being discussed.
Yan, can you help to summarize the discussion so far for Jiri as a reference?
Thanks
Thanks
Thanks Yan
On Wed, Aug 05, 2020 at 04:02:48PM +0800, Jason Wang wrote:
I think the use case is to report device compatibility for live migration. Yan proposed a simple sysfs based migration version first, but it looks not sufficient and something based on JSON is discussed.
Yan, can you help to summarize the discussion so far for Jiri as a reference?
yes. we are currently defining a device live migration compatibility interface in order to let user space like openstack and libvirt know which two devices are live migration compatible. currently the devices include mdevs (kernel emulated virtual devices) and physical devices (e.g. a VF of a PCI SRIOV device).
the attributes we want user space to compare include:

common attributes:
  device_api: vfio-pci, vfio-ccw...
  mdev_type: mdev type of an mdev, or a similar signature for a physical device. It specifies a device's hardware capability, e.g. i915-GVTg_V5_4 means it's 1/4 of a gen9 Intel graphics device.
  software_version: the device driver's version, in a <major>.<minor>[.bugfix] scheme, where there is no compatibility across major versions, minor versions have forward compatibility (ex. 1 -> 2 is ok, 2 -> 1 is not), and the bugfix version number indicates some degree of internal improvement that is not visible to the user in terms of features or compatibility.

vendor specific attributes (each vendor may define different attributes):
  device id: device id of a physical device or of an mdev's parent pci device. it could be equal to the pci id for pci devices.
  aggregator: used together with mdev_type, e.g. aggregator=2 together with i915-GVTg_V5_4 means 2*1/4=1/2 of a gen9 Intel graphics device.
  remote_url: for a local NVMe VF, it may be configured with the url of a remote storage, and all data is stored on the remote side specified by that url.
  ...
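The software_version rule described above can be sketched in a few lines of Python. This is one reading of the rule rather than a defined interface: the function name is hypothetical, and "1 -> 2 is ok, 2 -> 1 is not" is interpreted here as "the target's minor version must be at least the source's minor version".

```python
def software_version_compatible(source, target):
    """Apply the <major>.<minor>[.bugfix] rule: same major version,
    target minor >= source minor, bugfix number ignored."""
    src_major, src_minor = (int(p) for p in source.split(".")[:2])
    dst_major, dst_minor = (int(p) for p in target.split(".")[:2])
    return src_major == dst_major and src_minor <= dst_minor


print(software_version_compatible("1.1.0", "1.2.0"))  # minor 1 -> 2
print(software_version_compatible("1.2.0", "1.1.0"))  # minor 2 -> 1
print(software_version_compatible("1.2.3", "2.2.3"))  # different major
```

Because the bugfix field is optional in the scheme, the sketch only reads the first two components.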
Comparing those attributes in user space alone is not an easy job, as it can't simply assume an equality relationship between source attributes and target attributes. e.g. a source device of mdev_type=i915-GVTg_V5_4,aggregator=2 (1/2 of gen9) could actually find a compatible device of mdev_type=i915-GVTg_V5_8,aggregator=4 (also 1/2 of gen9), if mdev_type i915-GVTg_V5_4 is not available on the target machine.
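The arithmetic behind that example can be written out as a short Python sketch. It assumes, as described above, that the numeric suffix N of an i915-GVTg_V5_N type means 1/N of the device and that aggregator multiplies that share; the function name is made up for illustration.

```python
import re
from fractions import Fraction


def gpu_share(mdev_type, aggregator):
    """Share of the physical GPU claimed by (mdev_type, aggregator),
    reading the type suffix N as 1/N of the device."""
    match = re.fullmatch(r"i915-GVTg_V5_(\d+)", mdev_type)
    if match is None:
        raise ValueError(f"unexpected mdev_type: {mdev_type}")
    return Fraction(aggregator, int(match.group(1)))


source = gpu_share("i915-GVTg_V5_4", 2)
target = gpu_share("i915-GVTg_V5_8", 4)
print(source, target, source == target)
```

Both combinations reduce to the same fraction of the device, which is why a plain string comparison of mdev_type would miss the match.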
So, in our current proposal, we want to create two sysfs attributes under a device sysfs node.
/sys/<path to device>/migration/self
/sys/<path to device>/migration/compatible

#cat /sys/<path to device>/migration/self
device_type=vfio_pci
mdev_type=i915-GVTg_V5_4
device_id=8086591d
aggregator=2
software_version=1.0.0

#cat /sys/<path to device>/migration/compatible
device_type=vfio_pci
mdev_type=i915-GVTg_V5_{val1:int:2,4,8}
device_id=8086591d
aggregator={val1}/2
software_version=1.0.0
The /sys/<path to device>/migration/self attribute specifies the device's own attributes. The /sys/<path to device>/migration/compatible attribute specifies the list of devices compatible with it. In the example, compatible devices must have
  device_type == vfio_pci &&
  device_id == 8086591d &&
  software_version == 1.0.0 &&
  (
    (mdev_type of i915-GVTg_V5_2 && aggregator==1) ||
    (mdev_type of i915-GVTg_V5_4 && aggregator==2) ||
    (mdev_type of i915-GVTg_V5_8 && aggregator==4)
  )
by checking whether a target device is in the compatible list of the source device, user space can know whether the two devices are live migration compatible.
Additional notes: 1) software_version in the compatible list may not be necessary, as it already has a major.minor.bugfix scheme. 2) a vendor attribute like remote_url may not be statically assigned and could be changed through a device interface.
So, as Cornelia pointed out that it's not good to use a complex format in a sysfs attribute, we'd like to know whether there are other good ways to address our use case, e.g. splitting a single attribute into multiple simple sysfs attributes as Cornelia suggested, or devlink, which Jason has strongly recommended.
Thanks Yan
Wed, Aug 05, 2020 at 11:33:38AM CEST, yan.y.zhao@intel.com wrote:
So, as Cornelia pointed out that it's not good to use a complex format in a sysfs attribute, we'd like to know whether there are other good ways to address our use case, e.g. splitting a single attribute into multiple simple sysfs attributes as Cornelia suggested, or devlink, which Jason has strongly recommended.
Hi Yan.
Thanks for the explanation, though I'm still fuzzy about the details. Anyway, I suggest you check the "devlink dev info" command we have implemented for multiple drivers. You can try netdevsim to test this. I think that the info you need to expose might be put there.
Devlink creates instance per-device. Specific device driver calls into devlink core to create the instance. What device do you have? What driver is it handled by?
Thanks Yan
On Wed, 2020-08-05 at 12:53 +0200, Jiri Pirko wrote:
Wed, Aug 05, 2020 at 11:33:38AM CEST, yan.y.zhao@intel.com wrote:
On Wed, Aug 05, 2020 at 04:02:48PM +0800, Jason Wang wrote:
On 2020/8/5 下午3:56, Jiri Pirko wrote:
Wed, Aug 05, 2020 at 04:41:54AM CEST, jasowang@redhat.com wrote:
On 2020/8/5 上午10:16, Yan Zhao wrote:
On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote: > On 2020/8/5 上午12:35, Cornelia Huck wrote: > > [sorry about not chiming in earlier] > > > > On Wed, 29 Jul 2020 16:05:03 +0800 > > Yan Zhao yan.y.zhao@intel.com wrote: > > > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote: > > > > (...) > > > > > > Based on the feedback we've received, the previously proposed interface > > > > is not viable. I think there's agreement that the user needs to be > > > > able to parse and interpret the version information. Using json seems > > > > viable, but I don't know if it's the best option. Is there any > > > > precedent of markup strings returned via sysfs we could follow? > > > > I don't think encoding complex information in a sysfs file is a viable > > approach. Quoting Documentation/filesystems/sysfs.rst: > > > > "Attributes should be ASCII text files, preferably with only one value > > per file. It is noted that it may not be efficient to contain only one > > value per file, so it is socially acceptable to express an array of > > values of the same type. > > Mixing types, expressing multiple lines of data, and doing fancy > > formatting of data is heavily frowned upon." > > > > Even though this is an older file, I think these restrictions still > > apply. > > +1, that's another reason why devlink(netlink) is better. >
hi Jason, do you have any materials or sample code about devlink, so we can have a good study of it? I found some kernel docs about it but my preliminary study didn't show me the advantage of devlink.
CC Jiri and Parav for a better answer for this.
My understanding is that the following advantages are obvious (as I replied in another thread):
- existing users (NIC, crypto, SCSI, ib), mature and stable
- much better error reporting (ext_ack other than string or errno)
- namespace aware
- do not couple with kobject
Jason, what is your use case?
I think the use case is to report device compatibility for live migration. Yan first proposed a simple sysfs based migration version, but it does not look sufficient, and something based on JSON is being discussed.
Yan, can you help to summarize the discussion so far for Jiri as a reference?
yes. we are currently defining a device live migration compatibility interface in order to let user space like openstack and libvirt know which two devices are live migration compatible. currently the devices include mdev (a kernel emulated virtual device) and physical devices (e.g. a VF of a PCI SRIOV device).

the attributes we want user space to compare include

common attributes:
  device_api: vfio-pci, vfio-ccw...
  mdev_type: mdev type of an mdev, or a similar signature for a physical device. It specifies a device's hardware capability. e.g. i915-GVTg_V5_4 means it's 1/4 of a gen9 Intel graphics device.
by the way, this naming scheme works the opposite of how i would have expected. i would have expected i915-GVTg_V5 to be the same as i915-GVTg_V5_1, and i915-GVTg_V5_4 to use 4 times the amount of resource as i915-GVTg_V5_1, not one quarter.

i would much rather see i915-GVTg_V5_4 expressed as aggregator:i915-GVTg_V5=4, i.e. that it is 4 of the basic i915-GVTg_V5 type. the inversion of the relationship makes this much harder to reason about IMO.

if i915-GVTg_V5_8 and i915-GVTg_V5_4 are both actually claiming the same resource and both can be used at the same time, then with your suggested naming scheme i have to find the mdev type with the largest value, store that, and then do math by dividing it by the suffix of the requested type every time i want to claim the resource in our placement inventories.

if we represent it the way i suggest, we don't: if it is i915-GVTg_V5_8 i know it's using 8 of i915-GVTg_V5. it makes it significantly simpler.
  software_version: device driver's version, in <major>.<minor>[.bugfix] scheme, where there is no compatibility across major versions, minor versions have forward compatibility (e.g. 1 -> 2 is ok, 2 -> 1 is not), and the bugfix version number indicates some degree of internal improvement that is not visible to the user in terms of features or compatibility.

vendor specific attributes: each vendor may define different attributes
  device id: device id of a physical device or of an mdev's parent pci device. it could be equal to the pci id for pci devices.
  aggregator: used together with mdev_type. e.g. aggregator=2 together with i915-GVTg_V5_4 means 2*1/4=1/2 of a gen9 Intel graphics device.
  remote_url: for a local NVMe VF, it may be configured with a remote url of a remote storage, and all data is stored on the remote side specified by the remote url.
  ...
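The software_version rule described above can be sketched in a few lines of Python. This is only an illustration of the stated convention (no compatibility across major versions, minor versions forward compatible, bugfix ignored); the function names are hypothetical and not part of any proposed interface.

```python
def parse_version(text):
    """Split 'major.minor[.bugfix]' into ints; a missing bugfix counts as 0."""
    parts = [int(p) for p in text.split(".")]
    if len(parts) not in (2, 3):
        raise ValueError(f"expected major.minor[.bugfix], got {text!r}")
    bugfix = parts[2] if len(parts) == 3 else 0
    return parts[0], parts[1], bugfix


def version_compatible(src, dst):
    """True if a device at version src may migrate to a device at version dst."""
    s_major, s_minor, _ = parse_version(src)
    d_major, d_minor, _ = parse_version(dst)
    # No compatibility across major versions.
    if s_major != d_major:
        return False
    # Minor versions are forward compatible: 1 -> 2 is ok, 2 -> 1 is not.
    # The bugfix number is deliberately ignored.
    return s_minor <= d_minor
```

With this reading, 1.1.0 -> 1.2.0 is accepted, while 1.2.0 -> 1.1.0 and 1.0.0 -> 2.0.0 are rejected.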
just a minor note that i find ^ much simpler to understand than the current proposal with self and compatible. if i have well defined attributes that i can parse and understand, and that allow me to calculate what is and is not compatible, that is likely going to be more useful, as you won't have to keep maintaining a list of other compatible devices every time a new sku is released.

in any case, thanks for actually sharing ^ as it makes it simpler to reason about what you have previously proposed.
Comparing those attributes by user space alone is not an easy job, as it can't simply assume an equal relationship between source attributes and target attributes. e.g. for a source device of mdev_type=i915-GVTg_V5_4,aggregator=2, (1/2 of gen9), it actually could find a compatible device of mdev_type=i915-GVTg_V5_8,aggregator=4 (also 1/2 of gen9), if mdev_type of i915-GVTg_V5_4 is not available in the target machine.
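To make the arithmetic in this example concrete, here is a small Python sketch. It assumes, as in the i915-GVTg examples, that the numeric suffix of the mdev type is the divisor of the full device; the helper name gpu_share is made up for illustration.

```python
from fractions import Fraction


def gpu_share(mdev_type, aggregator):
    """Fraction of the physical GPU used by `aggregator` instances of mdev_type.

    Assumes the type name ends in '_<N>', meaning 1/N of the device.
    """
    divisor = int(mdev_type.rsplit("_", 1)[1])
    return Fraction(aggregator, divisor)


# Both configurations from the example come out as 1/2 of a gen9 device.
src = gpu_share("i915-GVTg_V5_4", 2)
dst = gpu_share("i915-GVTg_V5_8", 4)
same_share = (src == dst)
```

This also shows why a plain string comparison of mdev_type and aggregator is not enough: the two attribute pairs differ, yet describe the same share of the hardware.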
So, in our current proposal, we want to create two sysfs attributes under a device sysfs node.
  /sys/<path to device>/migration/self
  /sys/<path to device>/migration/compatible

#cat /sys/<path to device>/migration/self
device_type=vfio_pci
mdev_type=i915-GVTg_V5_4
device_id=8086591d
aggregator=2
software_version=1.0.0

#cat /sys/<path to device>/migration/compatible
device_type=vfio_pci
mdev_type=i915-GVTg_V5_{val1:int:2,4,8}
device_id=8086591d
aggregator={val1}/2
software_version=1.0.0

The /sys/<path to device>/migration/self specifies self attributes of a device. The /sys/<path to device>/migration/compatible specifies the list of compatible devices of a device. as in the example, compatible devices could have
  device_type == vfio_pci &&
  device_id == 8086591d &&
  software_version == 1.0.0 &&
  (
    (mdev_type of i915-GVTg_V5_2 && aggregator==1) ||
    (mdev_type of i915-GVTg_V5_4 && aggregator==2) ||
    (mdev_type of i915-GVTg_V5_8 && aggregator==4)
  )

by checking whether a target device is in the compatible list of the source device, user space can know whether two devices are live migration compatible.
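As a rough illustration of how user space might evaluate such a compatible list, the Python sketch below expands the example template into concrete attribute sets and checks a target against them. The grammar handled here ({name:int:a,b,c} declarations and {name}/N expressions with integer division, one variable per template) is only inferred from the single example above; it is an assumption, not a specification.

```python
import re

# "{val1:int:2,4,8}" declares a variable and its allowed integer values.
DECL = re.compile(r"\{(\w+):int:([\d,]+)\}")
# "{val1}/2" refers to a declared variable divided by a constant.
EXPR = re.compile(r"^\{(\w+)\}/(\d+)$")


def expand_compatible(template):
    """Expand a template dict (attr -> pattern) into concrete attribute dicts."""
    var, values = None, [None]
    for pattern in template.values():
        found = DECL.search(pattern)
        if found:
            var = found.group(1)
            values = [int(v) for v in found.group(2).split(",")]
            break
    expanded = []
    for value in values:
        concrete = {}
        for attr, pattern in template.items():
            expr = EXPR.match(pattern)
            if expr and expr.group(1) == var:
                # Assumed semantics: integer division of the variable.
                concrete[attr] = str(value // int(expr.group(2)))
            else:
                concrete[attr] = DECL.sub(str(value), pattern)
        expanded.append(concrete)
    return expanded


def is_compatible(target_self, source_compatible):
    """True if the target's self attributes appear in the expanded list."""
    return target_self in expand_compatible(source_compatible)
```

Expanding the example yields exactly the three (mdev_type, aggregator) pairs listed above, so a target reporting i915-GVTg_V5_8 with aggregator=4 would match, while the same type with aggregator=2 would not.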
Additional notes:
1) software_version in the compatible list may not be necessary as it already has a major.minor.bugfix scheme.
2) a vendor attribute like remote_url may not be statically assigned and could be changed through a device interface.

So, as Cornelia pointed out that it's not good to use a complex format in a sysfs attribute, we'd like to know whether there are other good ways to address our use case, e.g. splitting a single attribute into multiple simple sysfs attributes as Cornelia suggested, or devlink, which Jason has strongly recommended.
Hi Yan.
Thanks for the explanation, I'm still fuzzy about the details. Anyway, I suggest you to check "devlink dev info" command we have implemented for multiple drivers.
is devlink exposed as a filesystem we can read with just open? openstack will likely try to leverage libvirt to get this info, but when we can't, it's much simpler to read sysfs than it is to take a dependency on a command line tool and have to fork a shell to execute it and parse the cli output. pyroute2, which we use in some openstack projects, has basic python bindings for devlink, but i'm not sure how complete they are as i think it's a relatively new addition. if we need to take a dependency we will, but that would be a drawback of devlink. not that it is a large one, just something to keep in mind.
You can try netdevsim to test this. I think that the info you need to expose might be put there.
Devlink creates instance per-device. Specific device driver calls into devlink core to create the instance. What device do you have? What driver is it handled by?
Thanks Yan
On Wed, 05 Aug 2020 12:35:01 +0100 Sean Mooney smooney@redhat.com wrote:
On Wed, 2020-08-05 at 12:53 +0200, Jiri Pirko wrote:
Wed, Aug 05, 2020 at 11:33:38AM CEST, yan.y.zhao@intel.com wrote:
(...)
  software_version: device driver's version, in <major>.<minor>[.bugfix] scheme, where there is no compatibility across major versions, minor versions have forward compatibility (e.g. 1 -> 2 is ok, 2 -> 1 is not), and the bugfix version number indicates some degree of internal improvement that is not visible to the user in terms of features or compatibility.

vendor specific attributes: each vendor may define different attributes
  device id: device id of a physical device or of an mdev's parent pci device. it could be equal to the pci id for pci devices.
  aggregator: used together with mdev_type. e.g. aggregator=2 together with i915-GVTg_V5_4 means 2*1/4=1/2 of a gen9 Intel graphics device.
  remote_url: for a local NVMe VF, it may be configured with a remote url of a remote storage, and all data is stored on the remote side specified by the remote url.
  ...

just a minor note that i find ^ much simpler to understand than the current proposal with self and compatible. if i have well defined attributes that i can parse and understand, and that allow me to calculate what is and is not compatible, that is likely going to be more useful, as you won't have to keep maintaining a list of other compatible devices every time a new sku is released.

in any case, thanks for actually sharing ^ as it makes it simpler to reason about what you have previously proposed.
So, what would be the most helpful format? A 'software_version' field that follows the conventions outlined above, and other (possibly optional) fields that have to match?
(...)
Thanks for the explanation, I'm still fuzzy about the details. Anyway, I suggest you to check "devlink dev info" command we have implemented for multiple drivers.
is devlink exposed as a filesystem we can read with just open? openstack will likely try to leverage libvirt to get this info, but when we can't, it's much simpler to read sysfs than it is to take a dependency on a command line tool and have to fork a shell to execute it and parse the cli output. pyroute2, which we use in some openstack projects, has basic python bindings for devlink, but i'm not sure how complete they are as i think it's a relatively new addition. if we need to take a dependency we will, but that would be a drawback of devlink. not that it is a large one, just something to keep in mind.
A devlinkfs, maybe? At least for reading information (IIUC, "devlink dev info" is only about information retrieval, right?)
On Fri, 7 Aug 2020 13:59:42 +0200 Cornelia Huck cohuck@redhat.com wrote:
On Wed, 05 Aug 2020 12:35:01 +0100 Sean Mooney smooney@redhat.com wrote:
On Wed, 2020-08-05 at 12:53 +0200, Jiri Pirko wrote:
Wed, Aug 05, 2020 at 11:33:38AM CEST, yan.y.zhao@intel.com wrote:
(...)
  software_version: device driver's version, in <major>.<minor>[.bugfix] scheme, where there is no compatibility across major versions, minor versions have forward compatibility (e.g. 1 -> 2 is ok, 2 -> 1 is not), and the bugfix version number indicates some degree of internal improvement that is not visible to the user in terms of features or compatibility.

vendor specific attributes: each vendor may define different attributes
  device id: device id of a physical device or of an mdev's parent pci device. it could be equal to the pci id for pci devices.
  aggregator: used together with mdev_type. e.g. aggregator=2 together with i915-GVTg_V5_4 means 2*1/4=1/2 of a gen9 Intel graphics device.
  remote_url: for a local NVMe VF, it may be configured with a remote url of a remote storage, and all data is stored on the remote side specified by the remote url.
  ...

just a minor note that i find ^ much simpler to understand than the current proposal with self and compatible. if i have well defined attributes that i can parse and understand, and that allow me to calculate what is and is not compatible, that is likely going to be more useful, as you won't have to keep maintaining a list of other compatible devices every time a new sku is released.

in any case, thanks for actually sharing ^ as it makes it simpler to reason about what you have previously proposed.
So, what would be the most helpful format? A 'software_version' field that follows the conventions outlined above, and other (possibly optional) fields that have to match?
Just to get a different perspective, I've been trying to come up with what would be useful for a very different kind of device, namely vfio-ccw. (Adding Eric to cc: for that.)
software_version makes sense for everybody, so it should be a standard attribute.
For the vfio-ccw type, we have only one vendor driver (vfio-ccw_IO).
Given a subchannel A, we want to make sure that subchannel B has a reasonable chance of being compatible. I guess that means:
- same subchannel type (I/O)
- same chpid type (e.g. all FICON; I assume there are no 'mixed' setups -- Eric?)
- same number of chpids? Maybe we can live without that and just inject some machine checks, I don't know. Same chpid numbers is something we cannot guarantee, especially if we want to migrate cross-CEC (to another machine.)
Other possibly interesting information is not available at the subchannel level (vfio-ccw is a subchannel driver.)
So, looking at a concrete subchannel on one of my machines, it would look something like the following:
<common>
software_version=1.0.0
type=vfio-ccw          <-- would be vfio-pci on the example above
<vfio-ccw specific>
subchannel_type=0
<vfio-ccw_IO specific>
chpid_type=0x1a
chpid_mask=0xf0        <-- not sure if needed/wanted
Does that make sense?
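A minimal Python sketch of how such a vfio-ccw attribute set might be compared between two subchannels is shown below. The choice of must-match keys follows the list above (type, subchannel type, chpid type); treating chpid_mask as informational only is an assumption, and software_version is left to the separate version rule. All names are hypothetical.

```python
# Attributes that have to be identical on source and target.
MUST_MATCH = ("type", "subchannel_type", "chpid_type")


def normalize(value):
    """Compare numeric attributes by value, so '0x1a' equals '26'."""
    try:
        return int(value, 0)
    except ValueError:
        return value


def ccw_candidates_match(src, dst):
    """True if dst has a reasonable chance of being compatible with src."""
    for key in MUST_MATCH:
        if key not in src or key not in dst:
            return False
        if normalize(src[key]) != normalize(dst[key]):
            return False
    # chpid_mask is intentionally not compared: path configurations
    # are not expected to be consistent between systems.
    return True
```

Two subchannels that differ only in chpid_mask would be reported as candidates here, while a different chpid_type would rule the target out.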
On 8/13/20 11:33 AM, Cornelia Huck wrote:
On Fri, 7 Aug 2020 13:59:42 +0200 Cornelia Huck cohuck@redhat.com wrote:
On Wed, 05 Aug 2020 12:35:01 +0100 Sean Mooney smooney@redhat.com wrote:
On Wed, 2020-08-05 at 12:53 +0200, Jiri Pirko wrote:
Wed, Aug 05, 2020 at 11:33:38AM CEST, yan.y.zhao@intel.com wrote:
(...)
  software_version: device driver's version, in <major>.<minor>[.bugfix] scheme, where there is no compatibility across major versions, minor versions have forward compatibility (e.g. 1 -> 2 is ok, 2 -> 1 is not), and the bugfix version number indicates some degree of internal improvement that is not visible to the user in terms of features or compatibility.

vendor specific attributes: each vendor may define different attributes
  device id: device id of a physical device or of an mdev's parent pci device. it could be equal to the pci id for pci devices.
  aggregator: used together with mdev_type. e.g. aggregator=2 together with i915-GVTg_V5_4 means 2*1/4=1/2 of a gen9 Intel graphics device.
  remote_url: for a local NVMe VF, it may be configured with a remote url of a remote storage, and all data is stored on the remote side specified by the remote url.
  ...

just a minor note that i find ^ much simpler to understand than the current proposal with self and compatible. if i have well defined attributes that i can parse and understand, and that allow me to calculate what is and is not compatible, that is likely going to be more useful, as you won't have to keep maintaining a list of other compatible devices every time a new sku is released.

in any case, thanks for actually sharing ^ as it makes it simpler to reason about what you have previously proposed.
So, what would be the most helpful format? A 'software_version' field that follows the conventions outlined above, and other (possibly optional) fields that have to match?
Just to get a different perspective, I've been trying to come up with what would be useful for a very different kind of device, namely vfio-ccw. (Adding Eric to cc: for that.)
software_version makes sense for everybody, so it should be a standard attribute.
For the vfio-ccw type, we have only one vendor driver (vfio-ccw_IO).
Given a subchannel A, we want to make sure that subchannel B has a reasonable chance of being compatible. I guess that means:
- same subchannel type (I/O)
- same chpid type (e.g. all FICON; I assume there are no 'mixed' setups -- Eric?)
Correct.
- same number of chpids? Maybe we can live without that and just inject some machine checks, I don't know. Same chpid numbers is something we cannot guarantee, especially if we want to migrate cross-CEC (to another machine.)
I think we'd live without it, because I wouldn't expect it to be consistent between systems.
Other possibly interesting information is not available at the subchannel level (vfio-ccw is a subchannel driver.)
I presume you're alluding to the DASD uid (dasdinfo -x) here?
So, looking at a concrete subchannel on one of my machines, it would look something like the following:
<common>
software_version=1.0.0
type=vfio-ccw          <-- would be vfio-pci on the example above
<vfio-ccw specific>
subchannel_type=0
<vfio-ccw_IO specific>
chpid_type=0x1a
chpid_mask=0xf0        <-- not sure if needed/wanted
Does that make sense?
On Thu, 13 Aug 2020 15:02:53 -0400 Eric Farman farman@linux.ibm.com wrote:
On 8/13/20 11:33 AM, Cornelia Huck wrote:
On Fri, 7 Aug 2020 13:59:42 +0200 Cornelia Huck cohuck@redhat.com wrote:
On Wed, 05 Aug 2020 12:35:01 +0100 Sean Mooney smooney@redhat.com wrote:
On Wed, 2020-08-05 at 12:53 +0200, Jiri Pirko wrote:
Wed, Aug 05, 2020 at 11:33:38AM CEST, yan.y.zhao@intel.com wrote:
(...)
  software_version: device driver's version, in <major>.<minor>[.bugfix] scheme, where there is no compatibility across major versions, minor versions have forward compatibility (e.g. 1 -> 2 is ok, 2 -> 1 is not), and the bugfix version number indicates some degree of internal improvement that is not visible to the user in terms of features or compatibility.

vendor specific attributes: each vendor may define different attributes
  device id: device id of a physical device or of an mdev's parent pci device. it could be equal to the pci id for pci devices.
  aggregator: used together with mdev_type. e.g. aggregator=2 together with i915-GVTg_V5_4 means 2*1/4=1/2 of a gen9 Intel graphics device.
  remote_url: for a local NVMe VF, it may be configured with a remote url of a remote storage, and all data is stored on the remote side specified by the remote url.
  ...

just a minor note that i find ^ much simpler to understand than the current proposal with self and compatible. if i have well defined attributes that i can parse and understand, and that allow me to calculate what is and is not compatible, that is likely going to be more useful, as you won't have to keep maintaining a list of other compatible devices every time a new sku is released.

in any case, thanks for actually sharing ^ as it makes it simpler to reason about what you have previously proposed.
So, what would be the most helpful format? A 'software_version' field that follows the conventions outlined above, and other (possibly optional) fields that have to match?
Just to get a different perspective, I've been trying to come up with what would be useful for a very different kind of device, namely vfio-ccw. (Adding Eric to cc: for that.)
software_version makes sense for everybody, so it should be a standard attribute.
For the vfio-ccw type, we have only one vendor driver (vfio-ccw_IO).
Given a subchannel A, we want to make sure that subchannel B has a reasonable chance of being compatible. I guess that means:
- same subchannel type (I/O)
- same chpid type (e.g. all FICON; I assume there are no 'mixed' setups -- Eric?)
Correct.
- same number of chpids? Maybe we can live without that and just inject some machine checks, I don't know. Same chpid numbers is something we cannot guarantee, especially if we want to migrate cross-CEC (to another machine.)
I think we'd live without it, because I wouldn't expect it to be consistent between systems.
Yes, and the guest needs to be able to deal with changing path configurations anyway.
Other possibly interesting information is not available at the subchannel level (vfio-ccw is a subchannel driver.)
I presume you're alluding to the DASD uid (dasdinfo -x) here?
Yes, or the even more basic Sense ID information.
So, looking at a concrete subchannel on one of my machines, it would look something like the following:
<common>
software_version=1.0.0
type=vfio-ccw          <-- would be vfio-pci on the example above
<vfio-ccw specific>
subchannel_type=0
<vfio-ccw_IO specific>
chpid_type=0x1a
chpid_mask=0xf0        <-- not sure if needed/wanted
Let's just drop the chpid_mask here.
Does that make sense?
Would be interesting if someone could come up with some possible information for a third type of device.
On Wed, Aug 05, 2020 at 12:53:19PM +0200, Jiri Pirko wrote:
Wed, Aug 05, 2020 at 11:33:38AM CEST, yan.y.zhao@intel.com wrote:
On Wed, Aug 05, 2020 at 04:02:48PM +0800, Jason Wang wrote:
On 2020/8/5 3:56 PM, Jiri Pirko wrote:
Wed, Aug 05, 2020 at 04:41:54AM CEST, jasowang@redhat.com wrote:
On 2020/8/5 10:16 AM, Yan Zhao wrote:
On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote:
> On 2020/8/5 12:35 AM, Cornelia Huck wrote:
> > [sorry about not chiming in earlier]
> >
> > On Wed, 29 Jul 2020 16:05:03 +0800
> > Yan Zhao yan.y.zhao@intel.com wrote:
> >
> > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
> > > > (...)
> > > >
> > > > Based on the feedback we've received, the previously proposed interface
> > > > is not viable. I think there's agreement that the user needs to be
> > > > able to parse and interpret the version information. Using json seems
> > > > viable, but I don't know if it's the best option. Is there any
> > > > precedent of markup strings returned via sysfs we could follow?
> >
> > I don't think encoding complex information in a sysfs file is a viable
> > approach. Quoting Documentation/filesystems/sysfs.rst:
> >
> > "Attributes should be ASCII text files, preferably with only one value
> > per file. It is noted that it may not be efficient to contain only one
> > value per file, so it is socially acceptable to express an array of
> > values of the same type.
> > Mixing types, expressing multiple lines of data, and doing fancy
> > formatting of data is heavily frowned upon."
> >
> > Even though this is an older file, I think these restrictions still
> > apply.
>
> +1, that's another reason why devlink(netlink) is better.

hi Jason, do you have any materials or sample code about devlink, so we can have a good study of it? I found some kernel docs about it but my preliminary study didn't show me the advantage of devlink.
CC Jiri and Parav for a better answer for this.
My understanding is that the following advantages are obvious (as I replied in another thread):
- existing users (NIC, crypto, SCSI, ib), mature and stable
- much better error reporting (ext_ack other than string or errno)
- namespace aware
- do not couple with kobject
Jason, what is your use case?
I think the use case is to report device compatibility for live migration. Yan first proposed a simple sysfs based migration version, but it does not look sufficient, and something based on JSON is being discussed.
Yan, can you help to summarize the discussion so far for Jiri as a reference?
yes. we are currently defining a device live migration compatibility interface in order to let user space like openstack and libvirt know which two devices are live migration compatible. currently the devices include mdev (a kernel emulated virtual device) and physical devices (e.g. a VF of a PCI SRIOV device).

the attributes we want user space to compare include

common attributes:
  device_api: vfio-pci, vfio-ccw...
  mdev_type: mdev type of an mdev, or a similar signature for a physical device. It specifies a device's hardware capability. e.g. i915-GVTg_V5_4 means it's 1/4 of a gen9 Intel graphics device.
  software_version: device driver's version, in <major>.<minor>[.bugfix] scheme, where there is no compatibility across major versions, minor versions have forward compatibility (e.g. 1 -> 2 is ok, 2 -> 1 is not), and the bugfix version number indicates some degree of internal improvement that is not visible to the user in terms of features or compatibility.

vendor specific attributes: each vendor may define different attributes
  device id: device id of a physical device or of an mdev's parent pci device. it could be equal to the pci id for pci devices.
  aggregator: used together with mdev_type. e.g. aggregator=2 together with i915-GVTg_V5_4 means 2*1/4=1/2 of a gen9 Intel graphics device.
  remote_url: for a local NVMe VF, it may be configured with a remote url of a remote storage, and all data is stored on the remote side specified by the remote url.
  ...
Comparing those attributes by user space alone is not an easy job, as it can't simply assume an equal relationship between source attributes and target attributes. e.g. for a source device of mdev_type=i915-GVTg_V5_4,aggregator=2, (1/2 of gen9), it actually could find a compatible device of mdev_type=i915-GVTg_V5_8,aggregator=4 (also 1/2 of gen9), if mdev_type of i915-GVTg_V5_4 is not available in the target machine.
So, in our current proposal, we want to create two sysfs attributes under a device sysfs node.
  /sys/<path to device>/migration/self
  /sys/<path to device>/migration/compatible

#cat /sys/<path to device>/migration/self
device_type=vfio_pci
mdev_type=i915-GVTg_V5_4
device_id=8086591d
aggregator=2
software_version=1.0.0

#cat /sys/<path to device>/migration/compatible
device_type=vfio_pci
mdev_type=i915-GVTg_V5_{val1:int:2,4,8}
device_id=8086591d
aggregator={val1}/2
software_version=1.0.0

The /sys/<path to device>/migration/self specifies self attributes of a device. The /sys/<path to device>/migration/compatible specifies the list of compatible devices of a device. as in the example, compatible devices could have
  device_type == vfio_pci &&
  device_id == 8086591d &&
  software_version == 1.0.0 &&
  (
    (mdev_type of i915-GVTg_V5_2 && aggregator==1) ||
    (mdev_type of i915-GVTg_V5_4 && aggregator==2) ||
    (mdev_type of i915-GVTg_V5_8 && aggregator==4)
  )

by checking whether a target device is in the compatible list of the source device, user space can know whether two devices are live migration compatible.

Additional notes:
1) software_version in the compatible list may not be necessary as it already has a major.minor.bugfix scheme.
2) a vendor attribute like remote_url may not be statically assigned and could be changed through a device interface.

So, as Cornelia pointed out that it's not good to use a complex format in a sysfs attribute, we'd like to know whether there are other good ways to address our use case, e.g. splitting a single attribute into multiple simple sysfs attributes as Cornelia suggested, or devlink, which Jason has strongly recommended.
Hi Yan.
Hi Jiri,
Thanks for the explanation, I'm still fuzzy about the details. Anyway, I suggest you to check "devlink dev info" command we have implemented for multiple drivers. You can try netdevsim to test this. I think that the info you need to expose might be put there.
do you mean drivers/net/netdevsim/ ?
Devlink creates instance per-device. Specific device driver calls into devlink core to create the instance. What device do you have? What
the devlink core is net/core/devlink.c ?
driver is it handled by?
It looks like devlink is specific to network devices, and the header comment in devlink.h says "include/uapi/linux/devlink.h - Network physical device Netlink interface". I feel like it's not very appropriate for a GPU driver to use this interface. Is that right?
Thanks Yan
On 2020/8/10 3:46 PM, Yan Zhao wrote:
driver is it handled by?
It looks like devlink is specific to network devices, and the header comment in devlink.h says "include/uapi/linux/devlink.h - Network physical device Netlink interface".
Actually not. I think there was some discussion last year, and the conclusion was to remove this comment.
It supports IB and probably vDPA in the future.
I feel like it's not very appropriate for a GPU driver to use this interface. Is that right?
I think not though most of the users are switch or ethernet devices. It doesn't prevent you from inventing new abstractions.
Note that devlink is based on netlink, netlink has been widely used by various subsystems other than networking.
Thanks
Thanks Yan
On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
On 2020/8/10 3:46 PM, Yan Zhao wrote:
driver is it handled by?
It looks like devlink is specific to network devices, and the header comment in devlink.h says "include/uapi/linux/devlink.h - Network physical device Netlink interface".
Actually not. I think there was some discussion last year, and the conclusion was to remove this comment.
It supports IB and probably vDPA in the future.
hmm... sorry, I didn't find the discussion you referred to. I only found the discussions below about why devlink was added.

https://www.mail-archive.com/netdev@vger.kernel.org/msg95801.html
>This doesn't seem to be too much related to networking? Why can't something
>like this be in sysfs?
It is related to networking quite bit. There has been couple of iteration of this, including sysfs and configfs implementations. There has been a consensus reached that this should be done by netlink. I believe netlink is really the best for this purpose. Sysfs is not a good idea

https://www.mail-archive.com/netdev@vger.kernel.org/msg96102.html
>there is already a way to change eth/ib via
>echo 'eth' > /sys/bus/pci/drivers/mlx4_core/0000:02:00.0/mlx4_port1
>
>sounds like this is another way to achieve the same?
It is. However the current way is driver-specific, not correct. For mlx5, we need the same, it cannot be done in this way. Do devlink is the correct way to go.
https://lwn.net/Articles/674867/ There a is need for some userspace API that would allow to expose things that are not directly related to any device class like net_device of ib_device, but rather chip-wide/switch-ASIC-wide stuff.
Use cases:
1) get/set of port type (Ethernet/InfiniBand)
2) monitoring of hardware messages to and from chip
3) setting up port splitters - split port into multiple ones and squash again, enables usage of splitter cable
4) setting up shared buffers - shared among multiple ports within one chip
we can actually also retrieve the same information through sysfs, e.g.

|- [path to device]
   |--- migration
   |     |--- self
   |     |     |---device_api
   |     |     |---mdev_type
   |     |     |---software_version
   |     |     |---device_id
   |     |     |---aggregator
   |     |--- compatible
   |     |     |---device_api
   |     |     |---mdev_type
   |     |     |---software_version
   |     |     |---device_id
   |     |     |---aggregator
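With a layout like this, user space would only need plain file reads, one value per file. The Python sketch below shows the idea; the migration/self and migration/compatible directories follow the tree above, which is a proposal in this thread and not an existing kernel interface, and the function names are hypothetical.

```python
from pathlib import Path


def read_attrs(directory):
    """Read every regular file in a directory as a one-value attribute."""
    attrs = {}
    for entry in sorted(Path(directory).iterdir()):
        if entry.is_file():
            attrs[entry.name] = entry.read_text().strip()
    return attrs


def read_migration_info(device_path):
    """Collect the proposed 'self' and 'compatible' attribute groups."""
    base = Path(device_path) / "migration"
    return {
        "self": read_attrs(base / "self"),
        "compatible": read_attrs(base / "compatible"),
    }
```

No command line tool or netlink binding is needed for this, which is the practical advantage of the sysfs variant raised earlier in the thread.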
I feel like it's not very appropriate for a GPU driver to use this interface. Is that right?
I think not though most of the users are switch or ethernet devices. It doesn't prevent you from inventing new abstractions.
so we would need to patch the devlink core and the userspace devlink tool? e.g. devlink migration
Note that devlink is based on netlink, netlink has been widely used by various subsystems other than networking.
the advantage of netlink I see is that it can monitor device status and notify the upper layer that the migration database needs to be updated. But I'm not sure whether openstack would like to use this capability. As Sean said, it's heavy for openstack. it's heavy for the vendor driver as well :)

And devlink monitor currently listens for notifications and dumps the state changes. If we want to use it, we would need to let it forward the notifications and dumped info to openstack, right?
Thanks Yan
On Fri, 2020-08-14 at 13:16 +0800, Yan Zhao wrote:
On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
On 2020/8/10 3:46 PM, Yan Zhao wrote:
driver is it handled by?
It looks like devlink is specific to network devices, and the header comment in devlink.h says "include/uapi/linux/devlink.h - Network physical device Netlink interface".
Actually not. I think there was some discussion last year, and the conclusion was to remove this comment.
It supports IB and probably vDPA in the future.
hmm... sorry, I didn't find the discussion you referred to. I only found the discussions below about why devlink was added.
https://www.mail-archive.com/netdev@vger.kernel.org/msg95801.html
This doesn't seem to be too much related to networking? Why can't something like this be in sysfs?
It is related to networking quite bit. There has been couple of iteration of this, including sysfs and configfs implementations. There has been a consensus reached that this should be done by netlink. I believe netlink is really the best for this purpose. Sysfs is not a good idea
https://www.mail-archive.com/netdev@vger.kernel.org/msg96102.html
there is already a way to change eth/ib via echo 'eth' > /sys/bus/pci/drivers/mlx4_core/0000:02:00.0/mlx4_port1
sounds like this is another way to achieve the same?
It is. However the current way is driver-specific, not correct. For mlx5, we need the same, it cannot be done in this way. Do devlink is the correct way to go.
i'm not sure i agree with that. standardising a filesystem based api that is used across all vendors is also a valid option. that said, if devlink is the right choice from a kernel perspective, by all means use it, but i have not heard a convincing argument for why it is actually better. with that said, we have been using tools like ethtool to manage aspects of nics for decades, so it's not that strange an idea to use a tool and binary protocol rather than a text based interface for this, but there are advantages to both approaches.
https://lwn.net/Articles/674867/ There a is need for some userspace API that would allow to expose things that are not directly related to any device class like net_device of ib_device, but rather chip-wide/switch-ASIC-wide stuff.
Use cases:
- get/set of port type (Ethernet/InfiniBand)
- monitoring of hardware messages to and from chip
- setting up port splitters - split port into multiple ones and squash again, enables usage of splitter cable
- setting up shared buffers - shared among multiple ports within one chip
we can actually also retrieve the same information through sysfs, e.g.
|- [path to device]
   |--- migration
   |     |--- self
   |     |     |---device_api
   |     |     |---mdev_type
   |     |     |---software_version
   |     |     |---device_id
   |     |     |---aggregator
   |     |--- compatible
   |     |     |---device_api
   |     |     |---mdev_type
   |     |     |---software_version
   |     |     |---device_id
   |     |     |---aggregator
I feel like it's not very appropriate for a GPU driver to use this interface. Is that right?
I think not, though most of the users are switch or Ethernet devices. It doesn't prevent you from inventing new abstractions.
so we need to patch the devlink core and the userspace devlink tool? e.g. devlink migration
and the devlink python libs, if openstack was to use it directly. we do have cases where we just fork a process and execute a command in a shell with or without elevated privileges, but we really don't like doing that due to the performance impact and security implications, so where we can use python bindings over c apis we do. pyroute2 is the only python lib i know of off the top of my head that supports devlink, so we would need to enhance it to support this new devlink api. there may be others; i have not really looked in the past since we don't need to use devlink at all today.
Note that devlink is based on netlink, and netlink has been widely used by various subsystems other than networking.
the advantage of netlink that I see is that it can monitor device status and notify the upper layer that the migration database needs to be updated. But I'm not sure whether openstack would like to use this capability. As Sean said, it's heavy for openstack. it's heavy for the vendor driver as well :)
And devlink monitor currently listens for notifications and dumps the state changes. If we want to use it, we need to let it forward the notifications and dumped info to openstack, right?
i don't think we would use direct devlink monitoring in nova even if it was available. we could, but we already poll libvirt and the system for other resources periodically. we likely would just add monitoring via devlink to that periodic task. we certainly would not use it to detect a migration or a need to update a migration database (not sure what that is)
in reality, if we can consume this info indirectly via a libvirt api, that will be the approach we will take, at least for the libvirt driver in nova. for cyborg they may take a different approach. we already use pyroute2 in 2 projects, os-vif and neutron, and it does have devlink support, so the burden of using devlink is not that high for openstack, but it's a less friendly interface for configuration tools like ansible vs a filesystem-based approach.
Thanks Yan
On Fri, Aug 14, 2020 at 01:30:00PM +0100, Sean Mooney wrote:
On Fri, 2020-08-14 at 13:16 +0800, Yan Zhao wrote:
On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
On 2020/8/10 下午3:46, Yan Zhao wrote:
driver is it handled by?
It looks like devlink is specific to network devices; the header comment in devlink.h says: include/uapi/linux/devlink.h - Network physical device Netlink interface.
Actually not. I think there was some discussion last year, and the conclusion was to remove this comment.
It supports IB, and probably vDPA in the future.
hmm... sorry, I didn't find the discussion you referred to, only the discussion below about why devlink was added.
https://www.mail-archive.com/netdev@vger.kernel.org/msg95801.html
This doesn't seem to be too much related to networking? Why can't something like this be in sysfs?
It is related to networking quite a bit. There have been a couple of iterations of this, including sysfs and configfs implementations. A consensus was reached that this should be done via netlink. I believe netlink is really the best for this purpose. Sysfs is not a good idea.
https://www.mail-archive.com/netdev@vger.kernel.org/msg96102.html
there is already a way to change eth/ib via echo 'eth' > /sys/bus/pci/drivers/mlx4_core/0000:02:00.0/mlx4_port1
sounds like this is another way to achieve the same?
It is. However, the current way is driver-specific and not correct. For mlx5 we need the same thing, and it cannot be done in this way. So devlink is the correct way to go.
i'm not sure i agree with that. standardising a filesystem-based api that is used across all vendors is also a valid option. that said, if devlink is the right choice from a kernel perspective, by all means use it, but i have not heard a convincing argument for why it is actually better. with that said, we have been using tools like ethtool to manage aspects of nics for decades, so it's not that strange an idea to use a tool and a binary protocol rather than a text-based interface for this, but there are advantages to both approaches.
Yes, I agree with you.
https://lwn.net/Articles/674867/ There is a need for some userspace API that would allow to expose things that are not directly related to any device class like net_device or ib_device, but rather chip-wide/switch-ASIC-wide stuff.
Use cases:
- get/set of port type (Ethernet/InfiniBand)
- monitoring of hardware messages to and from chip
- setting up port splitters - split port into multiple ones and squash again, enables usage of splitter cable
- setting up shared buffers - shared among multiple ports within one chip
we can actually also retrieve the same information through sysfs, e.g.
|- [path to device]
   |--- migration
   |     |--- self
   |     |     |---device_api
   |     |     |---mdev_type
   |     |     |---software_version
   |     |     |---device_id
   |     |     |---aggregator
   |     |--- compatible
   |     |     |---device_api
   |     |     |---mdev_type
   |     |     |---software_version
   |     |     |---device_id
   |     |     |---aggregator
I feel like it's not very appropriate for a GPU driver to use this interface. Is that right?
I think not, though most of the users are switch or Ethernet devices. It doesn't prevent you from inventing new abstractions.
so we need to patch the devlink core and the userspace devlink tool? e.g. devlink migration
and the devlink python libs, if openstack was to use it directly. we do have cases where we just fork a process and execute a command in a shell with or without elevated privileges, but we really don't like doing that due to the performance impact and security implications, so where we can use python bindings over c apis we do. pyroute2 is the only python lib i know of off the top of my head that supports devlink, so we would need to enhance it to support this new devlink api. there may be others; i have not really looked in the past since we don't need to use devlink at all today.
Note that devlink is based on netlink, and netlink has been widely used by various subsystems other than networking.
the advantage of netlink that I see is that it can monitor device status and notify the upper layer that the migration database needs to be updated. But I'm not sure whether openstack would like to use this capability. As Sean said, it's heavy for openstack. it's heavy for the vendor driver as well :)
And devlink monitor currently listens for notifications and dumps the state changes. If we want to use it, we need to let it forward the notifications and dumped info to openstack, right?
i don't think we would use direct devlink monitoring in nova even if it was available. we could, but we already poll libvirt and the system for other resources periodically.
so, if we use a filesystem-based approach, could openstack periodically check and update the migration info? e.g. every minute, read /sys/<path to device>/migration/self/*, and if any file disappears, appears, or changes content, just let the placement know.
Then, when about to start a migration, check the source device's /sys/<path to src device>/migration/compatible/* and search the placement for an existing device matching it. if yes, create a vm with that device and migrate to it; if not, and if it's an mdev, try to create a matching one and migrate to it. (to create a matching mdev, I guess openstack can follow the sequence below:
1. find a target device with the same device id (e.g. parent pci id)
2. create an mdev with a matching mdev type
3. adjust other vendor-specific attributes
4. if 2 or 3 fails, go to 1 again
)
is this approach feasible?
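The periodic check described above could be sketched roughly as follows. This is only a minimal sketch under assumptions: it presumes the proposed (not yet existing) layout /sys/<path to device>/migration/self/ with one ASCII value per file, the names snapshot_self_attrs and diff_snapshots are made up for illustration, and reporting the result to placement is left out entirely.

```python
import os


def snapshot_self_attrs(device_path):
    """Read every file under <device_path>/migration/self into a dict.

    The directory layout is the one proposed in this thread, not an
    existing kernel interface.
    """
    self_dir = os.path.join(device_path, "migration", "self")
    attrs = {}
    if not os.path.isdir(self_dir):
        return attrs
    for name in sorted(os.listdir(self_dir)):
        path = os.path.join(self_dir, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path, "r", encoding="ascii", errors="replace") as f:
                attrs[name] = f.read().strip()
        except OSError:
            # An unreadable attribute is treated as absent.
            continue
    return attrs


def diff_snapshots(old, new):
    """Return (appeared, disappeared, changed) attribute names, each sorted."""
    appeared = sorted(set(new) - set(old))
    disappeared = sorted(set(old) - set(new))
    changed = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return appeared, disappeared, changed
```

A periodic task would keep the previous snapshot per device, take a new one on each run, and only notify placement when any of the three lists returned by diff_snapshots is non-empty.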
we likely would just add monitoring via devlink to that periodic task. we certainly would not use it to detect a migration or a need to update a migration database (not sure what that is)
by migration database, I meant the traits in the placement. :)
if periodic monitoring of devlink is required, then periodically monitoring sysfs is also viable, right?
in reality, if we can consume this info indirectly via a libvirt api, that will be the approach we will take, at least for the libvirt driver in nova. for cyborg they may take a different approach. we already use pyroute2 in 2 projects, os-vif and neutron, and it does have devlink support, so the burden of using devlink is not that high for openstack, but it's a less friendly interface for configuration tools like ansible vs a filesystem-based approach.
On 2020/8/14 下午1:16, Yan Zhao wrote:
On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
On 2020/8/10 下午3:46, Yan Zhao wrote:
driver is it handled by?
It looks like devlink is specific to network devices; the header comment in devlink.h says: include/uapi/linux/devlink.h - Network physical device Netlink interface.
Actually not. I think there was some discussion last year, and the conclusion was to remove this comment.
It supports IB, and probably vDPA in the future.
hmm... sorry, I didn't find the discussion you referred to, only the discussion below about why devlink was added.
https://www.mail-archive.com/netdev@vger.kernel.org/msg95801.html
This doesn't seem to be too much related to networking? Why can't something like this be in sysfs?
It is related to networking quite a bit. There have been a couple of iterations of this, including sysfs and configfs implementations. A consensus was reached that this should be done via netlink. I believe netlink is really the best for this purpose. Sysfs is not a good idea.
See the discussion here:
https://patchwork.ozlabs.org/project/netdev/patch/20191115223355.1277139-1-j...
https://www.mail-archive.com/netdev@vger.kernel.org/msg96102.html
there is already a way to change eth/ib via echo 'eth' > /sys/bus/pci/drivers/mlx4_core/0000:02:00.0/mlx4_port1
sounds like this is another way to achieve the same?
It is. However, the current way is driver-specific and not correct. For mlx5 we need the same thing, and it cannot be done in this way. So devlink is the correct way to go.
https://lwn.net/Articles/674867/ There is a need for some userspace API that would allow to expose things that are not directly related to any device class like net_device or ib_device, but rather chip-wide/switch-ASIC-wide stuff.
Use cases:
- get/set of port type (Ethernet/InfiniBand)
- monitoring of hardware messages to and from chip
- setting up port splitters - split port into multiple ones and squash again, enables usage of splitter cable
- setting up shared buffers - shared among multiple ports within one chip
we can actually also retrieve the same information through sysfs, e.g.
|- [path to device]
   |--- migration
   |     |--- self
   |     |     |---device_api
   |     |     |---mdev_type
   |     |     |---software_version
   |     |     |---device_id
   |     |     |---aggregator
   |     |--- compatible
   |     |     |---device_api
   |     |     |---mdev_type
   |     |     |---software_version
   |     |     |---device_id
   |     |     |---aggregator
Yes but:
- You need one file per attribute (one syscall for one attribute)
- Attribute is coupled with kobject
All of the above seems unnecessary.
Another point, as we discussed in another thread, it's really hard to make sure the above API works for all types of devices and frameworks. So having a vendor-specific API looks much better.
I feel like it's not very appropriate for a GPU driver to use this interface. Is that right?
I think not, though most of the users are switch or Ethernet devices. It doesn't prevent you from inventing new abstractions.
so we need to patch the devlink core and the userspace devlink tool? e.g. devlink migration
It's quite flexible: you can extend devlink, invent your own, or let mgmt establish devlink directly.
Note that devlink is based on netlink, and netlink has been widely used by various subsystems other than networking.
the advantage of netlink that I see is that it can monitor device status and notify the upper layer that the migration database needs to be updated.
I may be missing something, but why is this needed?
From the device's point of view, the following capabilities should be sufficient to support live migration:
- set/get device state
- report dirty page tracking
- set/get capability
But I'm not sure whether openstack would like to use this capability. As Sean said, it's heavy for openstack. it's heavy for the vendor driver as well :)
Well, it depends on several factors. Just counting LOCs, sysfs-based attributes are not lightweight.
Thanks
And devlink monitor currently listens for notifications and dumps the state changes. If we want to use it, we need to let it forward the notifications and dumped info to openstack, right?
Thanks Yan
On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
On 2020/8/14 下午1:16, Yan Zhao wrote:
On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
On 2020/8/10 下午3:46, Yan Zhao wrote:
driver is it handled by?
It looks like devlink is specific to network devices; the header comment in devlink.h says: include/uapi/linux/devlink.h - Network physical device Netlink interface.
Actually not. I think there was some discussion last year, and the conclusion was to remove this comment.
It supports IB, and probably vDPA in the future.
hmm... sorry, I didn't find the discussion you referred to, only the discussion below about why devlink was added.
https://www.mail-archive.com/netdev@vger.kernel.org/msg95801.html
This doesn't seem to be too much related to networking? Why can't something like this be in sysfs?
It is related to networking quite a bit. There have been a couple of iterations of this, including sysfs and configfs implementations. A consensus was reached that this should be done via netlink. I believe netlink is really the best for this purpose. Sysfs is not a good idea.
See the discussion here:
https://patchwork.ozlabs.org/project/netdev/patch/20191115223355.1277139-1-j...
https://www.mail-archive.com/netdev@vger.kernel.org/msg96102.html
there is already a way to change eth/ib via echo 'eth' > /sys/bus/pci/drivers/mlx4_core/0000:02:00.0/mlx4_port1
sounds like this is another way to achieve the same?
It is. However, the current way is driver-specific and not correct. For mlx5 we need the same thing, and it cannot be done in this way. So devlink is the correct way to go.
https://lwn.net/Articles/674867/ There is a need for some userspace API that would allow to expose things that are not directly related to any device class like net_device or ib_device, but rather chip-wide/switch-ASIC-wide stuff.
Use cases:
- get/set of port type (Ethernet/InfiniBand)
- monitoring of hardware messages to and from chip
- setting up port splitters - split port into multiple ones and squash again, enables usage of splitter cable
- setting up shared buffers - shared among multiple ports within one chip
we can actually also retrieve the same information through sysfs, e.g.
|- [path to device]
   |--- migration
   |     |--- self
   |     |     |---device_api
   |     |     |---mdev_type
   |     |     |---software_version
   |     |     |---device_id
   |     |     |---aggregator
   |     |--- compatible
   |     |     |---device_api
   |     |     |---mdev_type
   |     |     |---software_version
   |     |     |---device_id
   |     |     |---aggregator
Yes but:
- You need one file per attribute (one syscall for one attribute)
- Attribute is coupled with kobject
All of the above seems unnecessary.
Another point, as we discussed in another thread, it's really hard to make sure the above API works for all types of devices and frameworks. So having a vendor-specific API looks much better.
From the POV of userspace mgmt apps doing device compat checking / migration, we certainly do NOT want to use different vendor specific APIs. We want to have an API that can be used / controlled in a standard manner across vendors.
Regards, Daniel
Your mail came through as HTML-only so all the quoting and attribution is mangled / lost now :-(
On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
On 2020/8/18 下午4:55, Daniel P. Berrangé wrote:
On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
On 2020/8/14 下午1:16, Yan Zhao wrote:
On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
On 2020/8/10 下午3:46, Yan Zhao wrote:
we can actually also retrieve the same information through sysfs, e.g.
|- [path to device]
   |--- migration
   |     |--- self
   |     |     |---device_api
   |     |     |---mdev_type
   |     |     |---software_version
   |     |     |---device_id
   |     |     |---aggregator
   |     |--- compatible
   |     |     |---device_api
   |     |     |---mdev_type
   |     |     |---software_version
   |     |     |---device_id
   |     |     |---aggregator
Yes but:
- You need one file per attribute (one syscall for one attribute)
- Attribute is coupled with kobject
All of the above seems unnecessary.
Another point, as we discussed in another thread, it's really hard to make sure the above API works for all types of devices and frameworks. So having a vendor-specific API looks much better.
From the POV of userspace mgmt apps doing device compat checking / migration, we certainly do NOT want to use different vendor specific APIs. We want to have an API that can be used / controlled in a standard manner across vendors.
Yes, but it could be hard. E.g. vDPA will choose to use devlink (there's a long debate on sysfs vs devlink). So if we go with sysfs, at least two APIs need to be supported ...
NB, I was not questioning devlink vs sysfs directly. If devlink is related to netlink, I can't say I'm enthusiastic as IMKE sysfs is easier to deal with. I don't know enough about devlink to have much of an opinion though. The key point was that I don't want the userspace APIs we need to deal with to be vendor specific.
What I care about is that we have a *standard* userspace API for performing device compatibility checking / state migration, for use by QEMU/libvirt/OpenStack, such that we can write code without countless vendor specific code paths.
If there is vendor specific stuff on the side, that's fine as we can ignore that, but the core functionality for device compat / migration needs to be standardized.
Regards, Daniel
On Tue, 18 Aug 2020 10:16:28 +0100 Daniel P. Berrangé berrange@redhat.com wrote:
On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
On 2020/8/18 下午4:55, Daniel P. Berrangé wrote:
On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
On 2020/8/14 下午1:16, Yan Zhao wrote:
On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
On 2020/8/10 下午3:46, Yan Zhao wrote:
we can actually also retrieve the same information through sysfs, e.g.
|- [path to device]
   |--- migration
   |     |--- self
   |     |     |---device_api
   |     |     |---mdev_type
   |     |     |---software_version
   |     |     |---device_id
   |     |     |---aggregator
   |     |--- compatible
   |     |     |---device_api
   |     |     |---mdev_type
   |     |     |---software_version
   |     |     |---device_id
   |     |     |---aggregator
Yes but:
- You need one file per attribute (one syscall for one attribute)
- Attribute is coupled with kobject
Is that really that bad? You have the device with an embedded kobject anyway, and you can just put things into an attribute group?
[Also, I think that self/compatible split in the example makes things needlessly complex. Shouldn't semantic versioning and matching already cover nearly everything? I would expect very few cases that are more complex than that. Maybe the aggregation stuff, but I don't think we need that self/compatible split for that, either.]
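To make the semantic-versioning idea a bit more concrete, here is a minimal sketch. It assumes a major.minor.bugfix version string, and the matching policy (same major, target minor not older than the source minor, bugfix ignored) is purely an assumption for illustration and not something agreed in this thread; the names parse_version and version_compatible are hypothetical.

```python
def parse_version(text):
    """Parse 'major.minor.bugfix' into a tuple of three ints."""
    parts = text.strip().split(".")
    if len(parts) != 3:
        raise ValueError(f"expected major.minor.bugfix, got {text!r}")
    return tuple(int(p) for p in parts)


def version_compatible(src, dst):
    """Assumed policy: same major, and target minor >= source minor.

    The bugfix component is ignored. A real vendor driver may well
    need a different rule.
    """
    s_major, s_minor, _ = parse_version(src)
    d_major, d_minor, _ = parse_version(dst)
    return s_major == d_major and d_minor >= s_minor
```

With a rule like this a single version attribute per device would be enough for the common case, and only the exceptions (such as aggregation) would need extra attributes.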
All of the above seems unnecessary.
Another point, as we discussed in another thread, it's really hard to make sure the above API works for all types of devices and frameworks. So having a vendor-specific API looks much better.
From the POV of userspace mgmt apps doing device compat checking / migration, we certainly do NOT want to use different vendor specific APIs. We want to have an API that can be used / controlled in a standard manner across vendors.
Yes, but it could be hard. E.g. vDPA will choose to use devlink (there's a long debate on sysfs vs devlink). So if we go with sysfs, at least two APIs need to be supported ...
NB, I was not questioning devlink vs sysfs directly. If devlink is related to netlink, I can't say I'm enthusiastic as IMKE sysfs is easier to deal with. I don't know enough about devlink to have much of an opinion though. The key point was that I don't want the userspace APIs we need to deal with to be vendor specific.
From what I've seen of devlink, it seems quite nice; but I understand why sysfs might be easier to deal with (especially as there's likely already a lot of code using it.)
I understand that some users would like devlink because it is already widely used for network drivers (and some others), but I don't think the majority of devices used with vfio are network (although certainly a lot of them are.)
What I care about is that we have a *standard* userspace API for performing device compatibility checking / state migration, for use by QEMU/libvirt/OpenStack, such that we can write code without countless vendor specific code paths.
If there is vendor specific stuff on the side, that's fine as we can ignore that, but the core functionality for device compat / migration needs to be standardized.
To summarize:
- choose one of sysfs or devlink
- have a common interface, with a standardized way to add vendor-specific attributes
?
Hi Cornelia,
From: Cornelia Huck cohuck@redhat.com
Sent: Tuesday, August 18, 2020 3:07 PM
To: Daniel P. Berrangé berrange@redhat.com
Cc: Jason Wang jasowang@redhat.com; Yan Zhao yan.y.zhao@intel.com; kvm@vger.kernel.org; libvir-list@redhat.com; qemu-devel@nongnu.org; Kirti Wankhede kwankhede@nvidia.com; eauger@redhat.com; xin-ran.wang@intel.com; corbet@lwn.net; openstack-discuss@lists.openstack.org; shaohe.feng@intel.com; kevin.tian@intel.com; Parav Pandit parav@mellanox.com; jian-feng.ding@intel.com; dgilbert@redhat.com; zhenyuw@linux.intel.com; hejie.xu@intel.com; bao.yumeng@zte.com.cn; Alex Williamson alex.williamson@redhat.com; eskultet@redhat.com; smooney@redhat.com; intel-gvt-dev@lists.freedesktop.org; Jiri Pirko jiri@mellanox.com; dinechin@redhat.com; devel@ovirt.org
Subject: Re: device compatibility interface for live migration with assigned devices
On Tue, 18 Aug 2020 10:16:28 +0100 Daniel P. Berrangé berrange@redhat.com wrote:
On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
On 2020/8/18 下午4:55, Daniel P. Berrangé wrote:
On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
On 2020/8/14 下午1:16, Yan Zhao wrote:
On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
On 2020/8/10 下午3:46, Yan Zhao wrote:
we can actually also retrieve the same information through sysfs, e.g.
|- [path to device]
   |--- migration
   |     |--- self
   |     |     |---device_api
   |     |     |---mdev_type
   |     |     |---software_version
   |     |     |---device_id
   |     |     |---aggregator
   |     |--- compatible
   |     |     |---device_api
   |     |     |---mdev_type
   |     |     |---software_version
   |     |     |---device_id
   |     |     |---aggregator
Yes but:
- You need one file per attribute (one syscall for one attribute)
- Attribute is coupled with kobject
Is that really that bad? You have the device with an embedded kobject anyway, and you can just put things into an attribute group?
[Also, I think that self/compatible split in the example makes things needlessly complex. Shouldn't semantic versioning and matching already cover nearly everything? I would expect very few cases that are more complex than that. Maybe the aggregation stuff, but I don't think we need that self/compatible split for that, either.]
All of the above seems unnecessary.
Another point, as we discussed in another thread, it's really hard to make sure the above API works for all types of devices and frameworks. So having a vendor-specific API looks much better.
From the POV of userspace mgmt apps doing device compat checking / migration, we certainly do NOT want to use different vendor specific APIs. We want to have an API that can be used / controlled in a standard manner across vendors.
Yes, but it could be hard. E.g. vDPA will choose to use devlink (there's a long debate on sysfs vs devlink). So if we go with sysfs, at least two APIs need to be supported ...
NB, I was not questioning devlink vs sysfs directly. If devlink is related to netlink, I can't say I'm enthusiastic as IMKE sysfs is easier to deal with. I don't know enough about devlink to have much of an opinion though. The key point was that I don't want the userspace APIs we need to deal with to be vendor specific.
From what I've seen of devlink, it seems quite nice; but I understand why sysfs might be easier to deal with (especially as there's likely already a lot of code using it.)
I understand that some users would like devlink because it is already widely used for network drivers (and some others), but I don't think the majority of devices used with vfio are network (although certainly a lot of them are.)
What I care about is that we have a *standard* userspace API for performing device compatibility checking / state migration, for use by QEMU/libvirt/OpenStack, such that we can write code without countless vendor specific code paths.
If there is vendor specific stuff on the side, that's fine as we can ignore that, but the core functionality for device compat / migration needs to be standardized.
To summarize:
- choose one of sysfs or devlink
- have a common interface, with a standardized way to add vendor-specific attributes
?
Please refer to my previous email, which has more examples and details.
On Tue, Aug 18, 2020 at 09:39:24AM +0000, Parav Pandit wrote:
Hi Cornelia,
On Tue, 18 Aug 2020 10:16:28 +0100 Daniel P. Berrangé berrange@redhat.com wrote:
On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
On 2020/8/18 下午4:55, Daniel P. Berrangé wrote:
On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
On 2020/8/14 下午1:16, Yan Zhao wrote:
On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
On 2020/8/10 下午3:46, Yan Zhao wrote:
we can actually also retrieve the same information through sysfs, e.g.
|- [path to device]
   |--- migration
   |     |--- self
   |     |     |---device_api
   |     |     |---mdev_type
   |     |     |---software_version
   |     |     |---device_id
   |     |     |---aggregator
   |     |--- compatible
   |     |     |---device_api
   |     |     |---mdev_type
   |     |     |---software_version
   |     |     |---device_id
   |     |     |---aggregator
Yes but:
- You need one file per attribute (one syscall for one attribute)
- Attribute is coupled with kobject
Is that really that bad? You have the device with an embedded kobject anyway, and you can just put things into an attribute group?
[Also, I think that self/compatible split in the example makes things needlessly complex. Shouldn't semantic versioning and matching already cover nearly everything? I would expect very few cases that are more complex than that. Maybe the aggregation stuff, but I don't think we need that self/compatible split for that, either.]
All of the above seems unnecessary.
Another point, as we discussed in another thread, it's really hard to make sure the above API works for all types of devices and frameworks. So having a vendor-specific API looks much better.
From the POV of userspace mgmt apps doing device compat checking / migration, we certainly do NOT want to use different vendor specific APIs. We want to have an API that can be used / controlled in a standard manner across vendors.
Yes, but it could be hard. E.g. vDPA will choose to use devlink (there's a long debate on sysfs vs devlink). So if we go with sysfs, at least two APIs need to be supported ...
NB, I was not questioning devlink vs sysfs directly. If devlink is related to netlink, I can't say I'm enthusiastic as IMKE sysfs is easier to deal with. I don't know enough about devlink to have much of an opinion though. The key point was that I don't want the userspace APIs we need to deal with to be vendor specific.
From what I've seen of devlink, it seems quite nice; but I understand why sysfs might be easier to deal with (especially as there's likely already a lot of code using it.)
I understand that some users would like devlink because it is already widely used for network drivers (and some others), but I don't think the majority of devices used with vfio are network (although certainly a lot of them are.)
What I care about is that we have a *standard* userspace API for performing device compatibility checking / state migration, for use by QEMU/libvirt/OpenStack, such that we can write code without countless vendor specific code paths.
If there is vendor specific stuff on the side, that's fine as we can ignore that, but the core functionality for device compat / migration needs to be standardized.
To summarize:
- choose one of sysfs or devlink
- have a common interface, with a standardized way to add vendor-specific attributes
?
Please refer to my previous email, which has more examples and details.
hi Parav, the example is based on a new vdpa tool running over netlink, not based on devlink, right? For vfio migration compatibility, we have to deal with both mdev and physical pci devices, I don't think it's a good idea to write a new tool for it, given we are able to retrieve the same info from sysfs and there's already an mdevctl from Alex (https://github.com/mdevctl/mdevctl).
hi All, could we decide that sysfs is the interface that every VFIO vendor driver needs to provide in order to support vfio live migration, otherwise the userspace management tool would not list the device into the compatible list?
if that's true, let's move on to standardizing the sysfs interface.
(1) content
common part: (must)
- software_version: (in major.minor.bugfix scheme)
- device_api: vfio-pci or vfio-ccw ...
- type: mdev type for an mdev device, or a signature for a physical device which is the counterpart of the mdev type.

device api specific part: (must)
- pci id: pci id of the mdev parent device or pci id of the physical pci device (device_api is vfio-pci)
- subchannel_type (device_api is vfio-ccw)

vendor driver specific part: (optional)
- aggregator
- chpid_type
- remote_url

NOTE: vendors are free to add attributes in this part, with the restriction that each such attribute must also be configurable under the same name in sysfs. e.g. for aggregator, there must be a sysfs attribute in the device node /sys/devices/pci0000:00/0000:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator, so that the userspace tool is able to configure the target device according to the source device's aggregator attribute.
(2) where and structure
proposal 1:
|- [path to device]
   |--- migration
   |     |--- self
   |     |     |-software_version
   |     |     |-device_api
   |     |     |-type
   |     |     |-[pci_id or subchannel_type]
   |     |     |-<aggregator or chpid_type>
   |     |--- compatible
   |     |     |-software_version
   |     |     |-device_api
   |     |     |-type
   |     |     |-[pci_id or subchannel_type]
   |     |     |-<aggregator or chpid_type>
multiple compatible is allowed.
attributes should be ASCII text files, preferably with only one value per file.

proposal 2: use bin_attribute.
|- [path to device]
   |--- migration
   |     |--- self
   |     |--- compatible

so we can continue to use the multiline format. e.g.
cat compatible
  software_version=0.1.0
  device_api=vfio_pci
  type=i915-GVTg_V5_{val1:int:1,2,4,8}
  pci_id=80865963
  aggregator={val1}/2
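For proposal 2, a userspace consumer would have to parse the multiline text itself. A minimal sketch, assuming one key=value pair per line as in the example above; the function name parse_compat_text is made up for illustration, and the {val1:int:1,2,4,8} / {val1}/2 template syntax is deliberately kept as an opaque string rather than interpreted.

```python
def parse_compat_text(text):
    """Parse 'key=value' lines into a dict, skipping blank lines.

    Values are kept as opaque strings; template expressions such as
    '{val1:int:1,2,4,8}' are not expanded here.
    """
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep or not key:
            raise ValueError(f"malformed line: {line!r}")
        attrs[key.strip()] = value.strip()
    return attrs
```

The trade-off against proposal 1 is visible here: one read returns all attributes, but every consumer then needs an agreed text format and a parser like this one.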
Thanks Yan
From: Yan Zhao yan.y.zhao@intel.com Sent: Wednesday, August 19, 2020 9:01 AM
On Tue, Aug 18, 2020 at 09:39:24AM +0000, Parav Pandit wrote:
Please refer to my previous email which has more example and details.
hi Parav, the example is based on a new vdpa tool running over netlink, not based on devlink, right?
Right.
For vfio migration compatibility, we have to deal with both mdev and physical pci devices, I don't think it's a good idea to write a new tool for it, given we are able to retrieve the same info from sysfs and there's already an mdevctl from
The mdev attributes should be visible in the mdev's sysfs tree. I do not propose to write a new mdev tool over netlink. I am sorry if I implied that with my suggestion of a vdpa tool.
If the underlying device is vdpa, mdev might be able to understand the vdpa device, query it, and populate the mdev sysfs tree.
The vdpa tool I propose is usable even without mdevs. The vdpa tool's role is to create one or more vdpa devices and place them on the "vdpa" bus, which is the lowest layer here. Additionally, this tool lets the user query virtqueue stats and db stats. When a user creates a vdpa net device, the user may need to configure features of the vdpa device such as VIRTIO_NET_F_MAC and the default VIRTIO_NET_F_MTU. These are vdpa-level features and attributes. Mdev is a layer above it.
Alex (https://nam03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub. com%2Fmdevctl%2Fmdevctl&data=02%7C01%7Cparav%40nvidia.com%7C 0c2691d430304f5ea11308d843f2d84e%7C43083d15727340c1b7db39efd9ccc17 a%7C0%7C0%7C637334057571911357&sdata=KxH7PwxmKyy9JODut8BWr LQyOBylW00%2Fyzc4rEvjUvA%3D&reserved=0).
Sorry for the link mangling above. Our mail server is still transitioning due to the company acquisition.
I am not familiar enough with the remaining points (the sysfs interface proposal) to comment.
On 2020/8/19 1:58 PM, Parav Pandit wrote:
From: Yan Zhao yan.y.zhao@intel.com Sent: Wednesday, August 19, 2020 9:01 AM On Tue, Aug 18, 2020 at 09:39:24AM +0000, Parav Pandit wrote:
Please refer to my previous email which has more example and details.
hi Parav, the example is based on a new vdpa tool running over netlink, not based on devlink, right?
Right.
For vfio migration compatibility, we have to deal with both mdev and physical pci devices, I don't think it's a good idea to write a new tool for it, given we are able to retrieve the same info from sysfs and there's already an mdevctl from
mdev attribute should be visible in the mdev's sysfs tree. I do not propose to write a new mdev tool over netlink. I am sorry if I implied that with my suggestion of vdpa tool.
If underlying device is vdpa, mdev might be able to understand vdpa device and query from it and populate in mdev sysfs tree.
Note that vdpa is bus independent, so this can't work now; support for mdev on top of vDPA has been rejected (it duplicates vhost-vDPA).
Thanks
On 2020/8/19 11:30 AM, Yan Zhao wrote:
hi All, could we decide that sysfs is the interface that every VFIO vendor driver needs to provide in order to support vfio live migration, otherwise the userspace management tool would not list the device into the compatible list?
if that's true, let's move to the standardizing of the sysfs interface. (1) content common part: (must) - software_version: (in major.minor.bugfix scheme)
This cannot work for devices whose features can be negotiated/advertised independently (e.g. virtio devices).
- device_api: vfio-pci or vfio-ccw ... - type: mdev type for mdev device or a signature for physical device which is a counterpart for mdev type.
device api specific part: (must)
- pci id: pci id of mdev parent device or pci id of physical pci device (device_api is vfio-pci)
So this assumes a PCI device which is probably not true.
- subchannel_type (device_api is vfio-ccw)
vendor driver specific part: (optional)
- aggregator
- chpid_type
- remote_url
For "remote_url", I just wonder whether it would be better to integrate or reuse the existing NVMe management interface instead of duplicating it here. Otherwise it could be a burden for mgmt to learn, e.g. vendor A may use "remote_url" but vendor B may use a different attribute.
NOTE: vendors are free to add attributes in this part with a restriction that this attribute is able to be configured with the same name in sysfs too. e.g.
Sysfs works well for common attributes that belong to a class, but I'm not sure it works well for device/vendor specific attributes. Does this mean mgmt needs to iterate over all the attributes on both src and dst?
for aggregator, there must be a sysfs attribute in device node /sys/devices/pci0000:00/0000:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator, so that the userspace tool is able to configure the target device according to source device's aggregator attribute.
So basically two questions:
- how hard is it to standardize a sysfs API for the compatibility check (so that it works for most types of devices)?
- how hard is it for mgmt to learn vendor specific attributes (vs. the existing management API)?
Thanks
Thanks Yan
On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote:
On 2020/8/19 11:30 AM, Yan Zhao wrote:
hi All, could we decide that sysfs is the interface that every VFIO vendor driver needs to provide in order to support vfio live migration, otherwise the userspace management tool would not list the device into the compatible list?
if that's true, let's move to the standardizing of the sysfs interface. (1) content common part: (must) - software_version: (in major.minor.bugfix scheme)
This can not work for devices whose features can be negotiated/advertised independently. (E.g virtio devices)
Sorry, I don't understand this: why do virtio devices need to use the vfio interface? I think this thread is about vfio related devices.
- device_api: vfio-pci or vfio-ccw ... - type: mdev type for mdev device or a signature for physical device which is a counterpart for mdev type.
device api specific part: (must)
- pci id: pci id of mdev parent device or pci id of physical pci device (device_api is vfio-pci)
So this assumes a PCI device which is probably not true.
For a device_api of vfio-pci, why is that not true?
For vfio-ccw, it's subchannel_type.
- subchannel_type (device_api is vfio-ccw)
vendor driver specific part: (optional)
- aggregator
- chpid_type
- remote_url
For "remote_url", just wonder if it's better to integrate or reuse the existing NVME management interface instead of duplicating it here. Otherwise it could be a burden for mgmt to learn. E.g vendor A may use "remote_url" but vendor B may use a different attribute.
It's vendor driver specific. Vendor specific attributes are inevitable, and that's why we are discussing a way to standardize them here. Our goal is that mgmt can use them without understanding the meaning of vendor specific attributes.
NOTE: vendors are free to add attributes in this part with a restriction that this attribute is able to be configured with the same name in sysfs too. e.g.
Sysfs works well for common attributes belongs to a class, but I'm not sure it can work well for device/vendor specific attributes. Does this mean mgmt need to iterate all the attributes in both src and dst?
No, just the attributes under the migration directory.
for aggregator, there must be a sysfs attribute in device node /sys/devices/pci0000:00/0000:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator, so that the userspace tool is able to configure the target device according to source device's aggregator attribute.
So basically two questions:
- how hard to standardize sysfs API for dealing with compatibility check (to
make it work for most types of devices)
Sorry, I only know that we are in the process of standardizing it :)
- how hard for the mgmt to learn with a vendor specific attributes (vs
existing management API)
What is the existing management API?
Thanks
On 2020/8/19 2:59 PM, Yan Zhao wrote:
On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote:
On 2020/8/19 11:30 AM, Yan Zhao wrote:
hi All, could we decide that sysfs is the interface that every VFIO vendor driver needs to provide in order to support vfio live migration, otherwise the userspace management tool would not list the device into the compatible list?
if that's true, let's move to the standardizing of the sysfs interface. (1) content common part: (must) - software_version: (in major.minor.bugfix scheme)
This can not work for devices whose features can be negotiated/advertised independently. (E.g virtio devices)
sorry, I don't understand here, why virtio devices need to use vfio interface?
I don't see any reason that virtio devices can't be used by VFIO. Do you?
Actually, virtio devices have been used with VFIO for many years:
- passing through a hardware virtio device to userspace (VM) drivers
- using the virtio PMD inside a guest
I think this thread is discussing about vfio related devices.
- device_api: vfio-pci or vfio-ccw ... - type: mdev type for mdev device or a signature for physical device which is a counterpart for mdev type.
device api specific part: (must)
- pci id: pci id of mdev parent device or pci id of physical pci device (device_api is vfio-pci)
So this assumes a PCI device which is probably not true.
for device_api of vfio-pci, why it's not true?
for vfio-ccw, it's subchannel_type.
OK, but having two different attributes for the same file is not a good idea. How does mgmt know whether there will be a 3rd type?
- subchannel_type (device_api is vfio-ccw)
vendor driver specific part: (optional) - aggregator - chpid_type - remote_url
For "remote_url", just wonder if it's better to integrate or reuse the existing NVME management interface instead of duplicating it here. Otherwise it could be a burden for mgmt to learn. E.g vendor A may use "remote_url" but vendor B may use a different attribute.
it's vendor driver specific. vendor specific attributes are inevitable, and that's why we are discussing here of a way to standardizing of it.
Well, then you will end up with a very long list to discuss. E.g. for networking devices, you will have "mac", "v(x)lan" and a lot of others.
Note that "remote_url" is not vendor specific but NVME (class/subsystem) specific.
The point is: if a vendor/class specific part is unavoidable, why not make all of the attributes vendor specific?
our goal is that mgmt can use it without understanding the meaning of vendor specific attributes.
I'm not sure this is the correct design of uAPI. Is there something similar in the existing uAPIs?
And it might be hard to make it work for virtio devices.
NOTE: vendors are free to add attributes in this part with a restriction that this attribute is able to be configured with the same name in sysfs too. e.g.
Sysfs works well for common attributes belongs to a class, but I'm not sure it can work well for device/vendor specific attributes. Does this mean mgmt need to iterate all the attributes in both src and dst?
no. just attributes under migration directory.
for aggregator, there must be a sysfs attribute in device node /sys/devices/pci0000:00/0000:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator, so that the userspace tool is able to configure the target device according to source device's aggregator attribute.
So basically two questions:
- how hard to standardize sysfs API for dealing with compatibility check (to
make it work for most types of devices)
sorry, I just know we are in the process of standardizing of it :)
It's not easy. As I said, the current design can't work for virtio devices and it's not hard to find other examples. I remember some Intel devices have bitmask based capability registers.
- how hard for the mgmt to learn with a vendor specific attributes (vs
existing management API)
what is existing management API?
It depends on the type of device. E.g. for NVMe, we already have one (/sys/kernel/config/nvme)?
Thanks
Thanks
On Wed, Aug 19, 2020 at 03:39:50PM +0800, Jason Wang wrote:
On 2020/8/19 2:59 PM, Yan Zhao wrote:
On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote:
On 2020/8/19 11:30 AM, Yan Zhao wrote:
hi All, could we decide that sysfs is the interface that every VFIO vendor driver needs to provide in order to support vfio live migration, otherwise the userspace management tool would not list the device into the compatible list?
if that's true, let's move to the standardizing of the sysfs interface. (1) content common part: (must) - software_version: (in major.minor.bugfix scheme)
This can not work for devices whose features can be negotiated/advertised independently. (E.g virtio devices)
sorry, I don't understand here, why virtio devices need to use vfio interface?
I don't see any reason that virtio devices can't be used by VFIO. Do you?
Actually, virtio devices have been used by VFIO for many years:
- passthrough a hardware virtio devices to userspace(VM) drivers
- using virtio PMD inside guest
So how is that different from passing through physical hardware via VFIO? Even though the features are negotiated dynamically, could you explain why that would make software_version not work?
I think this thread is discussing about vfio related devices.
- device_api: vfio-pci or vfio-ccw ... - type: mdev type for mdev device or a signature for physical device which is a counterpart for mdev type.
device api specific part: (must)
- pci id: pci id of mdev parent device or pci id of physical pci device (device_api is vfio-pci)
So this assumes a PCI device which is probably not true.
for device_api of vfio-pci, why it's not true?
for vfio-ccw, it's subchannel_type.
Ok but having two different attributes for the same file is not good idea. How mgmt know there will be a 3rd type?
That's why some attributes need to be common. e.g.
device_api: it's common because mgmt needs to know whether it's a PCI device or a CCW device, and the api type is already defined in vfio.h. (This field was agreed on, and actually suggested, by Alex in a previous mail.)
type: mdev_type for mdev. If mgmt does not understand it, it will not be able to create a compatible mdev device.
software_version: mgmt can compare the major and minor numbers if it understands this field.
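To make the software_version comparison concrete, here is a sketch under a stated assumption: this thread does not define the compatibility rule, so the policy below (same major, target minor not older than source minor, bugfix ignored) is only one plausible choice, and the function names are hypothetical.

```python
def parse_version(text):
    """Split a 'major.minor.bugfix' string into a tuple of three ints."""
    parts = text.strip().split(".")
    if len(parts) != 3:
        raise ValueError(f"expected major.minor.bugfix, got {text!r}")
    major, minor, bugfix = (int(p) for p in parts)
    return major, minor, bugfix


def version_compatible(src, dst):
    """ASSUMED policy: same major, and dst minor not older than src minor."""
    s_major, s_minor, _ = parse_version(src)
    d_major, d_minor, _ = parse_version(dst)
    return s_major == d_major and d_minor >= s_minor
```

Whatever rule is finally chosen, the point is that mgmt only needs to understand the numbering scheme, not the vendor-specific meaning behind each bump.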
- subchannel_type (device_api is vfio-ccw)
vendor driver specific part: (optional) - aggregator - chpid_type - remote_url
For "remote_url", just wonder if it's better to integrate or reuse the existing NVME management interface instead of duplicating it here. Otherwise it could be a burden for mgmt to learn. E.g vendor A may use "remote_url" but vendor B may use a different attribute.
it's vendor driver specific. vendor specific attributes are inevitable, and that's why we are discussing here of a way to standardizing of it.
Well, then you will end up with a very long list to discuss. E.g for networking devices, you will have "mac", "v(x)lan" and a lot of other.
Note that "remote_url" is not vendor specific but NVME (class/subsystem) specific.
Yes, it's just NVMe specific. I added it as an example of what is vendor specific. If one attribute is vendor specific across all vendors, then it's not vendor specific; it's already a common attribute, right?
The point is that if vendor/class specific part is unavoidable, why not making all of the attributes vendor specific?
Some parts need to be common, as I listed above.
our goal is that mgmt can use it without understanding the meaning of vendor specific attributes.
I'm not sure this is the correct design of uAPI. Is there something similar in the existing uAPIs?
And it might be hard to work for virtio devices.
NOTE: vendors are free to add attributes in this part with a restriction that this attribute is able to be configured with the same name in sysfs too. e.g.
Sysfs works well for common attributes belongs to a class, but I'm not sure it can work well for device/vendor specific attributes. Does this mean mgmt need to iterate all the attributes in both src and dst?
no. just attributes under migration directory.
for aggregator, there must be a sysfs attribute in device node /sys/devices/pci0000:00/0000:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator, so that the userspace tool is able to configure the target device according to source device's aggregator attribute.
So basically two questions:
- how hard to standardize sysfs API for dealing with compatibility check (to
make it work for most types of devices)
sorry, I just know we are in the process of standardizing of it :)
It's not easy. As I said, the current design can't work for virtio devices and it's not hard to find other examples. I remember some Intel devices have bitmask based capability registers.
Some Intel devices have bitmask based capability registers. So what? We have defined pci_id to identify the devices. Even if two different devices have equal PCI IDs, we still allow them to add vendor specific fields, e.g. QAT can add alg_set to identify the hardware-supported algorithms.
- how hard for the mgmt to learn with a vendor specific attributes (vs
existing management API)
what is existing management API?
It depends on the type of devices. E.g for NVME, we've already had one (/sys/kernel/config/nvme)?
If the device is bound to vfio or vfio-mdev, I believe this interface is not there.
Thanks Yan
On 2020/8/19 4:13 PM, Yan Zhao wrote:
On Wed, Aug 19, 2020 at 03:39:50PM +0800, Jason Wang wrote:
On 2020/8/19 2:59 PM, Yan Zhao wrote:
On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote:
On 2020/8/19 11:30 AM, Yan Zhao wrote:
hi All, could we decide that sysfs is the interface that every VFIO vendor driver needs to provide in order to support vfio live migration, otherwise the userspace management tool would not list the device into the compatible list?
if that's true, let's move to the standardizing of the sysfs interface. (1) content common part: (must) - software_version: (in major.minor.bugfix scheme)
This can not work for devices whose features can be negotiated/advertised independently. (E.g virtio devices)
sorry, I don't understand here, why virtio devices need to use vfio interface?
I don't see any reason that virtio devices can't be used by VFIO. Do you?
Actually, virtio devices have been used by VFIO for many years:
- passthrough a hardware virtio devices to userspace(VM) drivers
- using virtio PMD inside guest
So, what's different for it vs passing through a physical hardware via VFIO?
The difference is that, in the guest, the device could be either real hardware or an emulated one.
even though the features are negotiated dynamically, could you explain why it would cause software_version not work?
Virtio device 1 supports feature A, B, C
Virtio device 2 supports feature B, C, D
So you can't migrate a guest from device 1 to device 2. And it's impossible to model the features with versions.
I think this thread is discussing about vfio related devices.
- device_api: vfio-pci or vfio-ccw ... - type: mdev type for mdev device or a signature for physical device which is a counterpart for mdev type.
device api specific part: (must)
- pci id: pci id of mdev parent device or pci id of physical pci device (device_api is vfio-pci)
So this assumes a PCI device which is probably not true.
for device_api of vfio-pci, why it's not true?
for vfio-ccw, it's subchannel_type.
Ok but having two different attributes for the same file is not good idea. How mgmt know there will be a 3rd type?
that's why some attributes need to be common. e.g. device_api: it's common because mgmt need to know it's a pci device or a ccw device. and the api type is already defined vfio.h. (The field is agreed by and actually suggested by Alex in previous mail) type: mdev_type for mdev. if mgmt does not understand it, it would not be able to create one compatible mdev device. software_version: mgmt can compare the major and minor if it understands this fields.
I think it would be helpful if you could describe, step by step, how mgmt is expected to work with the proposed sysfs API. This can help people understand.
Thanks for your patience. Since sysfs is uABI, once it is accepted we need to support it forever. That's why we need to be careful.
- subchannel_type (device_api is vfio-ccw)
vendor driver specific part: (optional) - aggregator - chpid_type - remote_url
For "remote_url", just wonder if it's better to integrate or reuse the existing NVME management interface instead of duplicating it here. Otherwise it could be a burden for mgmt to learn. E.g vendor A may use "remote_url" but vendor B may use a different attribute.
it's vendor driver specific. vendor specific attributes are inevitable, and that's why we are discussing here of a way to standardizing of it.
Well, then you will end up with a very long list to discuss. E.g for networking devices, you will have "mac", "v(x)lan" and a lot of other.
Note that "remote_url" is not vendor specific but NVME (class/subsystem) specific.
yes, it's just NVMe specific. I added it as an example to show what is vendor specific. if one attribute is vendor specific across all vendors, then it's not vendor specific, it's already common attribute, right?
It's common, but the issue is about naming and mgmt overhead. Unless you have a unified API per class (NVMe, ethernet, etc.), you can't prevent a vendor from using another name instead of "remote_url".
The point is that if vendor/class specific part is unavoidable, why not making all of the attributes vendor specific?
some parts need to be common, as I listed above.
This is hard unless VFIO knows the type of device (e.g. whether it's an NVMe or a networking device).
our goal is that mgmt can use it without understanding the meaning of vendor specific attributes.
I'm not sure this is the correct design of uAPI. Is there something similar in the existing uAPIs?
And it might be hard to work for virtio devices.
NOTE: vendors are free to add attributes in this part with a restriction that this attribute is able to be configured with the same name in sysfs too. e.g.
Sysfs works well for common attributes belongs to a class, but I'm not sure it can work well for device/vendor specific attributes. Does this mean mgmt need to iterate all the attributes in both src and dst?
no. just attributes under migration directory.
for aggregator, there must be a sysfs attribute in device node /sys/devices/pci0000:00/0000:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator, so that the userspace tool is able to configure the target device according to source device's aggregator attribute.
So basically two questions:
- how hard to standardize sysfs API for dealing with compatibility check (to
make it work for most types of devices)
sorry, I just know we are in the process of standardizing of it :)
It's not easy. As I said, the current design can't work for virtio devices and it's not hard to find other examples. I remember some Intel devices have bitmask based capability registers.
some Intel devices have bitmask based capability registers. so what?
You should at least make the proposed API work for your (Intel) own devices.
we have defined pci_id to identify the devices. even two different devices have equal PCI IDs, we still allow them to add vendor specific fields. e.g. for QAT, they can add alg_set to identify hardware supported algorithms.
Well, the point is to make sure the API does not work only for some specific devices. If we agree on this, we need to look for what is missing instead.
- how hard for the mgmt to learn with a vendor specific attributes (vs
existing management API)
what is existing management API?
It depends on the type of devices. E.g for NVME, we've already had one (/sys/kernel/config/nvme)?
if the device is binding to vfio or vfio-mdev, I believe this interface is not there.
So you want to duplicate some APIs with existing NVME ones?
Thanks
Thanks Yan
On Wed, 19 Aug 2020 17:28:38 +0800 Jason Wang <jasowang@redhat.com> wrote:
On 2020/8/19 4:13 PM, Yan Zhao wrote:
On Wed, Aug 19, 2020 at 03:39:50PM +0800, Jason Wang wrote:
On 2020/8/19 2:59 PM, Yan Zhao wrote:
On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote:
On 2020/8/19 11:30 AM, Yan Zhao wrote:
hi All, could we decide that sysfs is the interface that every VFIO vendor driver needs to provide in order to support vfio live migration, otherwise the userspace management tool would not list the device into the compatible list?
if that's true, let's move to the standardizing of the sysfs interface.
(1) content
common part: (must)
- software_version: (in major.minor.bugfix scheme)
This can not work for devices whose features can be negotiated/advertised independently. (E.g virtio devices)
I thought the 'software_version' was supposed to describe kind of a 'protocol version' for the data we transmit? I.e., you add a new field, you bump the version number.
sorry, I don't understand here, why virtio devices need to use vfio interface?
I don't see any reason that virtio devices can't be used by VFIO. Do you?
Actually, virtio devices have been used by VFIO for many years:
- passthrough a hardware virtio devices to userspace(VM) drivers
- using virtio PMD inside guest
So, what's different for it vs passing through a physical hardware via VFIO?
The difference is in the guest, the device could be either real hardware or emulated ones.
even though the features are negotiated dynamically, could you explain why it would cause software_version not work?
Virtio device 1 supports feature A, B, C
Virtio device 2 supports feature B, C, D
So you can't migrate a guest from device 1 to device 2. And it's impossible to model the features with versions.
We're talking about the features offered by the device, right? Would it be sufficient to mandate that the target device supports the same features or a superset of the features supported by the source device?
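A minimal sketch of that "same features or a superset" rule, assuming device features can be modeled as plain sets of names. The function name `target_covers_source` is hypothetical, and A to D are the placeholder feature names from the example above, not real virtio feature bits.

```python
def target_covers_source(source_features, target_features):
    """True if the target offers every feature the source offers."""
    return set(source_features) <= set(target_features)


# The two devices from the example in this thread.
DEVICE_1 = {"A", "B", "C"}
DEVICE_2 = {"B", "C", "D"}

if __name__ == "__main__":
    # Device 2 lacks A, so a guest using device 1 cannot move to it.
    print(target_covers_source(DEVICE_1, DEVICE_2))
    # A device offering only B and C could move to device 2.
    print(target_covers_source({"B", "C"}, DEVICE_2))
```

Neither device in the example is a superset of the other, which is why no linear version number can order them.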
I think this thread is discussing about vfio related devices.
- device_api: vfio-pci or vfio-ccw ...
- type: mdev type for mdev device or a signature for physical device which is a counterpart for mdev type.
device api specific part: (must) - pci id: pci id of mdev parent device or pci id of physical pci device (device_api is vfio-pci)API here.
So this assumes a PCI device which is probably not true.
for device_api of vfio-pci, why it's not true?
for vfio-ccw, it's subchannel_type.
Ok but having two different attributes for the same file is not good idea. How mgmt know there will be a 3rd type?
that's why some attributes need to be common. e.g.
device_api: it's common because mgmt needs to know whether it's a pci device or a ccw device, and the api type is already defined in vfio.h. (The field is agreed by and actually suggested by Alex in previous mail)
type: mdev_type for mdev. if mgmt does not understand it, it would not be able to create one compatible mdev device.
software_version: mgmt can compare the major and minor if it understands these fields.
I think it would be helpful if you can describe how mgmt is expected to work step by step with the proposed sysfs API. This can help people to understand.
My proposal would be:
- check that device_api matches
- check possible device_api specific attributes
- check that type matches [I don't think the combination of mdev types and another attribute to determine compatibility is a good idea; actually, the current proposal confuses me every time I look at it]
- check that software_version is compatible, assuming semantic versioning
- check possible type-specific attributes
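The five checks in this proposal could be sketched roughly as below. This is not an agreed algorithm: `check_compat` and `_major_minor` are hypothetical names, attributes are assumed to be already read into plain dictionaries, and the version step assumes that the target must have the same major and an equal or newer minor than the source.

```python
def _major_minor(version):
    """Split 'major.minor.bugfix' and return (major, minor) as ints."""
    major, minor, _bugfix = (int(part) for part in version.split("."))
    return major, minor


def check_compat(src, dst, api_specific=(), type_specific=()):
    """Return (ok, failed_step) for migrating from src to dst.

    src/dst are dicts of attribute name -> string value.
    api_specific / type_specific list the extra attribute names that
    must match for this device_api and this type, respectively.
    """
    if src["device_api"] != dst["device_api"]:
        return False, "device_api"
    for key in api_specific:
        if src.get(key) != dst.get(key):
            return False, key
    if src["type"] != dst["type"]:
        return False, "type"
    s_major, s_minor = _major_minor(src["software_version"])
    d_major, d_minor = _major_minor(dst["software_version"])
    if s_major != d_major or d_minor < s_minor:
        return False, "software_version"
    for key in type_specific:
        if src.get(key) != dst.get(key):
            return False, key
    return True, None
```

Returning the name of the failed step is a design choice for the sketch; it lets a management tool report why two devices were rejected.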
Thanks for the patience. Since sysfs is uABI, when accepted, we need support it forever. That's why we need to be careful.
Nod.
(...)
On 2020/8/20 8:27 PM, Cornelia Huck wrote:
On Wed, 19 Aug 2020 17:28:38 +0800 Jason Wang <jasowang@redhat.com> wrote:
On 2020/8/19 4:13 PM, Yan Zhao wrote:
On Wed, Aug 19, 2020 at 03:39:50PM +0800, Jason Wang wrote:
On 2020/8/19 2:59 PM, Yan Zhao wrote:
On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote:
On 2020/8/19 11:30 AM, Yan Zhao wrote:
> hi All,
> could we decide that sysfs is the interface that every VFIO vendor driver
> needs to provide in order to support vfio live migration, otherwise the
> userspace management tool would not list the device into the compatible
> list?
>
> if that's true, let's move to the standardizing of the sysfs interface.
> (1) content
> common part: (must)
> - software_version: (in major.minor.bugfix scheme)
This can not work for devices whose features can be negotiated/advertised independently. (E.g virtio devices)
I thought the 'software_version' was supposed to describe kind of a 'protocol version' for the data we transmit? I.e., you add a new field, you bump the version number.
Ok, but since we mandate backward compatibility of uABI, is it really worth having a version for sysfs? (Searching on sysfs shows no examples like this)
sorry, I don't understand here, why virtio devices need to use vfio interface?
I don't see any reason that virtio devices can't be used by VFIO. Do you?
Actually, virtio devices have been used by VFIO for many years:
- passthrough a hardware virtio devices to userspace(VM) drivers
- using virtio PMD inside guest
So, what's different for it vs passing through a physical hardware via VFIO?
The difference is in the guest, the device could be either real hardware or emulated ones.
even though the features are negotiated dynamically, could you explain why it would cause software_version not work?
Virtio device 1 supports feature A, B, C
Virtio device 2 supports feature B, C, D
So you can't migrate a guest from device 1 to device 2. And it's impossible to model the features with versions.
We're talking about the features offered by the device, right? Would it be sufficient to mandate that the target device supports the same features or a superset of the features supported by the source device?
Yes.
I think this thread is discussing about vfio related devices.
> - device_api: vfio-pci or vfio-ccw ...
> - type: mdev type for mdev device or
>   a signature for physical device which is a counterpart for
>   mdev type.
>
> device api specific part: (must)
> - pci id: pci id of mdev parent device or pci id of physical pci
>   device (device_api is vfio-pci)API here.
So this assumes a PCI device which is probably not true.
for device_api of vfio-pci, why it's not true?
for vfio-ccw, it's subchannel_type.
Ok but having two different attributes for the same file is not good idea. How mgmt know there will be a 3rd type?
that's why some attributes need to be common. e.g. device_api: it's common because mgmt need to know it's a pci device or a ccw device. and the api type is already defined vfio.h. (The field is agreed by and actually suggested by Alex in previous mail) type: mdev_type for mdev. if mgmt does not understand it, it would not be able to create one compatible mdev device. software_version: mgmt can compare the major and minor if it understands this fields.
I think it would be helpful if you can describe how mgmt is expected to work step by step with the proposed sysfs API. This can help people to understand.
My proposal would be:
- check that device_api matches
- check possible device_api specific attributes
- check that type matches [I don't think the combination of mdev types and another attribute to determine compatibility is a good idea;
Any reason for this? Actually if we only use mdev type to detect the compatibility, it would be much easier. Otherwise, we are actually re-inventing mdev types.
E.g can we have the same mdev types with different device_api and other attributes?
actually, the current proposal confuses me every time I look at it]
- check that software_version is compatible, assuming semantic versioning
- check possible type-specific attributes
I'm not sure if this is too complicated. And I suspect there will be vendor specific attributes:
- for compatibility check: I think we should either model everything via mdev type or make it totally vendor specific. Having something in the middle will bring a lot of burden
- for provisioning: it's still not clear. As shown in this proposal, for NVME we may need to set remote_url, but unless there will be a subclass (NVME) in the mdev (which I guess not), we can't prevent a vendor from using another attribute name; in this case, tricks like attribute iteration in some sub directory won't work. So even if we had some common API for the compatibility check, the provisioning API is still vendor specific ...
Thanks
Thanks for the patience. Since sysfs is uABI, when accepted, we need support it forever. That's why we need to be careful.
Nod.
(...)
On Fri, 21 Aug 2020 11:14:41 +0800 Jason Wang <jasowang@redhat.com> wrote:
On 2020/8/20 8:27 PM, Cornelia Huck wrote:
On Wed, 19 Aug 2020 17:28:38 +0800 Jason Wang <jasowang@redhat.com> wrote:
On 2020/8/19 4:13 PM, Yan Zhao wrote:
On Wed, Aug 19, 2020 at 03:39:50PM +0800, Jason Wang wrote:
On 2020/8/19 2:59 PM, Yan Zhao wrote:
On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote:
> On 2020/8/19 11:30 AM, Yan Zhao wrote:
>> hi All,
>> could we decide that sysfs is the interface that every VFIO vendor driver
>> needs to provide in order to support vfio live migration, otherwise the
>> userspace management tool would not list the device into the compatible
>> list?
>>
>> if that's true, let's move to the standardizing of the sysfs interface.
>> (1) content
>> common part: (must)
>> - software_version: (in major.minor.bugfix scheme)
> This can not work for devices whose features can be negotiated/advertised
> independently. (E.g virtio devices)
I thought the 'software_version' was supposed to describe kind of a 'protocol version' for the data we transmit? I.e., you add a new field, you bump the version number.
Ok, but since we mandate backward compatibility of uABI, is this really worth to have a version for sysfs? (Searching on sysfs shows no examples like this)
I was not thinking about the sysfs interface, but rather about the data that is sent over while migrating. E.g. we find out that sending some auxiliary data is a good idea and bump to version 1.1.0; version 1.0.0 cannot deal with the extra data, but version 1.1.0 can deal with the older data stream.
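To make the 1.0.0 versus 1.1.0 example concrete, here is a small sketch of a version check on the migration data stream. It assumes semantic-versioning-style rules that the thread has not fixed: the major number must match, the receiver's minor must be at least the sender's, and the bugfix number is ignored. `parse_version` and `can_accept` are hypothetical names.

```python
def parse_version(text):
    """Parse 'major.minor.bugfix' into a tuple of three ints."""
    parts = text.strip().split(".")
    if len(parts) != 3:
        raise ValueError(f"expected major.minor.bugfix, got {text!r}")
    return tuple(int(part) for part in parts)


def can_accept(stream_version, receiver_version):
    """True if a receiver at receiver_version can consume a data
    stream produced at stream_version (assumed rule, see above)."""
    s_major, s_minor, _ = parse_version(stream_version)
    r_major, r_minor, _ = parse_version(receiver_version)
    return s_major == r_major and s_minor <= r_minor


if __name__ == "__main__":
    # 1.1.0 can deal with the older 1.0.0 stream ...
    print(can_accept("1.0.0", "1.1.0"))
    # ... but 1.0.0 cannot deal with the extra data sent by 1.1.0.
    print(can_accept("1.1.0", "1.0.0"))
```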
(...)
>> - device_api: vfio-pci or vfio-ccw ...
>> - type: mdev type for mdev device or
>>   a signature for physical device which is a counterpart for
>>   mdev type.
>>
>> device api specific part: (must)
>> - pci id: pci id of mdev parent device or pci id of physical pci
>>   device (device_api is vfio-pci)API here.
> So this assumes a PCI device which is probably not true.
>
for device_api of vfio-pci, why it's not true?
for vfio-ccw, it's subchannel_type.
Ok but having two different attributes for the same file is not good idea. How mgmt know there will be a 3rd type?
that's why some attributes need to be common. e.g. device_api: it's common because mgmt need to know it's a pci device or a ccw device. and the api type is already defined vfio.h. (The field is agreed by and actually suggested by Alex in previous mail) type: mdev_type for mdev. if mgmt does not understand it, it would not be able to create one compatible mdev device. software_version: mgmt can compare the major and minor if it understands this fields.
I think it would be helpful if you can describe how mgmt is expected to work step by step with the proposed sysfs API. This can help people to understand.
My proposal would be:
- check that device_api matches
- check possible device_api specific attributes
- check that type matches [I don't think the combination of mdev types and another attribute to determine compatibility is a good idea;
Any reason for this? Actually if we only use mdev type to detect the compatibility, it would be much easier. Otherwise, we are actually re-inventing mdev types.
E.g can we have the same mdev types with different device_api and other attributes?
In the end, the mdev type is represented as a string; but I'm not sure we can expect that two types with the same name, but a different device_api are related in any way.
If we e.g. compare vfio-pci and vfio-ccw, they are fundamentally different.
I was mostly concerned about the aggregation proposal, where type A + aggregation value b might be compatible with type B + aggregation value a.
actually, the current proposal confuses me every time I look at it]
- check that software_version is compatible, assuming semantic versioning
- check possible type-specific attributes
I'm not sure if this is too complicated. And I suspect there will be vendor specific attributes:
- for compatibility check: I think we should either model everything via mdev type or make it totally vendor specific. Having something in the middle will bring a lot of burden
FWIW, I'm for a strict match on mdev type, and flexibility in per-type attributes.
- for provisioning: it's still not clear. As shown in this proposal, for NVME we may need to set remote_url, but unless there will be a subclass (NVME) in the mdev (which I guess not), we can't prevent a vendor from using another attribute name; in this case, tricks like attribute iteration in some sub directory won't work. So even if we had some common API for the compatibility check, the provisioning API is still vendor specific ...
Yes, I'm not sure how to deal with the "same thing for different vendors" problem. We can try to make sure that in-kernel drivers play nicely, but not much more.
On 2020/8/21 10:52 PM, Cornelia Huck wrote:
On Fri, 21 Aug 2020 11:14:41 +0800 Jason Wang <jasowang@redhat.com> wrote:
On 2020/8/20 8:27 PM, Cornelia Huck wrote:
On Wed, 19 Aug 2020 17:28:38 +0800 Jason Wang <jasowang@redhat.com> wrote:
On 2020/8/19 4:13 PM, Yan Zhao wrote:
On Wed, Aug 19, 2020 at 03:39:50PM +0800, Jason Wang wrote:
On 2020/8/19 2:59 PM, Yan Zhao wrote:
> On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote:
>> On 2020/8/19 11:30 AM, Yan Zhao wrote:
>>> hi All,
>>> could we decide that sysfs is the interface that every VFIO vendor driver
>>> needs to provide in order to support vfio live migration, otherwise the
>>> userspace management tool would not list the device into the compatible
>>> list?
>>>
>>> if that's true, let's move to the standardizing of the sysfs interface.
>>> (1) content
>>> common part: (must)
>>> - software_version: (in major.minor.bugfix scheme)
>> This can not work for devices whose features can be negotiated/advertised
>> independently. (E.g virtio devices)
I thought the 'software_version' was supposed to describe kind of a 'protocol version' for the data we transmit? I.e., you add a new field, you bump the version number.
Ok, but since we mandate backward compatibility of uABI, is this really worth to have a version for sysfs? (Searching on sysfs shows no examples like this)
I was not thinking about the sysfs interface, but rather about the data that is sent over while migrating. E.g. we find out that sending some auxiliary data is a good idea and bump to version 1.1.0; version 1.0.0 cannot deal with the extra data, but version 1.1.0 can deal with the older data stream.
(...)
Well, I think what data to transmit during migration is the duty of qemu not kernel. And I suspect the idea of reading opaque data (with version) from kernel and transmit them to dest is the best approach.
>>> - device_api: vfio-pci or vfio-ccw ...
>>> - type: mdev type for mdev device or
>>>   a signature for physical device which is a counterpart for
>>>   mdev type.
>>>
>>> device api specific part: (must)
>>> - pci id: pci id of mdev parent device or pci id of physical pci
>>>   device (device_api is vfio-pci)API here.
>> So this assumes a PCI device which is probably not true.
>>
> for device_api of vfio-pci, why it's not true?
>
> for vfio-ccw, it's subchannel_type.
Ok but having two different attributes for the same file is not good idea. How mgmt know there will be a 3rd type?
that's why some attributes need to be common. e.g. device_api: it's common because mgmt need to know it's a pci device or a ccw device. and the api type is already defined vfio.h. (The field is agreed by and actually suggested by Alex in previous mail) type: mdev_type for mdev. if mgmt does not understand it, it would not be able to create one compatible mdev device. software_version: mgmt can compare the major and minor if it understands this fields.
I think it would be helpful if you can describe how mgmt is expected to work step by step with the proposed sysfs API. This can help people to understand.
My proposal would be:
- check that device_api matches
- check possible device_api specific attributes
- check that type matches [I don't think the combination of mdev types and another attribute to determine compatibility is a good idea;
Any reason for this? Actually if we only use mdev type to detect the compatibility, it would be much easier. Otherwise, we are actually re-inventing mdev types.
E.g can we have the same mdev types with different device_api and other attributes?
In the end, the mdev type is represented as a string; but I'm not sure we can expect that two types with the same name, but a different device_api are related in any way.
If we e.g. compare vfio-pci and vfio-ccw, they are fundamentally different.
I was mostly concerned about the aggregation proposal, where type A + aggregation value b might be compatible with type B + aggregation value a.
Yes, that looks pretty complicated.
actually, the current proposal confuses me every time I look at it]
- check that software_version is compatible, assuming semantic versioning
- check possible type-specific attributes
I'm not sure if this is too complicated. And I suspect there will be vendor specific attributes:
- for compatibility check: I think we should either model everything via mdev type or make it totally vendor specific. Having something in the middle will bring a lot of burden
FWIW, I'm for a strict match on mdev type, and flexibility in per-type attributes.
I'm not sure whether the above flexibility can work better than encoding them into the mdev type. If we really want ultra flexibility, we need to make the compatibility check totally vendor specific.
- for provisioning: it's still not clear. As shown in this proposal, for NVME we may need to set remote_url, but unless there will be a subclass (NVME) in the mdev (which I guess not), we can't prevent a vendor from using another attribute name; in this case, tricks like attribute iteration in some sub directory won't work. So even if we had some common API for the compatibility check, the provisioning API is still vendor specific ...
Yes, I'm not sure how to deal with the "same thing for different vendors" problem. We can try to make sure that in-kernel drivers play nicely, but not much more.
Then it's actually a subclass of mdev I guess in the future.
Thanks
On Wed, 19 Aug 2020 11:30:35 +0800 Yan Zhao yan.y.zhao@intel.com wrote:
On Tue, Aug 18, 2020 at 09:39:24AM +0000, Parav Pandit wrote:
Hi Cornelia,
From: Cornelia Huck <cohuck@redhat.com>
Sent: Tuesday, August 18, 2020 3:07 PM
To: Daniel P. Berrangé <berrange@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>; Yan Zhao <yan.y.zhao@intel.com>; kvm@vger.kernel.org; libvir-list@redhat.com; qemu-devel@nongnu.org; Kirti Wankhede <kwankhede@nvidia.com>; eauger@redhat.com; xin-ran.wang@intel.com; corbet@lwn.net; openstack-discuss@lists.openstack.org; shaohe.feng@intel.com; kevin.tian@intel.com; Parav Pandit <parav@mellanox.com>; jian-feng.ding@intel.com; dgilbert@redhat.com; zhenyuw@linux.intel.com; hejie.xu@intel.com; bao.yumeng@zte.com.cn; Alex Williamson <alex.williamson@redhat.com>; eskultet@redhat.com; smooney@redhat.com; intel-gvt-dev@lists.freedesktop.org; Jiri Pirko <jiri@mellanox.com>; dinechin@redhat.com; devel@ovirt.org
Subject: Re: device compatibility interface for live migration with assigned devices
On Tue, 18 Aug 2020 10:16:28 +0100 Daniel P. Berrangé <berrange@redhat.com> wrote:
On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote:
On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
On 2020/8/14 1:16 PM, Yan Zhao wrote:
On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
On 2020/8/10 3:46 PM, Yan Zhao wrote:
we actually can also retrieve the same information through sysfs, e.g.

|- [path to device]
  |--- migration
  |     |--- self
  |     |     |---device_api
  |     |     |---mdev_type
  |     |     |---software_version
  |     |     |---device_id
  |     |     |---aggregator
  |     |--- compatible
  |     |     |---device_api
  |     |     |---mdev_type
  |     |     |---software_version
  |     |     |---device_id
  |     |     |---aggregator
Yes but:
- You need one file per attribute (one syscall for one attribute)
- Attribute is coupled with kobject
Is that really that bad? You have the device with an embedded kobject anyway, and you can just put things into an attribute group?
[Also, I think that self/compatible split in the example makes things needlessly complex. Shouldn't semantic versioning and matching already cover nearly everything? I would expect very few cases that are more complex than that. Maybe the aggregation stuff, but I don't think we need that self/compatible split for that, either.]
All of above seems unnecessary.
Another point, as we discussed in another thread, it's really hard to make sure the above API work for all types of devices and frameworks. So having a vendor specific API looks much better.
From the POV of userspace mgmt apps doing device compat checking / migration, we certainly do NOT want to use different vendor specific APIs. We want to have an API that can be used / controlled in a standard manner across vendors.
Yes, but it could be hard. E.g vDPA will choose to use devlink (there's a long debate on sysfs vs devlink). So if we go with sysfs, at least two APIs need to be supported ...
NB, I was not questioning devlink vs sysfs directly. If devlink is related to netlink, I can't say I'm enthusiastic as IMKE sysfs is easier to deal with. I don't know enough about devlink to have much of an opinion though.
The key point was that I don't want the userspace APIs we need to deal with to be vendor specific.
From what I've seen of devlink, it seems quite nice; but I understand why sysfs might be easier to deal with (especially as there's likely already a lot of code using it.)
I understand that some users would like devlink because it is already widely used for network drivers (and some others), but I don't think the majority of devices used with vfio are network (although certainly a lot of them are.)
What I care about is that we have a *standard* userspace API for performing device compatibility checking / state migration, for use by QEMU/libvirt/ OpenStack, such that we can write code without countless vendor specific code paths.
If there is vendor specific stuff on the side, that's fine as we can ignore that, but the core functionality for device compat / migration needs to be standardized.
To summarize:
- choose one of sysfs or devlink
- have a common interface, with a standardized way to add vendor-specific attributes
?
Please refer to my previous email which has more example and details.
hi Parav, the example is based on a new vdpa tool running over netlink, not based on devlink, right? For vfio migration compatibility, we have to deal with both mdev and physical pci devices, I don't think it's a good idea to write a new tool for it, given we are able to retrieve the same info from sysfs and there's already an mdevctl from Alex (https://github.com/mdevctl/mdevctl).
hi All, could we decide that sysfs is the interface that every VFIO vendor driver needs to provide in order to support vfio live migration, otherwise the userspace management tool would not list the device into the compatible list?
if that's true, let's move to the standardizing of the sysfs interface. (1) content common part: (must)
- software_version: (in major.minor.bugfix scheme)
- device_api: vfio-pci or vfio-ccw ...
- type: mdev type for mdev device or a signature for physical device which is a counterpart for mdev type.
device api specific part: (must)
- pci id: pci id of mdev parent device or pci id of physical pci device (device_api is vfio-pci)
As noted previously, the parent PCI ID should not matter for an mdev device, if a vendor has a dependency on matching the parent device PCI ID, that's a vendor specific restriction. An mdev device can also expose a vfio-pci device API without the parent device being PCI. For a physical PCI device, shouldn't the PCI ID be encompassed in the signature? Thanks,
Alex
- subchannel_type (device_api is vfio-ccw)
vendor driver specific part: (optional)
- aggregator
- chpid_type
- remote_url
NOTE: vendors are free to add attributes in this part with a restriction that this attribute is able to be configured with the same name in sysfs too. e.g. for aggregator, there must be a sysfs attribute in device node /sys/devices/pci0000:00/0000:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator, so that the userspace tool is able to configure the target device according to source device's aggregator attribute.
(2) where and structure
proposal 1:
|- [path to device]
  |--- migration
  |     |--- self
  |     |     |-software_version
  |     |     |-device_api
  |     |     |-type
  |     |     |-[pci_id or subchannel_type]
  |     |     |-<aggregator or chpid_type>
  |     |--- compatible
  |     |     |-software_version
  |     |     |-device_api
  |     |     |-type
  |     |     |-[pci_id or subchannel_type]
  |     |     |-<aggregator or chpid_type>
multiple compatible is allowed.
attributes should be ASCII text files, preferably with only one value per file.
proposal 2: use bin_attribute.
|- [path to device]
  |--- migration
  |     |--- self
  |     |--- compatible

so we can continue to use the multiline format. e.g.
cat compatible
  software_version=0.1.0
  device_api=vfio_pci
  type=i915-GVTg_V5_{val1:int:1,2,4,8}
  pci_id=80865963
  aggregator={val1}/2
Thanks Yan
On Wed, Aug 19, 2020 at 11:50:21AM -0600, Alex Williamson wrote: <...>
What I care about is that we have a *standard* userspace API for performing device compatibility checking / state migration, for use by QEMU/libvirt/ OpenStack, such that we can write code without countless vendor specific code paths.
If there is vendor specific stuff on the side, that's fine as we can ignore that, but the core functionality for device compat / migration needs to be standardized.
To summarize:
- choose one of sysfs or devlink
- have a common interface, with a standardized way to add vendor-specific attributes
?
Please refer to my previous email which has more example and details.
hi Parav, the example is based on a new vdpa tool running over netlink, not based on devlink, right? For vfio migration compatibility, we have to deal with both mdev and physical pci devices, I don't think it's a good idea to write a new tool for it, given we are able to retrieve the same info from sysfs and there's already an mdevctl from Alex (https://github.com/mdevctl/mdevctl).
hi All, could we decide that sysfs is the interface that every VFIO vendor driver needs to provide in order to support vfio live migration, otherwise the userspace management tool would not list the device into the compatible list?
if that's true, let's move to the standardizing of the sysfs interface. (1) content common part: (must)
- software_version: (in major.minor.bugfix scheme)
- device_api: vfio-pci or vfio-ccw ...
- type: mdev type for mdev device or a signature for physical device which is a counterpart for mdev type.
device api specific part: (must)
- pci id: pci id of mdev parent device or pci id of physical pci device (device_api is vfio-pci)
As noted previously, the parent PCI ID should not matter for an mdev device, if a vendor has a dependency on matching the parent device PCI ID, that's a vendor specific restriction. An mdev device can also expose a vfio-pci device API without the parent device being PCI. For a physical PCI device, shouldn't the PCI ID be encompassed in the signature? Thanks,
you are right. I need to put the PCI ID as a vendor specific field. I didn't do that because I wanted all fields in the vendor specific part to be configurable by management tools, so they can configure the target device according to the value of a vendor specific field even if they don't know the meaning of the field. But maybe they can just ignore the field when they can't find a matching writable field to configure the target.
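A rough sketch of that "configure what is writable, ignore the rest" idea, for illustration only. `configure_target` and `demo` are hypothetical names, the vendor attributes are assumed to be already parsed into a dictionary, and the demo uses a temporary directory standing in for the target device's sysfs node.

```python
import os
import tempfile


def configure_target(vendor_attrs, target_dir):
    """Write each vendor attribute to a same-named writable file in
    target_dir; collect the names that have no such file."""
    applied, ignored = [], []
    for name, value in vendor_attrs.items():
        path = os.path.join(target_dir, name)
        if os.path.isfile(path) and os.access(path, os.W_OK):
            with open(path, "w", encoding="ascii") as f:
                f.write(value)
            applied.append(name)
        else:
            ignored.append(name)
    return applied, ignored


def demo():
    """Fake target node exposing only a writable 'aggregator' file."""
    with tempfile.TemporaryDirectory() as target_dir:
        agg_path = os.path.join(target_dir, "aggregator")
        with open(agg_path, "w", encoding="ascii") as f:
            f.write("1")
        # Example source values; pci_id has no writable counterpart.
        source = {"aggregator": "2", "pci_id": "80865963"}
        applied, ignored = configure_target(source, target_dir)
        with open(agg_path, "r", encoding="ascii") as f:
            return applied, ignored, f.read()


if __name__ == "__main__":
    print(demo())
```

Whether silently ignored fields should still count toward compatibility is exactly the open question in this exchange; the sketch only reports them.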
Thanks Yan
On Thu, 20 Aug 2020 08:18:10 +0800 Yan Zhao yan.y.zhao@intel.com wrote:
On Wed, Aug 19, 2020 at 11:50:21AM -0600, Alex Williamson wrote: <...>
What I care about is that we have a *standard* userspace API for performing device compatibility checking / state migration, for use by QEMU/libvirt/ OpenStack, such that we can write code without countless vendor specific code paths.
If there is vendor specific stuff on the side, that's fine as we can ignore that, but the core functionality for device compat / migration needs to be standardized.
To summarize:
- choose one of sysfs or devlink
- have a common interface, with a standardized way to add vendor-specific attributes
?
Please refer to my previous email which has more example and details.
hi Parav, the example is based on a new vdpa tool running over netlink, not based on devlink, right? For vfio migration compatibility, we have to deal with both mdev and physical pci devices, I don't think it's a good idea to write a new tool for it, given we are able to retrieve the same info from sysfs and there's already an mdevctl from Alex (https://github.com/mdevctl/mdevctl).
hi All, could we decide that sysfs is the interface that every VFIO vendor driver needs to provide in order to support vfio live migration, otherwise the userspace management tool would not list the device into the compatible list?
if that's true, let's move to the standardizing of the sysfs interface. (1) content common part: (must)
- software_version: (in major.minor.bugfix scheme)
- device_api: vfio-pci or vfio-ccw ...
- type: mdev type for mdev device or a signature for physical device which is a counterpart for mdev type.
device api specific part: (must)
- pci id: pci id of mdev parent device or pci id of physical pci device (device_api is vfio-pci)
As noted previously, the parent PCI ID should not matter for an mdev device, if a vendor has a dependency on matching the parent device PCI ID, that's a vendor specific restriction. An mdev device can also expose a vfio-pci device API without the parent device being PCI. For a physical PCI device, shouldn't the PCI ID be encompassed in the signature? Thanks,
you are right. I need to put the PCI ID as a vendor specific field. I didn't do that because I wanted all fields in vendor specific to be configurable by management tools, so they can configure the target device according to the value of a vendor specific field even they don't know the meaning of the field. But maybe they can just ignore the field when they can't find a matching writable field to configure the target.
If fields can be ignored, what's the point of reporting them? Seems it's no longer a requirement. Thanks,
Alex
- subchannel_type (device_api is vfio-ccw)
vendor driver specific part: (optional)
- aggregator
- chpid_type
- remote_url
NOTE: vendors are free to add attributes in this part with a restriction that this attribute is able to be configured with the same name in sysfs too. e.g. for aggregator, there must be a sysfs attribute in device node /sys/devices/pci0000:00/0000:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator, so that the userspace tool is able to configure the target device according to source device's aggregator attribute.
(2) where and structure
proposal 1:
|- [path to device]
|--- migration
|    |--- self
|    |    |-software_version
|    |    |-device_api
|    |    |-type
|    |    |-[pci_id or subchannel_type]
|    |    |-<aggregator or chpid_type>
|    |--- compatible
|    |    |-software_version
|    |    |-device_api
|    |    |-type
|    |    |-[pci_id or subchannel_type]
|    |    |-<aggregator or chpid_type>
multiple compatible is allowed. attributes should be ASCII text files, preferably with only one value per file.
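A minimal sketch of how a userspace tool might consume proposal 1, assuming the layout above with one ASCII value per file. The function names are hypothetical, and the fake directory built by the helper only stands in for a real sysfs node:

```python
import os
import tempfile


def read_attr_group(group_dir):
    """Return {attribute name: value} for every regular file in group_dir."""
    attrs = {}
    for name in sorted(os.listdir(group_dir)):
        path = os.path.join(group_dir, name)
        if os.path.isfile(path):
            with open(path, "r", encoding="ascii") as f:
                attrs[name] = f.read().strip()
    return attrs


def make_fake_group(root, group, values):
    """Helper for experiments: lay out <root>/migration/<group>/<attr> files."""
    group_dir = os.path.join(root, "migration", group)
    os.makedirs(group_dir)
    for name, value in values.items():
        with open(os.path.join(group_dir, name), "w", encoding="ascii") as f:
            f.write(value + "\n")
    return group_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        self_dir = make_fake_group(
            root,
            "self",
            {"software_version": "0.1.0", "device_api": "vfio-pci"},
        )
        print(read_attr_group(self_dir))
```

Each attribute costs one open/read/close, which is the per-attribute syscall overhead raised elsewhere in this thread.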
proposal 2: use bin_attribute.
|- [path to device]
|--- migration
|    |--- self
|    |--- compatible

so we can continue to use the multiline format. e.g.
cat compatible
software_version=0.1.0
device_api=vfio_pci
type=i915-GVTg_V5_{val1:int:1,2,4,8}
pci_id=80865963
aggregator={val1}/2
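For proposal 2, a management tool would have to split the multiline blob itself. The sketch below only separates keys from values; it deliberately does not interpret the `{val1:int:1,2,4,8}` template syntax, whose semantics are not pinned down here. The function name is hypothetical:

```python
def parse_compat_blob(text):
    """Split 'key=value' lines into a dict, leaving values uninterpreted."""
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"malformed line: {line!r}")
        attrs[key.strip()] = value.strip()
    return attrs


# The example content of the `compatible` file from the mail above.
EXAMPLE = """\
software_version=0.1.0
device_api=vfio_pci
type=i915-GVTg_V5_{val1:int:1,2,4,8}
pci_id=80865963
aggregator={val1}/2
"""

if __name__ == "__main__":
    for key, value in parse_compat_blob(EXAMPLE).items():
        print(key, "->", value)
```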
Thanks Yan
On Wed, Aug 19, 2020 at 09:13:45PM -0600, Alex Williamson wrote:
As noted previously, the parent PCI ID should not matter for an mdev device, if a vendor has a dependency on matching the parent device PCI ID, that's a vendor specific restriction. An mdev device can also expose a vfio-pci device API without the parent device being PCI. For a physical PCI device, shouldn't the PCI ID be encompassed in the signature? Thanks,
you are right. I need to put the PCI ID as a vendor specific field. I didn't do that because I wanted all fields in vendor specific to be configurable by management tools, so they can configure the target device according to the value of a vendor specific field even they don't know the meaning of the field. But maybe they can just ignore the field when they can't find a matching writable field to configure the target.
If fields can be ignored, what's the point of reporting them? Seems it's no longer a requirement. Thanks,
sorry about the confusion. I mean this condition: about to migrate, openstack searches if there are existing matching MDEVs, if yes, i.e. all common/vendor specific fields match, then just create a VM with the matching target MDEV. (in this condition, the PCI ID field is not ignored); if not, openstack tries to create one MDEV according to mdev_type, and configures MDEV according to the vendor specific attributes. as PCI ID is not a configurable field, it just ignore the field.
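The two branches described above could be sketched roughly as follows. Everything here is illustrative: the dictionary layout, the function name, and the use of plain equality as the "all fields match" test are assumptions, not part of any agreed interface:

```python
def plan_target(source, existing_targets, writable_vendor_fields):
    """Reuse a fully matching target MDEV, otherwise plan to create one.

    source / targets: {"common": {...}, "vendor": {...}} (hypothetical layout).
    writable_vendor_fields: vendor attribute names the target lets us configure.
    """
    wanted = (source["common"], source["vendor"])
    for target in existing_targets:
        # Branch 1: every common and vendor field matches, PCI ID included.
        if (target["common"], target["vendor"]) == wanted:
            return {"action": "reuse", "target": target["name"]}
    # Branch 2: create by mdev type and configure only writable vendor fields;
    # non-configurable ones such as pci_id are skipped.
    config = {
        key: value
        for key, value in source["vendor"].items()
        if key in writable_vendor_fields
    }
    return {
        "action": "create",
        "mdev_type": source["common"]["type"],
        "configure": config,
    }


SOURCE = {
    "common": {
        "software_version": "0.1.0",
        "device_api": "vfio-pci",
        "type": "i915-GVTg_V5_4",
    },
    "vendor": {"pci_id": "80865963", "aggregator": "2"},
}

if __name__ == "__main__":
    print(plan_target(SOURCE, [], {"aggregator"}))
```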
Thanks Yan
On 2020/8/18 5:36 PM, Cornelia Huck wrote:
On Tue, 18 Aug 2020 10:16:28 +0100 Daniel P. Berrangé berrange@redhat.com wrote:
On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote:
On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
On 2020/8/14 1:16 PM, Yan Zhao wrote:
On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
On 2020/8/10 3:46 PM, Yan Zhao wrote:
we actually can also retrieve the same information through sysfs, e.g.
|- [path to device]
|--- migration
|    |--- self
|    |    |---device_api
|    |    |---mdev_type
|    |    |---software_version
|    |    |---device_id
|    |    |---aggregator
|    |--- compatible
|    |    |---device_api
|    |    |---mdev_type
|    |    |---software_version
|    |    |---device_id
|    |    |---aggregator
Yes but:
- You need one file per attribute (one syscall for one attribute)
- Attribute is coupled with kobject
Is that really that bad? You have the device with an embedded kobject anyway, and you can just put things into an attribute group?
Yes, but all of this could be done via devlink(netlink) as well with low overhead.
[Also, I think that self/compatible split in the example makes things needlessly complex. Shouldn't semantic versioning and matching already cover nearly everything?
That's my question as well. E.g. for virtio, versioning may not even work, since some of the features are negotiated independently:
Source features: A, B, C
Dest features: A, B, C, E
We just need to make sure the dest features are a superset of the source features, and then we are all set.
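The superset rule is a one-line set comparison. A small sketch, with a hypothetical function name and the feature letters from the example:

```python
def features_compatible(source_features, dest_features):
    """True when every source feature is also offered by the destination."""
    return set(source_features) <= set(dest_features)


SOURCE_FEATURES = {"A", "B", "C"}
DEST_FEATURES = {"A", "B", "C", "E"}

if __name__ == "__main__":
    print(features_compatible(SOURCE_FEATURES, DEST_FEATURES))
    print(features_compatible(DEST_FEATURES, SOURCE_FEATURES))
```

Note that the relation is not symmetric: migrating back from the destination would lose feature E.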
I would expect very few cases that are more complex than that. Maybe the aggregation stuff, but I don't think we need that self/compatible split for that, either.]
All of above seems unnecessary.
Another point, as we discussed in another thread, it's really hard to make sure the above API work for all types of devices and frameworks. So having a vendor specific API looks much better.
From the POV of userspace mgmt apps doing device compat checking / migration, we certainly do NOT want to use different vendor specific APIs. We want to have an API that can be used / controlled in a standard manner across vendors.
Yes, but it could be hard. E.g. vDPA will choose to use devlink (there's a long debate on sysfs vs devlink). So if we go with sysfs, at least two APIs need to be supported ...
NB, I was not questioning devlink vs sysfs directly. If devlink is related to netlink, I can't say I'm enthusiastic as IMKE sysfs is easier to deal with. I don't know enough about devlink to have much of an opinion though. The key point was that I don't want the userspace APIs we need to deal with to be vendor specific.
From what I've seen of devlink, it seems quite nice; but I understand why sysfs might be easier to deal with (especially as there's likely already a lot of code using it.)
I understand that some users would like devlink because it is already widely used for network drivers (and some others), but I don't think the majority of devices used with vfio are network (although certainly a lot of them are.)
Note that though devlink could be popular only in network devices, netlink is widely used by a lot of subsystems (e.g. SCSI).
Thanks
What I care about is that we have a *standard* userspace API for performing device compatibility checking / state migration, for use by QEMU/libvirt/ OpenStack, such that we can write code without countless vendor specific code paths.
If there is vendor specific stuff on the side, that's fine as we can ignore that, but the core functionality for device compat / migration needs to be standardized.
To summarize:
- choose one of sysfs or devlink
- have a common interface, with a standardized way to add vendor-specific attributes
?
On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote:
[Also, I think that self/compatible split in the example makes things needlessly complex. Shouldn't semantic versioning and matching already cover nearly everything? I would expect very few cases that are more complex than that. Maybe the aggregation stuff, but I don't think we need that self/compatible split for that, either.]
Hi Cornelia,
The reason I want to declare compatible list of attributes is that sometimes it's not a simple 1:1 matching of source attributes and target attributes as I demonstrated below, source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible to target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2), (mdev_type i915-GVTg_V5_8 + aggregator 4)
and aggregator may be just one of such examples that 1:1 matching does not fit.
So, we explicitly list out self/compatible attributes, and management tools only need to check if the self attributes are contained in the compatible attributes.
or do you mean only the compatible list is enough, and the management tools need to find out the self list by themselves? But I think providing a self list is easier for management tools.
Thanks Yan
On Thu, 2020-08-20 at 08:39 +0800, Yan Zhao wrote:
The reason I want to declare compatible list of attributes is that sometimes it's not a simple 1:1 matching of source attributes and target attributes as I demonstrated below, source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible to target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2), (mdev_type i915-GVTg_V5_8 + aggregator 4)
The way you are doing the naming is still really confusing, by the way. If this has not already been merged in the kernel, can you change the mdev so that mdev_type i915-GVTg_V5_2 is 2 of mdev_type i915-GVTg_V5_1 instead of half the device?
Currently you need to divide the aggregator by the number at the end of the mdev type to figure out how much of the physical device is being used, which is a very unfriendly API convention.
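To make the arithmetic concrete, here is a small sketch of that division. It assumes, as described above, that the trailing number N in the type name means 1/N of the physical device; the helper name is hypothetical:

```python
from fractions import Fraction


def device_share(mdev_type, aggregator):
    """Fraction of the physical device used by (mdev_type, aggregator).

    Assumes the type name ends in "_<N>", where one instance is 1/N of the
    device, so the share is aggregator / N.
    """
    denominator = int(mdev_type.rsplit("_", 1)[1])
    return Fraction(aggregator, denominator)


if __name__ == "__main__":
    for mdev_type, aggregator in [
        ("i915-GVTg_V5_2", 1),
        ("i915-GVTg_V5_4", 2),
        ("i915-GVTg_V5_8", 4),
    ]:
        print(mdev_type, aggregator, device_share(mdev_type, aggregator))
```

All three combinations from Yan's example reduce to the same share of the device, which is why they are mutually compatible even though neither the type name nor the aggregator matches 1:1.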
The way aggregators are being proposed in general is not really something I like, but I think this at least is something that should be possible to correct.
With the complexity in the mdev type name + aggregator, I suspect that this will never be supported in OpenStack Nova directly, requiring integration via Cyborg, unless we can pre-partition the device into mdevs statically and just ignore this.
This is way too vendor-specific to integrate into something like OpenStack in Nova unless we can guarantee that how aggregators work will be portable across vendors generically.
and aggregator may be just one of such examples that 1:1 matching does not fit.
For OpenStack Nova, I don't see us supporting anything beyond the 1:1 case where the mdev type does not change.
I would really prefer if there was just one mdev type that represented the minimal allocatable unit and the aggregators were used to create compositions of that, i.e. instead of i915-GVTg_V5_2 being half the device, have one mdev type i915-GVTg, and if the device supports 8 of them then we can aggregate 4 of i915-GVTg.
If you want to have multiple mdev types to model the different amounts of the resource, e.g. i915-GVTg_small and i915-GVTg_large, that is totally fine too, or even i915-GVTg_4 indicating it is 4 of i915-GVTg.
Failing that, I would just expose an mdev type per composable resource and allow us to compose them at the user level with some other construct modeling an attachment to the device, e.g. create a composed mdev or something that is an aggregation of multiple sub-resources, each of which is an mdev. So kind of like how bond ports work: we would create an mdev for each of the sub-resources and then create a bond or aggregated mdev by referencing the other mdevs by UUID, then attach only the aggregated mdev to the instance.
The current aggregator syntax and semantics, however, make me rather uncomfortable when I think about orchestrating VMs on top of it, even to boot them, let alone migrate them.
On Thu, Aug 20, 2020 at 02:29:07AM +0100, Sean Mooney wrote:
For OpenStack Nova, I don't see us supporting anything beyond the 1:1 case where the mdev type does not change.
hi Sean,
I understand it's hard for openstack, but 1:N is always meaningful.
e.g. if source device 1 has cap A, it is compatible with
device 2: cap A,
device 3: cap A+B,
device 4: cap A+B+C
....
to allow openstack to detect it correctly, in the compatible list of each device we would say
device 2: compatible cap is A;
device 3: compatible cap is A or A+B;
device 4: compatible cap is A or A+B, or A+B+C;

then if openstack finds device 1's self cap A is contained in the compatible cap of device 2/3/4, it can migrate device 1 to device 2, 3, 4.

conversely, device 1's compatible cap is only A, so it is able to migrate device 2 to device 1, and it is not able to migrate device 3/4 to device 1.
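The example can be written out as a small table plus a membership test. This is only a sketch of the rule as stated in this mail; the dictionary layout and function name are made up for illustration:

```python
A = frozenset({"A"})
AB = frozenset({"A", "B"})
ABC = frozenset({"A", "B", "C"})

# Each device advertises its own capability set ("self") and the list of
# source capability sets it can accept ("compatible").
DEVICES = {
    "device1": {"self": A, "compatible": [A]},
    "device2": {"self": A, "compatible": [A]},
    "device3": {"self": AB, "compatible": [A, AB]},
    "device4": {"self": ABC, "compatible": [A, AB, ABC]},
}


def can_migrate(source, target):
    """Source may move to target if source's self set is in target's list."""
    return DEVICES[source]["self"] in DEVICES[target]["compatible"]


if __name__ == "__main__":
    for source in DEVICES:
        for target in DEVICES:
            if source != target:
                print(source, "->", target, can_migrate(source, target))
```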
Thanks Yan
On Thu, 2020-08-20 at 12:01 +0800, Yan Zhao wrote:
Yes, we built the placement service around the idea of capabilities as traits on resource providers, which is why I originally asked if we could model compatibility with feature flags.
We can easily model devices as supporting A, A+B or A+B+C and then select hosts and devices based on that, but
the list of compatible devices you are proposing hides this feature information, which would be what we are matching on.
Giving me a set of features you want and listing the features available on each device allows higher-level orchestration to easily match the request to a host that can fulfill it, but having a set of other compatible devices does not help with that.
So if a simple list of capabilities can be advertised, and if we know that two devices with the same capability are interchangeable, that is workable. I suspect that will not be the case, however, and it would only work within a family of mdevs that are closely related, which I think again is an argument for not changing the mdev type and, at least initially, only looking at migration where the mdev type does not change.
On Thu, Aug 20, 2020 at 06:16:28AM +0100, Sean Mooney wrote:
sorry Sean, I don't understand your words completely. Please allow me to write it down in my words, and please confirm if my understanding is right. 1. you mean you agree on that each field is regarded as a trait, and openstack can compare by itself if source trait is a subset of target trait, right? e.g. source device field1=A1 field2=A2+B2 field3=A3
target device field1=A1+B1 field2=A2+B2 filed3=A3
then openstack sees that field1/2/3 in source is a subset of field1/2/3 in target, so it's migratable to target?
2. mdev_type + aggregator make it hard to achieve the above elegant solution, so it's best to avoid the combined comparing of mdev_type + aggregator. do I understand it correctly?
3. you don't like self list and compatible list, because it is hard for openstack to compare different traits? e.g. if we have self list and compatible list, then as below, openstack needs to compare if self field1/2/3 is a subset of compatible field 1/2/3.
source device: self field1=A1 self field2=A2+B2 self field3=A3
compatible field1=A1 compatible field2=A2;B2;A2+B2; compatible field3=A3
target device: self field1=A1+B1 self field2=A2+B2 self field3=A3
compatible field1=A1;B1;A1+B1; compatible field2=A2;B2;A2+B2; compatible field3=A3
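A minimal Python sketch of the self/compatible comparison in point 3, assuming `+` joins capabilities that are present together and `;` separates acceptable alternatives, as in the example above. The function names are hypothetical and nothing here is a real sysfs or OpenStack API.

```python
def parse_self(text):
    """Parse 'A2+B2' into a frozenset of capability names."""
    return frozenset(part.strip() for part in text.split("+") if part.strip())


def parse_compatible(text):
    """Parse 'A2;B2;A2+B2;' into a set of acceptable capability sets."""
    return {parse_self(alt) for alt in text.split(";") if alt.strip()}


def can_migrate(src_self, dst_compatible):
    """True when each self field of the source is accepted by the target."""
    for field, value in src_self.items():
        accepted = dst_compatible.get(field)
        if accepted is None:
            return False
        if parse_self(value) not in parse_compatible(accepted):
            return False
    return True


source = {
    "self": {"field1": "A1", "field2": "A2+B2", "field3": "A3"},
    "compatible": {"field1": "A1", "field2": "A2;B2;A2+B2;", "field3": "A3"},
}
target = {
    "self": {"field1": "A1+B1", "field2": "A2+B2", "field3": "A3"},
    "compatible": {"field1": "A1;B1;A1+B1;", "field2": "A2;B2;A2+B2;", "field3": "A3"},
}

print(can_migrate(source["self"], target["compatible"]))  # True
print(can_migrate(target["self"], source["compatible"]))  # False
```

With these values the source can move to the target, but not the other way round, because the source's compatible list for field1 only accepts A1.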
Thanks Yan
I would really prefer if there was just one mdev type that represented the minimal allocatable unit, and the aggregator was used to create compositions of that, i.e. instead of i915-GVTg_V5_2 being half the device, have one mdev type i915-GVTg, and if the device supports 8 of them then we can aggregate 4 of i915-GVTg.
If you want to have multiple mdev types to model the different amounts of the resource, e.g. i915-GVTg_small and i915-GVTg_large, that is totally fine too, or even i915-GVTg_4 indicating it is 4 of i915-GVTg.
Failing that, I would just expose an mdev type per composable resource and allow us to compose them at the user level with some other construct modelling an attachment to the device, e.g. create a composed mdev or something that is an aggregation of multiple sub-resources, each of which is an mdev. So kind of like how bond ports work: we would create an mdev for each of the sub-resources and then create a bond or aggregated mdev by referencing the other mdevs by UUID, then attach only the aggregated mdev to the instance.
The current aggregator syntax and semantics, however, make me rather uncomfortable when I think about orchestrating VMs on top of it, even to boot them, let alone migrate them.
So, we explicitly list out self/compatible attributes, and management tools only need to check if the self attributes are contained in the compatible attributes.
or do you mean only the compatible list is enough, and the management tools need to find out the self list by themselves? But I think providing a self list is easier for management tools.
Thanks Yan
On Thu, 2020-08-20 at 14:27 +0800, Yan Zhao wrote:
On Thu, Aug 20, 2020 at 06:16:28AM +0100, Sean Mooney wrote:
On Thu, 2020-08-20 at 12:01 +0800, Yan Zhao wrote:
On Thu, Aug 20, 2020 at 02:29:07AM +0100, Sean Mooney wrote:
On Thu, 2020-08-20 at 08:39 +0800, Yan Zhao wrote:
On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote:
On Tue, 18 Aug 2020 10:16:28 +0100 Daniel P. Berrangé berrange@redhat.com wrote:
On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote:
On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
On 2020/8/14 1:16 PM, Yan Zhao wrote:
On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
On 2020/8/10 3:46 PM, Yan Zhao wrote:
we actually can also retrieve the same information through sysfs, e.g.
|- [path to device]
|--- migration
|     |--- self
|     |     |---device_api
|     |     |---mdev_type
|     |     |---software_version
|     |     |---device_id
|     |     |---aggregator
|     |--- compatible
|     |     |---device_api
|     |     |---mdev_type
|     |     |---software_version
|     |     |---device_id
|     |     |---aggregator
Yes but:
- You need one file per attribute (one syscall for one attribute)
- Attribute is coupled with kobject
Is that really that bad? You have the device with an embedded kobject anyway, and you can just put things into an attribute group?
[Also, I think that self/compatible split in the example makes things needlessly complex. Shouldn't semantic versioning and matching already cover nearly everything? I would expect very few cases that are more complex than that. Maybe the aggregation stuff, but I don't think we need that self/compatible split for that, either.]
Hi Cornelia,
The reason I want to declare compatible list of attributes is that sometimes it's not a simple 1:1 matching of source attributes and target attributes as I demonstrated below, source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible to target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2), (mdev_type i915-GVTg_V5_8 + aggregator 4)
The way you are doing the naming is still really confusing, by the way. If this has not already been merged in the kernel, can you change the mdev so that mdev_type i915-GVTg_V5_2 is 2 of mdev_type i915-GVTg_V5_1 instead of half the device?
Currently you need to divide the aggregator by the number at the end of the mdev type to figure out how much of the physical device is being used, which is a very unfriendly API convention.
The way aggregators are being proposed in general is not really something I like, but I think this at least is something that should be possible to correct.
With the complexity in the mdev type name + aggregator, I suspect that this will never be supported in openstack nova directly, requiring integration via cyborg, unless we can pre-partition the device into mdevs statically and just ignore this.
This is way too vendor-specific to integrate into something like openstack in nova unless we can guarantee that how aggregators work will be portable across vendors generically.
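The division described above can be sketched in a few lines of Python. This assumes, as the thread describes for GVT-g, that the number after the last underscore in the type name is the number of slices the device is split into; `device_share` is a hypothetical helper, not part of any driver or library.

```python
from fractions import Fraction


def device_share(mdev_type, aggregator):
    """Fraction of the physical device used by an mdev of this type.

    Assumes the trailing '_N' in the type name means '1/N of the device'.
    """
    divisor = int(mdev_type.rsplit("_", 1)[1])
    return Fraction(aggregator, divisor)


# The three combinations named earlier in the thread all describe the same share.
shares = [
    device_share("i915-GVTg_V5_2", 1),
    device_share("i915-GVTg_V5_4", 2),
    device_share("i915-GVTg_V5_8", 4),
]
print(shares)
```

All three calls yield one half, which is why the thread treats those type/aggregator pairs as mutually compatible, and also why a consumer has to parse the type name to learn that.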
and aggregator may be just one of such examples that 1:1 matching does not fit.
For openstack nova, I don't see us supporting anything beyond the 1:1 case where the mdev type does not change.
hi Sean, I understand it's hard for openstack. but 1:N is always meaningful. e.g. if source device 1 has cap A, it is compatible to device 2: cap A, device 3: cap A+B, device 4: cap A+B+C .... to allow openstack to detect it correctly, in compatible list of device 2, we would say compatible cap is A; device 3, compatible cap is A or A+B; device 4, compatible cap is A or A+B, or A+B+C;
then if openstack finds device 1's self cap A is contained in compatible cap of device 2/3/4, it can migrate device 1 to device 2,3,4.
conversely, device 1's compatible cap is only A, so it is able to migrate device 2 to device 1, and it is not able to migrate device 3/4 to device 1.
Yes, we built the placement service around the idea of capabilities as traits on resource providers, which is why I originally asked if we could model compatibility with feature flags.
We can easily model devices as supporting A, A+B or A+B+C and then select hosts and devices based on that, but
the list of compatible devices you are proposing hides this feature information, which is what we would be matching on.
Giving me a list of the features you want, and listing the features available on each device, allows higher-level orchestration to easily match the request to a host that can fulfil it; having a set of other compatible devices does not help with that.
So if a simple list of capabilities can be advertised, and if we know that two devices with the same capability are interchangeable, that is workable. I suspect that will not be the case, however, and it would only work within a family of mdevs that are closely related, which I think again is an argument for not changing the mdev type and, at least initially, only looking at migration where the mdev type does not change.
sorry Sean, I don't understand your words completely. Please allow me to write it down in my words, and please confirm if my understanding is right.
- you mean you agree that each field is regarded as a trait, and openstack can compare by itself if the source trait is a subset of the target trait, right? e.g.
source device field1=A1 field2=A2+B2 field3=A3
target device field1=A1+B1 field2=A2+B2 field3=A3
then openstack sees that field1/2/3 in source is a subset of field1/2/3 in target, so it's migratable to target?
Yes, this is basically how CPU features work. If we see the host CPU on the dest is a superset of the CPU features used by the VM, we know it's safe to migrate.
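The superset rule can be sketched in Python as below, using the field values from the example just quoted. `is_migratable` and the dict layout are assumptions made for illustration; they are not the placement API.

```python
def is_migratable(source_traits, target_traits):
    """Return True when every field's source flags are a subset of the target's."""
    for field, flags in source_traits.items():
        if not set(flags) <= set(target_traits.get(field, ())):
            return False
    return True


source = {"field1": {"A1"}, "field2": {"A2", "B2"}, "field3": {"A3"}}
target = {"field1": {"A1", "B1"}, "field2": {"A2", "B2"}, "field3": {"A3"}}

print(is_migratable(source, target))  # True
print(is_migratable(target, source))  # False
```

The check is directional: the target offers B1 in field1, which the source lacks, so migrating back from target to source is rejected.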
- mdev_type + aggregator make it hard to achieve the above elegant
solution, so it's best to avoid the combined comparing of mdev_type + aggregator. do I understand it correctly?
Yes and no. One of the challenges that mdevs pose right now is that sometimes mdevs model independent resources and sometimes multiple mdev types consume the same underlying resources. There is no way for openstack to know if i915-GVTg_V5_2 and i915-GVTg_V5_4 consume the same resources or not. As such we can't do the accounting properly, so I would much prefer to have just one mdev type i915-GVTg which models the minimal allocatable unit, and then say I want 4 of them composed into 1 device, rather than have a second mdev type that does that, since
what that means in practice is we cannot trust the available_instances for a given mdev type, as consuming a different mdev type might change it. Aggregators make that problem worse, which is why I said I would prefer if, instead of aggregators as proposed, each consumable resource was reported independently as a different mdev type, and then we composed those like we would when bonding ports, creating an attachment or other logical aggregation that refers to instances of mdevs of differing types, which we expose as a single mdev to the guest. As a concrete example, we might create an aggregation of 64 CUDA cores and 32 tensor cores and "bond" or aggregate them as a single attached mdev and provide that to an ML workload guest. A different guest could request 1 instance of the nvenc video encoder and one instance of the nvenc video decoder, but no CUDA or tensor cores, for a video transcoding workload.
If each of those components is an independent mdev type and can be composed with that granularity, then I think that approach is better than the current aggregator with vendor-specific fields. We can model the physical device as being multiple nested resources with different traits for each type of resource and different capacities for the same. We can even model how many of the attachments/compositions can be done independently if there is a limit on that.
|- [parent physical device]
|--- Vendor-specific-attributes [optional]
|--- [mdev_supported_types]
|     |--- [<type-id>]
|     |   |--- create
|     |   |--- name
|     |   |--- available_instances
|     |   |--- device_api
|     |   |--- description
|     |   |--- [devices]
|     |--- [<type-id>]
|     |   |--- create
|     |   |--- name
|     |   |--- available_instances
|     |   |--- device_api
|     |   |--- description
|     |   |--- [devices]
|     |--- [<type-id>]
|          |--- create
|          |--- name
|          |--- available_instances
|          |--- device_api
|          |--- description
|          |--- [devices]
A benefit of this approach is that the mdev types would not change on migration, and we could just compare a simple version string and feature flag list to determine compatibility in a vendor-neutral way. I don't necessarily need to know what the vendor flags mean, just that the source's flags are a subset of the dest's and that the semantic version numbers say the mdevs are compatible.
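A sketch of that vendor-neutral check in Python. The compatibility rule for versions is an assumption (same major version, destination minor version not older than the source's), since the thread does not define one; the flag names `flag_a` and `flag_b` and the dict keys are likewise hypothetical.

```python
def parse_version(text):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def is_compatible(src, dst):
    """Same mdev type, compatible versions, and source flags available on dest."""
    if src["mdev_type"] != dst["mdev_type"]:
        return False
    src_major, src_minor, _ = parse_version(src["software_version"])
    dst_major, dst_minor, _ = parse_version(dst["software_version"])
    # Assumed rule: same major, and the destination is at least as new.
    if src_major != dst_major or dst_minor < src_minor:
        return False
    # The flags are opaque strings; only set containment matters.
    return set(src["flags"]) <= set(dst["flags"])


src = {"mdev_type": "i915-GVTg", "software_version": "1.0.0", "flags": {"flag_a"}}
dst = {"mdev_type": "i915-GVTg", "software_version": "1.2.0",
       "flags": {"flag_a", "flag_b"}}

print(is_compatible(src, dst))  # True
print(is_compatible(dst, src))  # False
```

The orchestrator never interprets the flags, which is the property being asked for: the comparison stays the same whichever vendor defines them.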
- you don't like self list and compatible list, because it is hard for
openstack to compare different traits? e.g. if we have self list and compatible list, then as below, openstack needs to compare if self field1/2/3 is a subset of compatible field 1/2/3.
Currently we only use mdevs for vGPUs, and in our documentation we tell customers to model the mdev_type as a trait and request it as a required trait. So for customers that are doing that today, changing mdev types is not really an option. We would prefer that they request the features they need instead of a specific mdev type, so we can select any that meets their needs. For example, we have a bunch of traits for CUDA support (https://github.com/openstack/os-traits/blob/master/os_traits/hw/gpu/cuda.py) or DirectX/Vulkan/OpenGL (https://github.com/openstack/os-traits/blob/master/os_traits/hw/gpu/api.py); these are closely analogous to CPU feature flags like AVX or SSE (https://github.com/openstack/os-traits/blob/master/os_traits/hw/cpu/x86/__in...).
So when it comes to compatibility, it would be ideal if you could express capabilities as something like a CPU feature flag; then we can easily model those as traits.
source device: self field1=A1 self field2=A2+B2 self field3=A3
compatible field1=A1 compatible field2=A2;B2;A2+B2; compatible field3=A3
target device: self field1=A1+B1 self field2=A2+B2 self field3=A3
compatible field1=A1;B1;A1+B1; compatible field2=A2;B2;A2+B2; compatible field3=A3
Thanks Yan
On Thu, Aug 20, 2020 at 02:24:26PM +0100, Sean Mooney wrote:
On Thu, 2020-08-20 at 14:27 +0800, Yan Zhao wrote:
sorry Sean, I don't understand your words completely. Please allow me to write it down in my words, and please confirm if my understanding is right.
- you mean you agree that each field is regarded as a trait, and openstack can compare by itself if the source trait is a subset of the target trait, right? e.g.
source device field1=A1 field2=A2+B2 field3=A3
target device field1=A1+B1 field2=A2+B2 field3=A3
then openstack sees that field1/2/3 in source is a subset of field1/2/3 in target, so it's migratable to target?
Yes, this is basically how CPU features work. If we see the host CPU on the dest is a superset of the CPU features used by the VM, we know it's safe to migrate.
got it. glad to know it :)
- mdev_type + aggregator make it hard to achieve the above elegant
solution, so it's best to avoid the combined comparing of mdev_type + aggregator. do I understand it correctly?
Yes and no. One of the challenges that mdevs pose right now is that sometimes mdevs model independent resources and sometimes multiple mdev types consume the same underlying resources. There is no way for openstack to know if i915-GVTg_V5_2 and i915-GVTg_V5_4 consume the same resources or not. As such we can't do the accounting properly, so I would much prefer to have just one mdev type i915-GVTg which models the minimal allocatable unit, and then say I want 4 of them composed into 1 device, rather than have a second mdev type that does that, since
what that means in practice is we cannot trust the available_instances for a given mdev type, as consuming a different mdev type might change it. Aggregators make that problem worse, which is why I said I would prefer if, instead of aggregators as proposed, each consumable resource was reported independently as a different mdev type, and then we composed those like we would when bonding ports, creating an attachment or other logical aggregation that refers to instances of mdevs of differing types, which we expose as a single mdev to the guest. As a concrete example, we might create an aggregation of 64 CUDA cores and 32 tensor cores and "bond" or aggregate them as a single attached mdev and provide that to an ML workload guest. A different guest could request 1 instance of the nvenc video encoder and one instance of the nvenc video decoder, but no CUDA or tensor cores, for a video transcoding workload.
The "bond" you described is a little different from the intention of the aggregator we introduced for scalable IOV (as explained in another mail to Cornelia: https://lists.gnu.org/archive/html/qemu-devel/2020-08/msg06523.html).
But anyway, we agree that mdevs are not compatible if mdev_types are not compatible.
If each of those components is an independent mdev type and can be composed with that granularity, then I think that approach is better than the current aggregator with vendor-specific fields. We can model the physical device as being multiple nested resources with different traits for each type of resource and different capacities for the same. We can even model how many of the attachments/compositions can be done independently if there is a limit on that.
|- [parent physical device]
|--- Vendor-specific-attributes [optional]
|--- [mdev_supported_types]
|     |--- [<type-id>]
|     |   |--- create
|     |   |--- name
|     |   |--- available_instances
|     |   |--- device_api
|     |   |--- description
|     |   |--- [devices]
|     |--- [<type-id>]
|     |   |--- create
|     |   |--- name
|     |   |--- available_instances
|     |   |--- device_api
|     |   |--- description
|     |   |--- [devices]
|     |--- [<type-id>]
|          |--- create
|          |--- name
|          |--- available_instances
|          |--- device_api
|          |--- description
|          |--- [devices]
A benefit of this approach is that the mdev types would not change on migration, and we could just compare a simple version string and feature flag list to determine compatibility in a vendor-neutral way. I don't necessarily need to know what the vendor flags mean, just that the source's flags are a subset of the dest's and that the semantic version numbers say the mdevs are compatible.
as aggregator and some other attributes are only meaningful after devices are created, and vendors' naming of mdev types is not unified, do you think the layout below is good?
|- [parent physical device]
|--- [mdev_supported_types]
|     |--- [<type-id>]
|     |   |--- create
|     |   |--- name
|     |   |--- available_instances
|     |   |--- compatible_type [must]
|     |   |--- Vendor-specific-compatible-type-attributes [optional]
|     |   |--- device_api [must]
|     |   |--- software_version [must]
|     |   |--- description
|     |   |--- [devices]
|     |   |--------[<uuid>]
|     |   |     |--- vendor-specific-compatible-device-attributes [optional]
all vendor-specific compatible attributes begin with "compatible" in their name.
in GVT's current case,
|- 0000:00:02.0
|--- mdev_supported_types
|     |--- i915-GVTg_V5_8
|     |   |--- create
|     |   |--- name
|     |   |--- available_instances
|     |   |--- compatible_type : i915-GVTg_V5_8, i915-GVTg_V4_8
|     |   |--- device_api : vfio-pci
|     |   |--- software_version : 1.0.0
|     |   |--- compatible_pci_ids : 5931, 591b
|     |   |--- description
|     |   |--- devices
|     |   |   |- 882cc4da-dede-11e7-9180-078a62063ab1
|     |   |   |   |--- aggregator : 1
|     |   |   |   |--- compatible_aggregator : 1
suppose 882cc4da-dede-11e7-9180-078a62063ab1 is a src mdev. the sequence for openstack to find a compatible mdev in my mind is that
1. make src mdev type and compatible_type as traits.
2. look for a mdev type that is either i915-GVTg_V4_8 or i915-GVTg_V5_8 as that in compatible_type. (this is just an example, currently we only support migration between mdevs whose attributes are all matching, from mdev type to aggregator, to pci_ids)
3. if 2 fails, try to find a mdev type whose compatible_type is a superset of src compatible_type. if found one, go to step 4; otherwise, quit.
4. check if device_api, software_version under the type are compatible.
5. check if other vendor specific type attributes under the type are compatible. - check if src compatible_pci_ids is a subset of target compatible_pci_ids.
6. check if device is created and not occupied, if not, create one.
7. check if vendor specific attributes under the device are compatible.
   - check if src compatible_aggregator is a subset of target compatible_aggregator. if that fails, try to find the counterpart attribute of the vendor specific device attribute and set the target value according to compatible_xxx on the source side. (for compatible_aggregator, its counterpart is aggregator.) if the attribute aggregator exists, step 7 succeeds when setting its value succeeds. if the attribute aggregator does not exist, step 7 fails.
8. a compatible target is found.
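The eight steps above can be sketched in Python over plain dictionaries standing in for the sysfs tree. Everything here is an assumption made for illustration: the dict layout, the function names, the `in_use` and `new_device_attrs` keys, and the use of exact equality for device_api and software_version in step 4. No real sysfs, libvirt or OpenStack call is involved.

```python
import uuid


def csv_set(value):
    """Split a comma-separated attribute such as '5931, 591b' into a set."""
    return {item.strip() for item in value.split(",") if item.strip()}


def vendor_sets(attrs):
    """Collect vendor-specific 'compatible_*' attributes (not compatible_type)."""
    return {key: csv_set(val) for key, val in attrs.items()
            if key.startswith("compatible_") and key != "compatible_type"}


def pick_type(src_name, src_type, target_types):
    src_compat = csv_set(src_type["compatible_type"])
    for name in target_types:                            # step 2
        if name == src_name or name in src_compat:
            return name
    for name, attrs in target_types.items():             # step 3
        if src_compat <= csv_set(attrs["compatible_type"]):
            return name
    return None


def pick_device(dst_type):
    for dev_id, dev in dst_type["devices"].items():      # step 6
        if not dev.get("in_use", False):
            return dev_id
    dev_id = str(uuid.uuid4())
    # Hypothetical: a new device starts from per-type default attributes.
    dst_type["devices"][dev_id] = dict(dst_type.get("new_device_attrs", {}))
    return dev_id


def match_device(src_dev, dst_dev):                      # step 7
    dst_sets = vendor_sets(dst_dev)
    for name, values in vendor_sets(src_dev).items():
        if values <= dst_sets.get(name, set()):
            continue
        counterpart = name[len("compatible_"):]
        if counterpart not in dst_dev:
            return False
        # A real sysfs write could fail here; a dict assignment cannot.
        dst_dev[counterpart] = sorted(values)[0]
        dst_dev[name] = dst_dev[counterpart]
    return True


def find_target(src_name, src_type, src_dev, target_types):
    name = pick_type(src_name, src_type, target_types)   # steps 1-3
    if name is None:
        return None
    dst_type = target_types[name]
    for key in ("device_api", "software_version"):       # step 4
        if src_type[key] != dst_type[key]:
            return None
    dst_sets = vendor_sets(dst_type)                      # step 5
    for attr, values in vendor_sets(src_type).items():
        if not values <= dst_sets.get(attr, set()):
            return None
    dev_id = pick_device(dst_type)
    if not match_device(src_dev, dst_type["devices"][dev_id]):
        return None
    return name, dev_id                                   # step 8


src_type = {
    "compatible_type": "i915-GVTg_V5_8, i915-GVTg_V4_8",
    "device_api": "vfio-pci",
    "software_version": "1.0.0",
    "compatible_pci_ids": "5931, 591b",
}
src_dev = {"aggregator": "1", "compatible_aggregator": "1"}
target_types = {
    "i915-GVTg_V5_8": dict(src_type, devices={
        "dev-0": {"in_use": False, "aggregator": "2", "compatible_aggregator": "2"},
    }),
}

result = find_target("i915-GVTg_V5_8", src_type, src_dev, target_types)
print(result)
```

In this run the target device starts with aggregator 2, so the subset check in step 7 fails first and the sketch falls back to setting the target's aggregator to the source's value, as the step describes.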
not sure if the above steps look good to you.
some changes are required for the compatibility check for a physical device when mdev_type is absent. but let's arrive at consensus for mdevs first :)
- you don't like self list and compatible list, because it is hard for
openstack to compare different traits? e.g. if we have self list and compatible list, then as below, openstack needs to compare if self field1/2/3 is a subset of compatible field 1/2/3.
Currently we only use mdevs for vGPUs, and in our documentation we tell customers to model the mdev_type as a trait and request it as a required trait. So for customers that are doing that today, changing mdev types is not really an option. We would prefer that they request the features they need instead of a specific mdev type, so we can select any that meets their needs. For example, we have a bunch of traits for CUDA support (https://github.com/openstack/os-traits/blob/master/os_traits/hw/gpu/cuda.py) or DirectX/Vulkan/OpenGL (https://github.com/openstack/os-traits/blob/master/os_traits/hw/gpu/api.py); these are closely analogous to CPU feature flags like AVX or SSE (https://github.com/openstack/os-traits/blob/master/os_traits/hw/cpu/x86/__in...).
So when it comes to compatibility, it would be ideal if you could express capabilities as something like a CPU feature flag; then we can easily model those as traits.
source device: self field1=A1 self field2=A2+B2 self field3=A3
compatible field1=A1 compatible field2=A2;B2;A2+B2; compatible field3=A3
target device: self field1=A1+B1 self field2=A2+B2 self field3=A3
compatible field1=A1;B1;A1+B1; compatible field2=A2;B2;A2+B2; compatible field3=A3
Thanks Yan
On Thu, 20 Aug 2020 08:39:22 +0800 Yan Zhao yan.y.zhao@intel.com wrote:
The reason I want to declare compatible list of attributes is that sometimes it's not a simple 1:1 matching of source attributes and target attributes as I demonstrated below, source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible to target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2), (mdev_type i915-GVTg_V5_8 + aggregator 4)
and aggregator may be just one of such examples that 1:1 matching does not fit.
If you're suggesting that we need a new 'compatible' set for every aggregation, haven't we lost the purpose of aggregation? For example, rather than having N mdev types to represent all the possible aggregation values, we have a single mdev type with N compatible migration entries, one for each possible aggregation value. BTW, how do we have multiple compatible directories? compatible0001, compatible0002? Thanks,
Alex
On Wed, Aug 19, 2020 at 09:22:34PM -0600, Alex Williamson wrote:
On Thu, 20 Aug 2020 08:39:22 +0800 Yan Zhao yan.y.zhao@intel.com wrote:
On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote:
On Tue, 18 Aug 2020 10:16:28 +0100 Daniel P. Berrangé berrange@redhat.com wrote:
On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
On 2020/8/18 下午4:55, Daniel P. Berrangé wrote:
On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
On 2020/8/14 下午1:16, Yan Zhao wrote:
On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
On 2020/8/10 下午3:46, Yan Zhao wrote:
we actually can also retrieve the same information through sysfs, .e.g
|- [path to device] |--- migration | |--- self | | |---device_api | | |---mdev_type | | |---software_version | | |---device_id | | |---aggregator | |--- compatible | | |---device_api | | |---mdev_type | | |---software_version | | |---device_id | | |---aggregator
Yes but:
- You need one file per attribute (one syscall for one attribute)
- Attribute is coupled with kobject
Is that really that bad? You have the device with an embedded kobject anyway, and you can just put things into an attribute group?
[Also, I think that self/compatible split in the example makes things needlessly complex. Shouldn't semantic versioning and matching already cover nearly everything? I would expect very few cases that are more complex than that. Maybe the aggregation stuff, but I don't think we need that self/compatible split for that, either.]
Hi Cornelia,
The reason I want to declare a compatible list of attributes is that sometimes it's not a simple 1:1 matching of source attributes and target attributes. As I demonstrated below, a source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible with a target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2) or (mdev_type i915-GVTg_V5_8 + aggregator 4),
and the aggregator may be just one such example where 1:1 matching does not fit.
If you're suggesting that we need a new 'compatible' set for every aggregation, haven't we lost the purpose of aggregation? For example, rather than having N mdev types to represent all the possible aggregation values, we have a single mdev type with N compatible migration entries, one for each possible aggregation value. BTW, how do we have multiple compatible directories? compatible0001, compatible0002? Thanks,
Do you think the bin_attribute I proposed yesterday is good? Then we can have a single compatible entry with a variable in the mdev_type and aggregator:
mdev_type=i915-GVTg_V5_{val1:int:2,4,8}
aggregator={val1}/2
Thanks
Yan
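A rough Python sketch of how a tool might expand that template into concrete (mdev_type, aggregator) pairs. The thread does not define the template syntax formally, so the reading used here is an assumption: `{val1:int:2,4,8}` declares an integer variable with its allowed values, and `{val1}/2` divides it by 2. The function name `expand_compatible` is hypothetical.

```python
import re

# Assumed reading of the template: "{name:int:v1,v2,...}" declares an integer
# variable and its allowed values; "{name}/N" divides that variable by N.
VAR_DECL = re.compile(r"\{(\w+):int:([\d,]+)\}")

def expand_compatible(mdev_type_tmpl, aggregator_tmpl):
    decl = VAR_DECL.search(mdev_type_tmpl)
    if decl is None:
        raise ValueError("no variable declaration in mdev_type template")
    name = decl.group(1)
    values = [int(v) for v in decl.group(2).split(",")]

    ref, _, divisor = aggregator_tmpl.partition("/")
    if ref != "{" + name + "}" or not divisor.isdigit():
        raise ValueError("unsupported aggregator template")

    pairs = []
    for value in values:
        # Substitute the declaration with one concrete value.
        mdev_type = VAR_DECL.sub(str(value), mdev_type_tmpl)
        # Integer division; assumes the values divide evenly.
        pairs.append((mdev_type, value // int(divisor)))
    return pairs

pairs = expand_compatible("i915-GVTg_V5_{val1:int:2,4,8}", "{val1}/2")
```

Under that reading, the expansion yields the same three combinations as the earlier example: V5_2 with aggregator 1, V5_4 with aggregator 2, and V5_8 with aggregator 4.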
On Thu, 20 Aug 2020 11:16:21 +0800 Yan Zhao yan.y.zhao@intel.com wrote:
I'm not really a fan of binary attributes other than in cases where we have some kind of binary format to begin with.
IIUC, we basically have:
- different partitioning (expressed in the mdev_type)
- different number of partitions (expressed via the aggregator)
- devices being compatible if the partitioning:aggregator ratio is the same
(The multiple mdev_type variants seem to come from avoiding extra creation parameters, IIRC?)
Would it be enough to export
base_type=i915-GVTg_V5
aggregation_ratio=<integer>
to express the various combinations that are compatible without the need for multiple sets of attributes?
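A minimal Python sketch of what matching on those two values could look like. It assumes the ratio is the numeric suffix of the GVT-g type name divided by the aggregator, and that the base type is everything before the last underscore; neither rule is specified in the thread, and `migration_key` and `compatible` are hypothetical names.

```python
from fractions import Fraction

def migration_key(mdev_type, aggregator):
    # Assumption: "i915-GVTg_V5_4" splits into base "i915-GVTg_V5"
    # and partitioning 4.
    base_type, _, partitions = mdev_type.rpartition("_")
    return base_type, Fraction(int(partitions), aggregator)

def compatible(src, dst):
    # Same base type and same partitioning:aggregator ratio.
    return migration_key(*src) == migration_key(*dst)

same_ratio = compatible(("i915-GVTg_V5_2", 1), ("i915-GVTg_V5_8", 4))
diff_ratio = compatible(("i915-GVTg_V5_2", 1), ("i915-GVTg_V5_4", 1))
```

With this key, a single pair of attributes per device covers all three combinations from the earlier example.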
On Tue, Aug 25, 2020 at 04:39:25PM +0200, Cornelia Huck wrote: <...>
Yes, I agree we need to decouple the mdev type name and the aggregator for the purpose of compatibility detection.
Please allow me to say a few words about the history and motivation of introducing the aggregator.
Initially, we had the fixed mdev types i915-GVTg_V5_1, i915-GVTg_V5_2, i915-GVTg_V5_4 and i915-GVTg_V5_8, where the digit after i915-GVTg_V5 represents the max number of instances allowed to be created for this type. They also identify how many resources are to be allocated for each type.
They have worked well so far for current Intel vGPUs, i.e., cutting the physical GPU into several virtual pieces and sharing them among several VMs in a pure mediation way. Fixed types are provided in advance as we thought they could meet the needs of most users, and users can know the hardware capability they acquired from the type name: the bigger the number, the smaller the piece of physical hardware.
Then, when it comes to scalable IOV in the near future, one piece of physical hardware can be cut into a large number of units at the hardware layer. The single unit to be assigned to a guest can be very small, while one to several units are grouped into an mdev.
The fixed type scheme is then cumbersome. Therefore, a new attribute, aggregator, is introduced to specify the number of resources to be assigned, based on the base resource specified in the type name. E.g. if the type name is dsa-1dwq and the aggregator is 30, then the resources assignable to the guest are 30 wqs in a single created mdev. If the type name is dsa-2dwq and the aggregator is 15, then the resources assignable to the guest are also 30 wqs in a single created mdev. (In this example, the rule for defining the type name is different from the case in GVT: here 1 wq means the wq number is 1. Yes, they are the current reality. :) )
Previously, we wanted to regard the two mdevs created with dsa-1dwq x 30 and dsa-2dwq x 15 as compatible, because the two mdevs consist of equal resources.
But, as it's a burden to the upper layer, we agree that if this condition happens, we still treat the two as incompatible.
To fix it, either the driver should expose dsa-1dwq only, or the target dsa-2dwq needs to be destroyed and reallocated via dsa-1dwq x 30.
Does it make sense?
Thanks Yan
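The dsa arithmetic and the agreed rule can be sketched in a few lines of Python. The type-name pattern `dsa-<N>dwq` is taken from the example above; the helper names `total_wqs` and `treated_as_compatible` are hypothetical, and the exact-match rule is only this sketch's reading of "we still treat the two as incompatible".

```python
import re

def total_wqs(mdev_type, aggregator):
    # "dsa-2dwq" with aggregator 15 means 2 * 15 work queues.
    match = re.fullmatch(r"dsa-(\d+)dwq", mdev_type)
    if match is None:
        raise ValueError(f"unexpected type name: {mdev_type}")
    return int(match.group(1)) * aggregator

def treated_as_compatible(src, dst):
    # Agreed rule (as read here): type name and aggregator must both
    # match, even when the total resources happen to be equal.
    return src == dst

a = ("dsa-1dwq", 30)
b = ("dsa-2dwq", 15)
equal_resources = total_wqs(*a) == total_wqs(*b)
still_incompatible = not treated_as_compatible(a, b)
```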
On Wed, 26 Aug 2020 14:41:17 +0800 Yan Zhao yan.y.zhao@intel.com wrote:
AFAIU, these are mdev types, aren't they? So, basically, any management software needs to take care to use the matching mdev type on the target system for device creation?
On Fri, 2020-08-28 at 15:47 +0200, Cornelia Huck wrote:
Or just do the simple thing of using the same mdev type on the source and dest. Matching mdev types is not necessarily trivial. We could do that, but we would have to do it in Python rather than SQL, so it would be slower, at least today.
We don't currently have the ability to say the resource provider must have one of a set of traits, just that it must have a specific trait. This is a feature we have discussed a couple of times and delayed until we really, really need it, but it's not out of the question that we could add it for this use case. I suspect, however, we would do exact match first and explore this later, after the initial mdev migration works.
By the way, I was looking at some vdpa-related material today and noticed vdpa devices are no longer using mdevs and now use a vhost chardev, so I guess we will need a completely separate mechanism for vdpa vs mdev migration as a result. That is rather unfortunate, but I guess that is life.
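The difference between the two kinds of trait matching mentioned above (a specific trait versus any one of a set of traits) can be illustrated with plain Python sets. This is only a sketch of the idea, not the scheduler's actual API; the function names and the trait strings are made up for illustration.

```python
def has_required_trait(provider_traits, required):
    # What is described as possible today: one specific trait must be present.
    return required in provider_traits

def has_any_trait(provider_traits, acceptable):
    # The delayed feature: at least one trait from a set must be present.
    return not provider_traits.isdisjoint(acceptable)

# Hypothetical trait names.
provider = {"CUSTOM_MDEV_TYPE_B", "CUSTOM_OTHER"}
exact = has_required_trait(provider, "CUSTOM_MDEV_TYPE_A")
any_of = has_any_trait(provider, {"CUSTOM_MDEV_TYPE_A", "CUSTOM_MDEV_TYPE_B"})
```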
On Fri, Aug 28, 2020 at 03:04:12PM +0100, Sean Mooney wrote:
Yes, I think it's good.
Still, I'd like to put it more explicitly to make sure it's not missed: the reason we want to specify compatible_type as a trait, and check whether the target compatible_type is a superset of the source compatible_type, is backward compatibility. E.g. an old-generation device may have mdev type xxx-v4-yyy, while a newer-generation device may have mdev type xxx-v5-yyy. With the compatible_type traits, the old-generation device can still be regarded as compatible with the newer-generation device even though their mdev types are not equal.
Thanks Yan
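The superset check described above could be sketched as follows in Python. The type strings are the placeholders from the message; the `can_migrate` name and the idea that each device exposes its compatible_type values as a set are assumptions of this sketch.

```python
def can_migrate(src_compatible_types, dst_compatible_types):
    # The target must accept everything the source declares.
    return set(dst_compatible_types) >= set(src_compatible_types)

old_gen = {"xxx-v4-yyy"}
new_gen = {"xxx-v4-yyy", "xxx-v5-yyy"}

old_to_new = can_migrate(old_gen, new_gen)
new_to_old = can_migrate(new_gen, old_gen)
```

The check is deliberately one-directional: the old-generation device can move to the newer one, but not the other way around.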
On Mon, 31 Aug 2020 12:43:44 +0800 Yan Zhao yan.y.zhao@intel.com wrote:
If you want to support migration from v4 to v5, can't the (presumably newer) driver that supports v5 simply register the v4 type as well, so that the mdev can be created as v4? (Just like QEMU versioned machine types work.)
Yes, it should work in some conditions. But it may not be that good in cases where v5 and v4 in the name string of the mdev type identify the hardware generation (e.g. v4 for gen8, and v5 for gen9).
(1) Suppose the src mdev type is v4 and the target mdev type is v5, because the software does not support it initially, and v4 and v5 identify hardware differences. Then, after a software upgrade, v5 is now compatible with v4. Should the software now downgrade the mdev type from v5 to v4? I'm not sure whether moving the hardware generation info out of the mdev type name into a separate attribute is better, e.g. removing v4 and v5 from the mdev type and using compatible_pci_ids to identify compatibility.
(2) The name string of the mdev type is composed of "driver_name + type_name". For some devices, e.g. qat, different generations of devices bind to drivers of different names, e.g. "qat-v4" and "qat-v5". Then, though type_name is equal, the mdev type is not equal, e.g. "qat-v4-type1" vs "qat-v5-type1".
Thanks Yan
On Wed, 9 Sep 2020 10:13:09 +0800 Yan Zhao yan.y.zhao@intel.com wrote:
e.g. (1). when src mdev type is v4 and target mdev type is v5 as software does not support it initially, and v4 and v5 identify hardware differences.
My first hunch here is: Don't introduce types that may be compatible later. Either make them compatible, or make them distinct by design, and possibly add a different, compatible type later.
then after software upgrade, v5 is now compatible to v4, should the software now downgrade mdev type from v5 to v4? not sure if moving hardware generation info into a separate attribute from mdev type name is better. e.g. remove v4, v5 in mdev type, while use compatible_pci_ids to identify compatibility.
If the generations are compatible, don't mention it in the mdev type. If they aren't, use distinct types, so that management software doesn't have to guess. At least that would be my naive approach here.
(2) name string of mdev type is composed by "driver_name + type_name". in some devices, e.g. qat, different generations of devices are binding to drivers of different names, e.g. "qat-v4", "qat-v5". then though type_name is equal, mdev type is not equal. e.g. "qat-v4-type1", "qat-v5-type1".
I guess that shows a shortcoming of that "driver_name + type_name" approach? Or maybe I'm just confused.