Hello,

In my company (CloudFerro) we have developed two features related to nova ephemeral storage (with the libvirt virt driver) - "SPDK-based ephemeral storage backend" and "Multiple ephemeral storage backend handling" (both described below). Would these features be appropriate to be added to upstream? If yes, then should I start with creating a blueprint and proposing a spec for each of them so it is more clear what we want to introduce? Of course not both at once, since there is some code that makes one work with the other. For us it would probably be best if we upstreamed "SPDK-based ephemeral storage backend" first and then upstreamed "Multiple ephemeral storage backend handling".

Description of SPDK-based ephemeral storage backend:

We add a new possible value for [libvirt]/images_type in nova.conf: spdk_lvol. If this value is set then local disks of instances are handled as logical volumes (lvols) of a locally run SPDK instance (on the same compute as the VM). See https://spdk.io/doc/logical_volumes.html#lvol for docs on this subject. These are essentially a part of a local NVMe disk managed by SPDK. We create and manage those lvols by making calls from nova-compute to the SPDK instance with RPC (https://spdk.io/doc/jsonrpc.html) using the provided python library (https://github.com/spdk/spdk/blob/master/python/README.md). We attach those lvols to instances by exposing them as vhost-blk devices (see https://spdk.io/doc/vhost.html) and specifying them as disks with source_type='vhostuser'. This method of exposing NVMe local storage allows for much better I/O performance than exposing it from the local filesystem. We currently have working: creating, deleting, cold-migrating, shelving, snapshotting and unshelving VMs with this storage backend. We don't have live migration working yet but have it in our plans. This feature includes changes only to nova.

Description of Multiple ephemeral storage backend handling:

We add a new configuration option [libvirt]/supported_image_types to the nova.conf of nova-compute and change the meaning of [libvirt]/images_type to mean the default image type. If the libvirt:image_type extra spec is specified in a flavor then the VM is scheduled on a compute with the appropriate image type in its [libvirt]/supported_image_types. We use potentially multiple DISK_GB resource providers and construct appropriate request groups to handle this scheduling. This extra spec is also used to fill the driver_info.libvirt.image_type field of this VM's BDMs with destination_type=local (driver_info is a new JSON-serialized field of the BDM). Then in compute, if a BDM specifies this field, we use its value instead of [libvirt]/images_type to decide which imagebackend to use for it. This method of handling multiple backends only works after being enabled by administrators by setting libvirt:image_type on all their flavors and running nova-manage commands that update existing VMs. Without enabling it, everything works as it was before. This feature includes changes to nova and some new traits in os-traits (one per possible value of [libvirt]/images_type + one extra).

Best regards,
Karol Klimaszewski
On 28/11/2025 12:44, Karol Klimaszewski wrote:
Hello,
In my company (CloudFerro) we have developed two features related to nova ephemeral storage (with the libvirt virt driver) - "SPDK-based ephemeral storage backend" and "Multiple ephemeral storage backend handling" (both described below). Would these features be appropriate to be added to upstream?

We would need to know a little more about them to say, but in principle both could be. One note on wording:
Ephemeral storage in nova refers primarily to additional non-root disks allocated by nova for a VM, i.e. flavor.ephemeral_gb. Storage provided by nova in general is "nova provisioned storage". I avoid using the term ephemeral storage to refer to the nova VM root disk.
If yes, then should I start with creating a blueprint and proposing a spec for each of them so it is more clear what we want to introduce?
Of course not both at once, since there is some code that makes one work with the other. For us it would probably be best if we upstreamed "SPDK-based ephemeral storage backend" first and then upstreamed "Multiple ephemeral storage backend handling"
Ack, so you would first like to add a new images_type backend for using vhost-user with a virtio-blk device backed by SPDK, correct? That would allow you to have some hosts with SPDK-backed storage and others with rbd or qcow, for example.

Then, once that capability is available, work on allowing a single host to have multiple storage backends enabled at the same time, so you no longer need to partition your cloud into different hosts for different storage backends, correct?
Description of SPDK-based ephemeral storage backend:
We add a new possible value for [libvirt]/images_type in nova.conf: spdk_lvol. If this value is set then local disks of instances are handled as logical volumes (lvols) of a locally run SPDK instance (on the same compute as the VM). See https://spdk.io/doc/logical_volumes.html#lvol for docs on this subject. These are essentially a part of a local NVMe disk managed by SPDK.
SPDK was created as a spin-off from DPDK, so while it's been a few years, I was quite familiar with it when I used to work at Intel. It can operate over NVMe SSDs, but it's also possible to use it with hard drives. Importantly, that also means you can actually deploy it in a VM in the CI, and it can use a loopback block device or similar as a backing store for testing. That would obviously not be correct for production usage, but it means there is no specific hardware requirement to deploy this open source storage solution with devstack or in our CI.
We create and manage those lvols by making calls from nova-compute to the SPDK instance with RPC (https://spdk.io/doc/jsonrpc.html) using the provided python library (https://github.com/spdk/spdk/blob/master/python/README.md). We attach those lvols to instances by exposing them as vhost-blk devices (see https://spdk.io/doc/vhost.html) and specifying them as disks with source_type='vhostuser'.

Yes, this is using the same transport as we use for virtio-net devices with DPDK and OVS.
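For anyone following along, managing the lvols and vhost-blk controllers over the SPDK JSON-RPC socket would look roughly like the sketch below; the method and parameter names follow the public JSON-RPC docs and may differ between SPDK versions, and the socket path and names are made up rather than taken from the CloudFerro code.

```
# Rough sketch only: create an lvol on a pre-existing lvstore and expose it
# as a vhost-blk controller through the SPDK JSON-RPC python client.
# Method/parameter names follow https://spdk.io/doc/jsonrpc.html and may vary
# between SPDK versions; the socket path and names here are invented.
from spdk.rpc.client import JSONRPCClient

client = JSONRPCClient("/var/tmp/spdk.sock")

# Create a thin-provisioned 20 GiB lvol on the lvstore "nova-lvs".
lvol_uuid = client.call("bdev_lvol_create", {
    "lvs_name": "nova-lvs",
    "lvol_name": "instance-0000002a-disk",
    "size_in_mib": 20 * 1024,
    "thin_provision": True,
})

# Expose the lvol as a vhost-blk device; qemu then connects to the resulting
# vhost-user socket and the disk is defined with source_type='vhostuser'.
client.call("vhost_create_blk_controller", {
    "ctrlr": "instance-0000002a-disk",
    "dev_name": "nova-lvs/instance-0000002a-disk",
})
```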
One of the requirements for this to work is that you have shared memory access, which means using file-backed memory, huge pages or memfd. The former two are already supported in nova, but https://review.opendev.org/c/openstack/nova-specs/+/951689 would make it much more user friendly for operators and end-users. Without using one of those alternative memory modes the VM will start, but the SPDK application will not be able to map the guest memory regions to implement its part of the vhost-user protocol.
This method of exposing NVMe local storage allows for much better I/O performance than exposing it from the local filesystem.
If we were not to enable this natively in nova, another option may be to do this with cyborg. Cyborg has an SPDK driver for this reason today, not that it's maintained:
https://github.com/openstack/cyborg/tree/master/cyborg/accelerator/drivers/s...
but the nova enablement of using vhost-user for block devices was never completed. So perhaps we could enable both use cases at the same time, or collaborate on doing this via cyborg.

One of the use cases I want to bring to cyborg going forward is the ability to attach additional local storage resources to a VM; I was initially thinking of an LVM driver and an NVMe namespace driver to supplement the SPDK and SSD drivers that already exist.
We currently have working: creating, deleting, cold-migrating, shelving, snapshotting and unshelving VMs with this storage backend. We don't have live migration working yet but have it in our plans.
Live migration, I think, would only require precreating the relevant SPDK resources in pre_live_migration on the destination host, correct? It should be possible to generate stable vhost-user socket paths, but if not, you would also need to update the live migration data object returned by pre_live_migration and extend the XML update logic to modify the paths when generating the migration XML. As far as I am aware, qemu/libvirt fully support live migration with SPDK and have for a long time.
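The XML side of that would essentially be a socket-path rewrite on the vhost-user disks; a minimal sketch (the element layout is based on libvirt's vhost-user disk support and the path mapping is hypothetical, this is not nova driver code):

```
# Illustrative only: rewrite vhost-user disk socket paths in a domain XML,
# e.g. when generating the migration XML for the destination host.
# Element names follow libvirt's vhost-user disk schema; the path mapping is
# a made-up example.
import xml.etree.ElementTree as ET

def rewrite_vhostuser_paths(domain_xml: str, path_map: dict) -> str:
    """Replace vhost-user disk socket paths with destination host paths."""
    root = ET.fromstring(domain_xml)
    for disk in root.findall("./devices/disk[@type='vhostuser']"):
        source = disk.find("source")
        if source is not None and source.get("path") in path_map:
            source.set("path", path_map[source.get("path")])
    return ET.tostring(root, encoding="unicode")
```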
This feature includes changes only to nova.
Description of Multiple ephemeral storage backend handling:
We add a new configuration option [libvirt]/supported_image_types to nova.conf of nova-compute and change the meaning of [libvirt]/images_type to mean default image type.
We can rathole on the name, but we would probably not use supported_image_types as that can easily be confused with the format of the image. I would generally propose deprecating [libvirt]/images_type and introducing a preferentially ordered list of storage backends:

```
[libvirt]
storage_backend=spdk,rbd,qcow
```

i.e. spdk is the default, followed by rbd, followed by qcow. That would allow you to have a new flavor extra spec hw:storage_backends=qcow,rbd, meaning this flavor would prefer to select a host with qcow storage first and rbd second, but not use spdk, e.g. because it does not use hugepages and won't work.
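To make the intended semantics concrete, selecting the backend from such an ordered preference could look like the hypothetical helper below (not an existing nova function):

```
# Hypothetical helper showing the intended semantics of an ordered
# hw:storage_backends preference against a host's configured backends.
def pick_storage_backend(extra_specs, host_backends):
    """Return the first flavor-preferred backend the host supports, or None."""
    requested = extra_specs.get("hw:storage_backends")
    if not requested:
        # No preference expressed: use the host's default (first configured).
        return host_backends[0] if host_backends else None
    for backend in (b.strip() for b in requested.split(",")):
        if backend in host_backends:
            return backend
    return None  # no overlap: the host is not a valid candidate

# Flavor prefers qcow then rbd; host is configured with spdk,rbd,qcow.
assert pick_storage_backend({"hw:storage_backends": "qcow,rbd"},
                            ["spdk", "rbd", "qcow"]) == "qcow"
```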
If the libvirt:image_type extra spec is specified in a flavor then the VM is scheduled on a compute with the appropriate image type in its [libvirt]/supported_image_types. We use potentially multiple DISK_GB resource providers and construct appropriate request groups to handle this scheduling.

Yeah, this is similar to what I said above; I just think image_type is a bit too generic and misleading, as it has nothing to do with the actual glance image. Hence `hw:storage_backends` or just `hw:storage_backend`.
We also have a naming convention which includes not using the name of a software project in our extra specs; so instead of cyborg we use accelerator, and we would not use libvirt: as the namespace for an extra spec. As this affects how we virtualise the hardware presented to the VM, it would make sense to use the hw: prefix/namespace. Using hw: would also technically allow other virt drivers to implement it in the future, in theory.
This extra spec is also used to fill the driver_info.libvirt.image_type field of this VM's BDMs with destination_type=local (driver_info is a new JSON-serialized field of the BDM). Then in compute, if a BDM specifies this field, we use its value instead of [libvirt]/images_type to decide which imagebackend to use for it.

Do you envision allowing different disks to use different storage backends?
Nova can provision 3 types of disk today: the root disk, 0-N additional ephemeral disks, and a swap disk. Today we always use the same storage backend for all 3, so if you have images_type=rbd, the swap and ephemeral disks are also created on ceph. The advantage of `hw:storage_backends=qcow,rbd` is that we could support the following: `hw:storage_backends=swap:qcow,ephemeral:spdk,root:rbd`. So the root disk would be backed by an HA ceph cluster for fault tolerance of a host, enabling evacuation without data loss. The swap disk, which is rarely heavily used, would use qcow storage, providing a moderate level of performance which is appropriate for swap, with the ability to oversubscribe storage as a benefit since most of the time it won't be used. And finally you can have high-speed local scratch disk space backed by SPDK, all in one VM. Allowing different storage types (swap, root, ephemeral) to come from different storage backends is one of the key reasons why we should eventually support multiple storage backends on a single host.
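And the per-disk form could be parsed along these lines (again a hypothetical sketch, not existing code):

```
# Hypothetical parser for the per-disk form of hw:storage_backends,
# e.g. "swap:qcow,ephemeral:spdk,root:rbd".
DISK_TYPES = ("root", "ephemeral", "swap")

def parse_per_disk_backends(value):
    """Return a {disk_type: backend} mapping from the extra spec value."""
    mapping = {}
    for item in value.split(","):
        disk_type, _, backend = item.strip().partition(":")
        if disk_type not in DISK_TYPES or not backend:
            raise ValueError("invalid hw:storage_backends entry: %r" % item)
        mapping[disk_type] = backend
    return mapping

assert parse_per_disk_backends("swap:qcow,ephemeral:spdk,root:rbd") == {
    "swap": "qcow", "ephemeral": "spdk", "root": "rbd"}
```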
This method of handling multiple backends only works after being enabled by administrators by setting libvirt:image_type on all their flavors and running nova-manage commands that update existing VMs.
Well, we don't allow modifying the flavor via nova-manage today. The official way to update an existing instance is to resize; however, we do not currently support resizing between storage backends. I think if we were to do this work we would want to require that. I don't think we would provide a nova-manage command for this, since nova-manage is not really intended to be used on the compute node. It can be, but the one command we added there is sort of problematic, though that's a different topic.
Without enabling it, everything works as it was before. This feature includes changes to nova and some new traits in os-traits (one per possible value of [libvirt]/images_type + one extra).
Best regards,
Karol Klimaszewski
Thank you for the response, Sean. Sorry for the late reply.

On 28.11.2025 18:16, Sean Mooney wrote:
On 28/11/2025 12:44, Karol Klimaszewski wrote:
Hello,
In my company (CloudFerro) we have developed two features related to nova ephemeral storage (with the libvirt virt driver) - "SPDK-based ephemeral storage backend" and "Multiple ephemeral storage backend handling" (both described below). Would these features be appropriate to be added to upstream?

We would need to know a little more about them to say, but in principle both could be. One note on wording:
Ephemeral storage in nova refers primarily to additional non-root disks allocated by nova for a VM, i.e. flavor.ephemeral_gb. Storage provided by nova in general is "nova provisioned storage". I avoid using the term ephemeral storage to refer to the nova VM root disk.
OK, thanks for the clarification!
If yes, then should I start with creating a blueprint and proposing a spec for each of them so it is more clear what we want to introduce?
Of course not both at once, since there is some code that makes one work with the other. For us it would probably be best if we upstreamed "SPDK-based ephemeral storage backend" first and then upstreamed "Multiple ephemeral storage backend handling"
Ack, so you would first like to add a new images_type backend for using vhost-user with a virtio-blk device backed by SPDK, correct? That would allow you to have some hosts with SPDK-backed storage and others with rbd or qcow, for example.

Then, once that capability is available, work on allowing a single host to have multiple storage backends enabled at the same time, so you no longer need to partition your cloud into different hosts for different storage backends, correct?
Yes, that is correct.
Description of SPDK-based ephemeral storage backend:
We add a new possible value for [libvirt]/images_type in nova.conf: spdk_lvol. If this value is set then local disks of instances are handled as logical volumes (lvols) of a locally run SPDK instance (on the same compute as the VM). See https://spdk.io/doc/logical_volumes.html#lvol for docs on this subject. These are essentially a part of a local NVMe disk managed by SPDK.
SPDK was created as a spin-off from DPDK, so while it's been a few years, I was quite familiar with it when I used to work at Intel. It can operate over NVMe SSDs, but it's also possible to use it with hard drives. Importantly, that also means you can actually deploy it in a VM in the CI, and it can use a loopback block device or similar as a backing store for testing. That would obviously not be correct for production usage, but it means there is no specific hardware requirement to deploy this open source storage solution with devstack or in our CI.
We didn't test such an approach, but it should work, especially since our solution works by creating lvols on top of lvstores that were created prior to nova-compute startup. We do not care whether they are on top of an NVMe disk, an AIO bdev running on a hard drive, or a loopback bdev. Some work would probably be needed to set up an SPDK vhost_tgt instance running in devstack, however.
We create and manage those lvols by making calls from nova-compute to the SPDK instance with RPC (https://spdk.io/doc/jsonrpc.html) using the provided python library (https://github.com/spdk/spdk/blob/master/python/README.md). We attach those lvols to instances by exposing them as vhost-blk devices (see https://spdk.io/doc/vhost.html) and specifying them as disks with source_type='vhostuser'.

Yes, this is using the same transport as we use for virtio-net devices with DPDK and OVS.
One of the requirements for this to work is that you have shared memory access, which means using file-backed memory, huge pages or memfd. The former two are already supported in nova, but https://review.opendev.org/c/openstack/nova-specs/+/951689 would make it much more user friendly for operators and end-users.
On our end we used hugepage-based shared memory, and it's true that it was a bit problematic to use. Thanks for the heads-up on memfd; that will definitely be useful for SPDK-based instances and we will look into it on our side.
Without using one of those alternative memory modes the VM will start, but the SPDK application will not be able to map the guest memory regions to implement its part of the vhost-user protocol.
From our experience, trying to run vhostuser devices without shared memory is prevented by libvirt; the following error is thrown: libvirt.libvirtError: unsupported configuration: 'vhostuser' requires shared memory
This method of exposing NVMe local storage allows for much better I/O performance than exposing it from the local filesystem.
If we were not to enable this natively in nova, another option may be to do this with cyborg. Cyborg has an SPDK driver for this reason today, not that it's maintained:
https://github.com/openstack/cyborg/tree/master/cyborg/accelerator/drivers/s...
but the nova enablement of using vhost-user for block devices was never completed. So perhaps we could enable both use cases at the same time, or collaborate on doing this via cyborg.

One of the use cases I want to bring to cyborg going forward is the ability to attach additional local storage resources to a VM; I was initially thinking of an LVM driver and an NVMe namespace driver to supplement the SPDK and SSD drivers that already exist.
I am not very familiar with the cyborg component, but from what I understand this would entirely separate the SPDK disks from nova's imagebackend: spdk-based disks would be specified using something like accel:device_profile and it would only work for additional disks, not the root disk. Is this correct? Would doing it via cyborg be preferred? To be honest, it would be easier for me to introduce the nova-based approach since we already have a version of it working in our environment, but if necessary I can look into how it would be done with cyborg.
We currently have working: creating, deleting, cold-migrating, shelving, snapshotting and unshelving VMs with this storage backend. We don't have live migration working yet but have it in our plans.
Live migration, I think, would only require precreating the relevant SPDK resources in pre_live_migration on the destination host, correct? It should be possible to generate stable vhost-user socket paths, but if not, you would also need to update the live migration data object returned by pre_live_migration and extend the XML update logic to modify the paths when generating the migration XML. As far as I am aware, qemu/libvirt fully support live migration with SPDK and have for a long time.
I will look into this, thank you. With this approach, will the data on the SPDK disk be the same as before the migration? Also, do you feel that live migration support is needed in the initial version of this feature?
This feature includes changes only to nova.
Description of Multiple ephemeral storage backend handling:
We add a new configuration option [libvirt]/supported_image_types to nova.conf of nova-compute and change the meaning of [libvirt]/images_type to mean default image type.
We can rathole on the name, but we would probably not use supported_image_types as that can easily be confused with the format of the image. I would generally propose deprecating [libvirt]/images_type and introducing a preferentially ordered list of storage backends:

```
[libvirt]
storage_backend=spdk,rbd,qcow
```

i.e. spdk is the default, followed by rbd, followed by qcow.
I agree that this approach looks better. We kept images_type in our version since we didn't want to stray too far from upstream in our changes.
That would allow you to have a new flavor extra spec hw:storage_backends=qcow,rbd, meaning this flavor would prefer to select a host with qcow storage first and rbd second, but not use spdk, e.g. because it does not use hugepages and won't work.
One problem I see with this approach is that placement does not currently allow alternatives in a request for allocation candidates. If we wanted either a qcow or an rbd storage backend, we would need two requests: one for DISK_GB on an RP with the trait COMPUTE_STORAGE_BACKEND_QCOW and another for DISK_GB on an RP with the trait COMPUTE_STORAGE_BACKEND_RBD. And if we included different storage backends for root, swap and ephemeral, we would need a request for every possibility. But besides that this could probably also work.
If the libvirt:image_type extra spec is specified in a flavor then the VM is scheduled on a compute with the appropriate image type in its [libvirt]/supported_image_types. We use potentially multiple DISK_GB resource providers and construct appropriate request groups to handle this scheduling.

Yeah, this is similar to what I said above; I just think image_type is a bit too generic and misleading, as it has nothing to do with the actual glance image. Hence `hw:storage_backends` or just `hw:storage_backend`.

We also have a naming convention which includes not using the name of a software project in our extra specs; so instead of cyborg we use accelerator, and we would not use libvirt: as the namespace for an extra spec. As this affects how we virtualise the hardware presented to the VM, it would make sense to use the hw: prefix/namespace. Using hw: would also technically allow other virt drivers to implement it in the future, in theory.
Yes, this would probably be a better name.
This extra spec is also used to fill the driver_info.libvirt.image_type field of this VM's BDMs with destination_type=local (driver_info is a new JSON-serialized field of the BDM). Then in compute, if a BDM specifies this field, we use its value instead of [libvirt]/images_type to decide which imagebackend to use for it.

Do you envision allowing different disks to use different storage backends?
Nova can provision 3 types of disk today: the root disk, 0-N additional ephemeral disks, and a swap disk. Today we always use the same storage backend for all 3, so if you have images_type=rbd, the swap and ephemeral disks are also created on ceph. The advantage of `hw:storage_backends=qcow,rbd` is that we could support the following: `hw:storage_backends=swap:qcow,ephemeral:spdk,root:rbd`. So the root disk would be backed by an HA ceph cluster for fault tolerance of a host, enabling evacuation without data loss. The swap disk, which is rarely heavily used, would use qcow storage, providing a moderate level of performance which is appropriate for swap, with the ability to oversubscribe storage as a benefit since most of the time it won't be used. And finally you can have high-speed local scratch disk space backed by SPDK, all in one VM. Allowing different storage types (swap, root, ephemeral) to come from different storage backends is one of the key reasons why we should eventually support multiple storage backends on a single host.
In our environment we don't really use additional ephemeral disks, and swap disks have always used the same storage backend as the root disk. But with some work those changes should be able to support using different storage backends for different disks.
This method of handling multiple backends only works after being enabled by administrators by setting libvirt:image_type on all their flavors and running nova-manage commands that update existing VMs.
Well, we don't allow modifying the flavor via nova-manage today. The official way to update an existing instance is to resize; however, we do not currently support resizing between storage backends. I think if we were to do this work we would want to require that. I don't think we would provide a nova-manage command for this, since nova-manage is not really intended to be used on the compute node. It can be, but the one command we added there is sort of problematic, though that's a different topic.
We did it this way since we wanted to allow reusing existing flavors, which for us were already storage-backend restricted (but based on aggregates, not traits and BDM properties), with the new multibackend approach. So we would update flavors with the libvirt:image_type property (not using nova-manage), then run the nova-manage script which would update the libvirt:image_type stored in instance.flavor if needed, and then update bdm.driver_info.libvirt.image_type. This way we could "upgrade" to using multibackend without affecting customers.

In general, we found instances without driver_info.libvirt.image_type set in their BDMs with destination_type=local to cause some issues if we enable [libvirt]/supported_image_types. For example, it would be impossible to tell the storage backend of a BDM in pre_migrate on the dest host, or instances would unshelve with an incorrect storage backend. This is why we decided that it would be better for setting driver_info.libvirt.image_type on all BDMs with destination_type=local to be a prerequisite to setting [libvirt]/supported_image_types to multiple values on any compute.
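For concreteness, the proposed per-BDM field and the backfill step described above would look roughly like this (the field names follow the proposal in this thread; this is an illustrative sketch, not our exact script):

```
# Illustrative shape of the proposed BDM driver_info field and the backfill
# step described above; field names follow the proposal in this thread and
# are not an existing nova schema.
import json

def backfill_driver_info(bdm, default_image_type):
    """Set driver_info.libvirt.image_type on a local BDM if it is missing."""
    if bdm.get("destination_type") != "local":
        return bdm
    driver_info = json.loads(bdm.get("driver_info") or "{}")
    driver_info.setdefault("libvirt", {}).setdefault(
        "image_type", default_image_type)
    bdm["driver_info"] = json.dumps(driver_info)
    return bdm

bdm = {"destination_type": "local", "driver_info": None}
print(backfill_driver_info(bdm, "spdk_lvol")["driver_info"])
# -> {"libvirt": {"image_type": "spdk_lvol"}}
```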
On 08/12/2025 13:09, Karol Klimaszewski wrote:
Thank you for the response, Sean. Sorry for the late reply.
On 28.11.2025 18:16, Sean Mooney wrote:
On 28/11/2025 12:44, Karol Klimaszewski wrote:

Hello,

In my company (CloudFerro) we have developed two features related to nova ephemeral storage (with the libvirt virt driver) - "SPDK-based ephemeral storage backend" and "Multiple ephemeral storage backend handling" (both described below). Would these features be appropriate to be added to upstream?

We would need to know a little more about them to say, but in principle both could be. One note on wording:
Ephemeral storage in nova refers primarily to additional non-root disks allocated by nova for a VM, i.e. flavor.ephemeral_gb. Storage provided by nova in general is "nova provisioned storage". I avoid using the term ephemeral storage to refer to the nova VM root disk.
OK, thanks for the clarification!
If yes, then should I start with creating a blueprint and proposing a spec for each of them so it is more clear what we want to introduce?
Of course not both at once, since there is some code that makes one work with the other. For us it would probably be best if we upstreamed "SPDK-based ephemeral storage backend" first and then upstreamed "Multiple ephemeral storage backend handling"
Ack, so you would first like to add a new images_type backend for using vhost-user with a virtio-blk device backed by SPDK, correct? That would allow you to have some hosts with SPDK-backed storage and others with rbd or qcow, for example.

Then, once that capability is available, work on allowing a single host to have multiple storage backends enabled at the same time, so you no longer need to partition your cloud into different hosts for different storage backends, correct?
Yes, that is correct.
Description of SPDK-based ephemeral storage backend:
We add a new possible value for [libvirt]/images_type in nova.conf: spdk_lvol. If this value is set then local disks of instances are handled as logical volumes (lvols) of a locally run SPDK instance (on the same compute as the VM). See https://spdk.io/doc/logical_volumes.html#lvol for docs on this subject. These are essentially a part of a local NVMe disk managed by SPDK.
SPDK was created as a spin-off from DPDK, so while it's been a few years, I was quite familiar with it when I used to work at Intel. It can operate over NVMe SSDs, but it's also possible to use it with hard drives. Importantly, that also means you can actually deploy it in a VM in the CI, and it can use a loopback block device or similar as a backing store for testing. That would obviously not be correct for production usage, but it means there is no specific hardware requirement to deploy this open source storage solution with devstack or in our CI.
We didn't test such an approach, but it should work, especially since our solution works by creating lvols on top of lvstores that were created prior to nova-compute startup. We do not care whether they are on top of an NVMe disk, an AIO bdev running on a hard drive, or a loopback bdev. Some work would probably be needed to set up an SPDK vhost_tgt instance running in devstack, however.

Ack, I was just calling out that ideally, if we integrate nova with SPDK, we would be able to test it in the first-party CI. For cinder we use a loopback device for the cinder LVM driver, which supports iSCSI or NVMe-oF; for SPDK development we could do the same and set up SPDK on top of a loopback block device.
We create and manage those lvols by making calls from nova-compute to the SPDK instance with RPC (https://spdk.io/doc/jsonrpc.html) using the provided python library (https://github.com/spdk/spdk/blob/master/python/README.md). We attach those lvols to instances by exposing them as vhost-blk devices (see https://spdk.io/doc/vhost.html) and specifying them as disks with source_type='vhostuser'.

Yes, this is using the same transport as we use for virtio-net devices with DPDK and OVS.
One of the requirements for this to work is that you have shared memory access, which means using file-backed memory, huge pages or memfd. The former two are already supported in nova, but https://review.opendev.org/c/openstack/nova-specs/+/951689 would make it much more user friendly for operators and end-users.
On our end we used hugepage-based shared memory, and it's true that it was a bit problematic to use. Thanks for the heads-up on memfd; that will definitely be useful for SPDK-based instances and we will look into it on our side.
Without using one of those alternative memory modes the VM will start, but the SPDK application will not be able to map the guest memory regions to implement its part of the vhost-user protocol.
From our experience, trying to run vhostuser devices without shared memory is prevented by libvirt; the following error is thrown: libvirt.libvirtError: unsupported configuration: 'vhostuser' requires shared memory
This must be relatively new. It's good they are now enforcing this requirement; the old behavior was that the VM would boot without any network connectivity when we used ovs-dpdk. On a related note, does SPDK support a server-mode vhost-user socket where qemu is the socket server and SPDK is the client?
This method of exposing NVMe local storage allows for much better I/O performance than exposing it from the local filesystem.
If we were not to enable this natively in nova, another option may be to do this with cyborg. Cyborg has an SPDK driver for this reason today, not that it's maintained:
https://github.com/openstack/cyborg/tree/master/cyborg/accelerator/drivers/s...
but the nova enablement of using vhost-user for block devices was never completed. So perhaps we could enable both use cases at the same time, or collaborate on doing this via cyborg.

One of the use cases I want to bring to cyborg going forward is the ability to attach additional local storage resources to a VM; I was initially thinking of an LVM driver and an NVMe namespace driver to supplement the SPDK and SSD drivers that already exist.
I am not very familiar with the cyborg component, but from what I understand this would entirely separate the SPDK disks from nova's imagebackend: spdk-based disks would be specified using something like accel:device_profile and it would only work for additional disks, not the root disk. Is this correct? Would doing it via cyborg be preferred? To be honest, it would be easier for me to introduce the nova-based approach since we already have a version of it working in our environment, but if necessary I can look into how it would be done with cyborg.
Requiring cyborg to support this is a rather heavy ask, as the project is effectively inactive currently. I was more pointing out that if you extended the libvirt driver to support vhost-user for block devices, that would allow us to manage those either via cyborg, or a new nova images_type backend, or via a cinder volume backend. The core functionality to generate the device XML for the libvirt domain is the same and would enable all 3 use cases. Currently the cyborg SPDK work that was created requires out-of-tree changes to nova to function, precisely because the generic support for vhost-user for block devices is not available in the libvirt driver. That prevents SPDK cyborg or cinder drivers from working without forking nova to add that support.
We currently have working: creating, deleting, cold-migrating, shelving, snapshotting and unshelving VMs with this storage backend. We don't have live migration working yet but have it in our plans.
Live migration, I think, would only require precreating the relevant SPDK resources in pre_live_migration on the destination host, correct? It should be possible to generate stable vhost-user socket paths, but if not, you would also need to update the live migration data object returned by pre_live_migration and extend the XML update logic to modify the paths when generating the migration XML. As far as I am aware, qemu/libvirt fully support live migration with SPDK and have for a long time.
I will look into this, thank you. With this approach, will the data on the SPDK disk be the same as before the migration?
Also, do you feel that live migration support is needed in the initial version of this feature?

For local block devices attached as vhost-user (SPDK backend) devices we would have to test, but I would expect the qemu block device layer to copy the data the same way it does if you assign a local disk or file.

In general we are reluctant to add features that don't work with live migration unless there is a documented limitation that prevents it in qemu or similar, i.e. if it can't be supported because of a dependency, that's OK. If it's being kept out of scope just to limit the scope, we would prefer to have it in the initial version. As long as evacuate and cold migrate work in the initial version, adding live migrate later would be OK, but we really prefer to have at least one move operation that can be used for host maintenance.
This feature includes changes only to nova.
Description of Multiple ephemeral storage backend handling:
We add a new configuration option [libvirt]/supported_image_types to nova.conf of nova-compute and change the meaning of [libvirt]/images_type to mean default image type.
We can rathole on the name, but we would probably not use supported_image_types as that can easily be confused with the format of the image. I would generally propose deprecating [libvirt]/images_type and introducing a preferentially ordered list of storage backends:

```
[libvirt]
storage_backend=spdk,rbd,qcow
```

i.e. spdk is the default, followed by rbd, followed by qcow.
I agree that this approach looks better. We kept images_type in our version since we didn't want to stray too far from upstream in our changes.
That would allow you to have a new flavor extra spec hw:storage_backends=qcow,rbd, meaning this flavor would prefer to select a host with qcow storage first and rbd second, but not use spdk, e.g. because it does not use hugepages and won't work.
One problem I see with this approach is that placement does not currently allow alternatives in a request for allocation candidates. If we wanted either a qcow or an rbd storage backend, we would need two requests: one for DISK_GB on an RP with the trait COMPUTE_STORAGE_BACKEND_QCOW and another for DISK_GB on an RP with the trait COMPUTE_STORAGE_BACKEND_RBD. And if we included different storage backends for root, swap and ephemeral, we would need a request for every possibility.
For resource classes you are correct, but not for traits. I was thinking this would work more like the NUMA-in-placement approach for hugepages: you would have nested resource providers of DISK_GB with the relevant traits COMPUTE_STORAGE_BACKEND_QCOW, COMPUTE_STORAGE_BACKEND_RBD, and either you would not include the trait request if hw:storage_backends had more than one backend and filter in the nova scheduler, or preferably you would use required=in:COMPUTE_STORAGE_BACKEND_QCOW,COMPUTE_STORAGE_BACKEND_RBD. The ability to require that any of a set of traits match was added in Yoga: https://docs.openstack.org/placement/latest/specs/yoga/implemented/2005346-a... You would have to move the disk request into its own request group to use that properly.
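Concretely, the granular disk request group would end up as a query roughly like the sketch below (the COMPUTE_STORAGE_BACKEND_* traits are the ones proposed in this thread, the group suffix is arbitrary, and the any-of `in:` syntax needs placement microversion 1.39 or later):

```
# Sketch of a GET /allocation_candidates query using a named request group
# for the disk with the any-of trait syntax. Trait names are the ones
# proposed in this thread, not existing os-traits.
from urllib.parse import urlencode

params = {
    # separate, suffixed request group just for the disk
    "resources_disk": "DISK_GB:20",
    "required_disk": ("in:COMPUTE_STORAGE_BACKEND_QCOW,"
                      "COMPUTE_STORAGE_BACKEND_RBD"),
    # unsuffixed group for the rest of the instance
    "resources": "VCPU:2,MEMORY_MB:4096",
    "group_policy": "none",
}
print("GET /allocation_candidates?" + urlencode(params))
```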
But besides that this could probably also work.
If the libvirt:image_type extra spec is specified in a flavor then the VM is scheduled on a compute with the appropriate image type in its [libvirt]/supported_image_types. We use potentially multiple DISK_GB resource providers and construct appropriate request groups to handle this scheduling.

Yeah, this is similar to what I said above; I just think image_type is a bit too generic and misleading, as it has nothing to do with the actual glance image. Hence `hw:storage_backends` or just `hw:storage_backend`.

We also have a naming convention which includes not using the name of a software project in our extra specs; so instead of cyborg we use accelerator, and we would not use libvirt: as the namespace for an extra spec. As this affects how we virtualise the hardware presented to the VM, it would make sense to use the hw: prefix/namespace. Using hw: would also technically allow other virt drivers to implement it in the future, in theory.
Yes, this would probably be a better name.
This extra spec is also used to fill the driver_info.libvirt.image_type field of this VM's BDMs with destination_type=local (driver_info is a new JSON-serialized field of the BDM). Then in compute, if a BDM specifies this field, we use its value instead of [libvirt]/images_type to decide which imagebackend to use for it.

Do you envision allowing different disks to use different storage backends?
Nova can provision 3 types of disk today: the root disk, 0-N additional ephemeral disks, and a swap disk. Today we always use the same storage backend for all 3, so if you have images_type=rbd, the swap and ephemeral disks are also created on ceph. The advantage of `hw:storage_backends=qcow,rbd` is that we could support the following: `hw:storage_backends=swap:qcow,ephemeral:spdk,root:rbd`. So the root disk would be backed by an HA ceph cluster for fault tolerance of a host, enabling evacuation without data loss. The swap disk, which is rarely heavily used, would use qcow storage, providing a moderate level of performance which is appropriate for swap, with the ability to oversubscribe storage as a benefit since most of the time it won't be used. And finally you can have high-speed local scratch disk space backed by SPDK, all in one VM. Allowing different storage types (swap, root, ephemeral) to come from different storage backends is one of the key reasons why we should eventually support multiple storage backends on a single host.
In our environment we don't really use additional ephemeral disks, and swap disks have always used the same storage backend as the root disk. But with some work those changes should be able to support using different storage backends for different disks.
The ability to use a different backend for each storage type is more of a nice-to-have eventually, but it is a way I would hope this could evolve over time.
This method of handling multiple backends only works after being enabled by administrators by setting libvirt:image_type on all their flavors and running nova-manage commands that update existing VMs.
Well, we don't allow modifying the flavor via nova-manage today. The official way to update an existing instance is to resize; however, we do not currently support resizing between storage backends. I think if we were to do this work we would want to require that. I don't think we would provide a nova-manage command for this, since nova-manage is not really intended to be used on the compute node. It can be, but the one command we added there is sort of problematic, though that's a different topic.
We did it this way since we wanted to allow reusing existing flavors, which for us were already storage-backend restricted (but based on aggregates, not traits and BDM properties), with the new multibackend approach. So we would update flavors with the libvirt:image_type property (not using nova-manage), then run the nova-manage script which would update the libvirt:image_type stored in instance.flavor if needed, and then update bdm.driver_info.libvirt.image_type. This way we could "upgrade" to using multibackend without affecting customers.
Ya, that more or less violates how flavors are intended to be used. We may be open to adding an admin command to allow updating the embedded flavor, similar to what we did for image properties, but that would not be acceptable as the primary upgrade mechanism. We are also unlikely to allow you to modify the placement allocations via nova-manage as part of this flavor update. The only thing we would support in this regard upstream is to adopt the new functionality via a resize.

Once selecting an image backend via the flavor is possible, it's part of the API contract, so it's not something that should ever be modified by operators in general. For example, if you currently are using raw files on NVMe SSD and move to rbd, the IOPS will be reduced and it could break workloads like etcd, which generally performs poorly on ceph. As an operator of a private or public cloud this is something you're free to do as a business decision, but from an upstream perspective flavors are intended to be immutable once created. We only allow modifying the flavor extra specs separately from the flavor create so you can add them one at a time, but once a VM is using a flavor you should not modify it again in the future if you are following the best practices for operating an OpenStack cloud. We snapshot the flavor in the instance precisely to prevent operators from adding extra specs in the future and having that break the API contract that was established with the end-user when the instance was created.
In general, we found instances without driver_info.libvirt.image_type set in their BDMs with destination_type=local to cause some issues if we enable [libvirt]/supported_image_types. For example, it would be impossible to tell the storage backend of a BDM in pre_migrate on the dest host, or instances would unshelve with an incorrect storage backend. This is why we decided that it would be better for setting driver_info.libvirt.image_type on all BDMs with destination_type=local to be a prerequisite to setting [libvirt]/supported_image_types to multiple values on any compute.
This is a slightly different problem statement. We often record the existing state of the instance when we make something configurable; we do this by backfilling the previous default or configured value in the DB on compute agent startup when we are introducing the ability to change a value or updating its default. We do not do this as a nova-manage command, however, and when this is done and it impacts the placement allocations we require that you provide a placement reshape to do this automatically on the startup of the compute agent.

Basically, assuming disk would move to a nested resource provider, then on agent startup you will need to do a placement reshape to move the inventory and allocations to the new resource provider for the current instance allocations. As part of that you can also record in the instance, via some means, that it is using RBD or whatever backend is configured. That needs to happen automatically when the agent is started with the new feature enabled. The exact mechanics of that need to be detailed in the relevant spec, in the proposed changes and upgrade impact sections. In https://specs.openstack.org/openstack/nova-specs/specs/train/implemented/cpu... the configuration-options, Flavor extra specs and image metadata properties, Placement inventory and Summary sections are all part of the upgrade impact. You can find a similar description of how in-place upgrades are supported in https://specs.openstack.org/openstack/nova-specs/specs/2023.1/implemented/pc... so that is the same approach you would have to follow.

We should not require that all flavors now have hw:storage_backends, and any flavor without it set should be eligible for any storage backend. Once an instance is allocated on a given backend it would be OK to record that in the instance to ensure it is moved within the same backend type, but existing flavors need to work without modification, and existing instances also need to work after this feature is introduced, without admins using nova-manage to update them.

regards
sean
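P.S. For reference, the reshape on agent startup would be a single POST /reshaper call whose body is roughly shaped as below; the UUIDs, generations and the child provider layout are made up, so treat the placement reshaper API docs as the authoritative schema.

```
# Rough shape of a placement reshape (microversion 1.30+) moving DISK_GB
# inventory from the root compute-node provider to a new child provider that
# carries a storage backend trait. All UUIDs/values are invented.
reshape_body = {
    "inventories": {
        "<root-rp-uuid>": {                      # DISK_GB removed here
            "resource_provider_generation": 5,
            "inventories": {
                "VCPU": {"total": 64},
                "MEMORY_MB": {"total": 262144},
            },
        },
        "<spdk-child-rp-uuid>": {                # new child provider
            "resource_provider_generation": 0,
            "inventories": {"DISK_GB": {"total": 4096}},
        },
    },
    "allocations": {
        "<instance-uuid>": {                     # existing consumer moved over
            "consumer_generation": 1,
            "project_id": "<project-uuid>",
            "user_id": "<user-uuid>",
            "allocations": {
                "<root-rp-uuid>": {"resources": {"VCPU": 2,
                                                 "MEMORY_MB": 4096}},
                "<spdk-child-rp-uuid>": {"resources": {"DISK_GB": 20}},
            },
        },
    },
}
```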