Thank you for the response, Sean. Sorry for the late reply. On 28.11.2025 18:16, Sean Mooney wrote:
On 28/11/2025 12:44, Karol Klimaszewski wrote:
Hello,
In my company (CloudFerro) we have developed two features related to nova ephemeral storage (with the libvirt virt driver): "SPDK-based ephemeral storage backend" and "Multiple ephemeral storage backend handling" (both described below). Would these features be appropriate to be added upstream?
We would need to know a little more about them to say, but in principle both could be. One note on wording:
Ephemeral storage in nova refers primarily to the additional non-root disks allocated by nova for a VM, i.e. flavor.ephemeral_gb.
The general term for storage that is provided by nova is "nova provisioned storage";
I avoid using the term ephemeral storage to refer to the nova VM root disk.
OK, thanks for the clarification!
If yes, then should I start by creating a blueprint and proposing a spec for each of them, so it is clearer what we want to introduce?
Of course not both at once, since some of the code for one depends on the other. For us it would probably be best to upstream "SPDK-based ephemeral storage backend" first and then upstream "Multiple ephemeral storage backend handling".
Ack, so you would first like to add a new images_type backend for using vhost-user with a virtio-blk device backed by SPDK, correct? That would allow you to have some hosts with SPDK-backed storage and others with rbd or qcow, for example.
Then, once that capability is available, work on allowing a single host to have multiple storage backends enabled at the same time, so you no longer need to partition your cloud into different sets of hosts with different storage backends, correct?
Yes, that is correct.
Description of SPDK-based ephemeral storage backend:
We add a new possible value for [libvirt]/images_type in nova.conf: spdk_lvol. If this value is set then local disks of instances are handled as logical volumes (lvols) of a locally running SPDK instance (on the same compute host as the VM). See https://spdk.io/doc/logical_volumes.html#lvol for docs on this subject. These are essentially parts of a local NVMe disk managed by SPDK.
SPDK was created as a spin-off from DPDK, so while it's been a few years, I was quite familiar with it when I used to work at Intel.
It can operate over NVMe SSDs, but it's also possible to use it with hard drives. Importantly, that also means you can actually deploy it in a VM in the CI and it can use a loopback block device or similar as a backing store for testing.
That would obviously not be appropriate for production usage, but it means there is no specific hardware requirement to deploy this open source storage solution with devstack or in our CI.
We didn't test such an approach, but it should work, especially since our solution works by creating lvols on top of lvstores that were created prior to nova-compute startup; we do not care whether they sit on top of an NVMe disk, an aio bdev running on a hard drive, or a loopback bdev. Some work would probably be needed to set up an SPDK vhost_tgt instance running in devstack, however.
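For reference, getting a throwaway lvstore for devstack/CI testing could look roughly like this (untested sketch; the RPC method and parameter names are taken from the SPDK JSON-RPC docs and may differ between SPDK releases, and the JSONRPCClient import assumes SPDK's python package is available):
```
from spdk.rpc.client import JSONRPCClient

# talk to the locally running vhost_tgt over its JSON-RPC unix socket
client = JSONRPCClient("/var/tmp/spdk.sock")

# expose a loop device (or a plain file) as an aio bdev ...
client.call("bdev_aio_create", {"filename": "/dev/loop0",
                                "name": "aio0",
                                "block_size": 512})

# ... and create an lvstore on top of it, before nova-compute starts
client.call("bdev_lvol_create_lvstore", {"bdev_name": "aio0",
                                         "lvs_name": "nova-lvs"})
```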
We create and manage those lvols by making calls from nova-compute to the SPDK instance with RPC (https://spdk.io/doc/jsonrpc.html) using the provided python library (https://github.com/spdk/spdk/blob/master/python/README.md). We attach those lvols to instances by exposing them as vhost-blk devices (see https://spdk.io/doc/vhost.html) and specifying them as disks with source_type='vhostuser'.
Yes, this is using the same transport as we use for virtio-net devices with DPDK and OVS.
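To make that flow a bit more concrete, the per-instance calls from nova-compute are roughly the following (simplified sketch; the lvstore, lvol and controller names are just illustrative, error handling is omitted, and older SPDK releases take a byte-sized "size" instead of "size_in_mib"):
```
from spdk.rpc.client import JSONRPCClient

client = JSONRPCClient("/var/tmp/spdk.sock")

# carve a logical volume for the instance's root disk out of the lvstore
client.call("bdev_lvol_create", {"lvs_name": "nova-lvs",
                                 "lvol_name": "instance-0001-root",
                                 "size_in_mib": 20 * 1024})

# expose it over vhost-user; the vhost target creates a socket named after
# the controller in its socket directory, and libvirt points a
# type='vhostuser' disk at that socket
client.call("vhost_create_blk_controller",
            {"ctrlr": "nova.instance-0001-root",
             "dev_name": "nova-lvs/instance-0001-root"})
```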
One of the requirements for this to work is that you have shared memory access, which means using file-backed memory, huge pages or memfd. The former two are already supported in nova, but https://review.opendev.org/c/openstack/nova-specs/+/951689 would make it much more user friendly for operators and end users.
On our end we used huge-pages-based shared memory and it's true that it was a bit problematic to use. Thanks for the heads-up on memfd; that will for sure be useful for SPDK-based instances and we will definitely look into it on our side.
Without using one of those alternative memory modes the VM will start, but the SPDK application will not be able to map the guest memory regions to implement its part of the vhost-user protocol.
From our experience, trying to run vhostuser devices without shared memory is prevented by libvirt; the following error is thrown: libvirt.libvirtError: unsupported configuration: 'vhostuser' requires shared memory
This method of exposing local NVMe storage allows for much better I/O performance than exposing it via the local filesystem.
If we were not to enable this natively in nova, another option may be to do this with cyborg. Cyborg has an SPDK driver for this reason today, not that it's maintained:
https://github.com/openstack/cyborg/tree/master/cyborg/accelerator/drivers/s...
But the nova enablement of using vhost-user for block devices was never completed, so perhaps we could enable both use cases at the same time or collaborate on doing this via cyborg.
One of the use cases I want to bring to cyborg going forward is the ability to attach additional local storage resources to a VM; I was initially thinking of an LVM driver and an NVMe namespace driver to supplement the SPDK and SSD drivers that already exist.
I am not very familiar with the cyborg component, but from what I understand this would entirely separate SPDK disks from nova's imagebackend: SPDK-based disks would be specified using something like accel:device_profile and it would only work for additional disks, not the root disk. Is this correct? Would doing it via cyborg be preferred? To be honest it would be easier for me to introduce the nova-based approach since we already have a version of it working in our environment, but if necessary I can look into how it would be done with cyborg.
We currently have the following working with this storage backend: creating, deleting, cold-migrating, shelving, snapshotting and unshelving VMs. We don't have live migration working yet but have it in our plans.
Live migration, I think, would only require pre-creating the relevant SPDK resources in pre_live_migration on the destination host, correct? It should be possible to generate stable vhost-user socket paths, but if not you would also need to update the live migration data object returned by pre_live_migration and extend the XML update logic to modify the paths when generating the migration XML.
As far as I am aware, QEMU/libvirt fully support live migration with SPDK and have for a long time.
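That XML rewrite step could be something along these lines (rough, untested sketch; the helper name and the per-disk path mapping are made up for illustration):
```
from lxml import etree

def rewrite_vhostuser_paths(domain_xml, new_socket_paths):
    """Replace vhost-user disk socket paths, keyed by target dev (e.g. 'vda')."""
    tree = etree.fromstring(domain_xml.encode())
    for disk in tree.findall("./devices/disk[@type='vhostuser']"):
        dev = disk.find("target").get("dev")
        if dev in new_socket_paths:
            disk.find("source").set("path", new_socket_paths[dev])
    return etree.tostring(tree).decode()
```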
I will look into this, thank you. With this approach, will data on the SPDK disk be the same as before the migration? Also, do you feel that live migration support is needed in the initial version of this feature?
This feature includes changes only to nova.
Description of Multiple ephemeral storage backend handling:
We add a new configuration option [libvirt]/supported_image_types to the nova.conf of nova-compute and change the meaning of [libvirt]/images_type to be the default image type.
We can rathole on the name, but we would probably not use supported_image_types as that can easily be confused with the format of the image.
I would generally propose deprecating [libvirt]/images_type
and introducing a preferentially ordered list of storage backends:
```
[libvirt]
storage_backend=spdk,rbd,qcow
```
i.e. spdk is the default, followed by rbd, followed by qcow.
I agree that this approach looks better. We kept images_type in our version since we didn't want to stray too far from upstream in our changes.
That would allow you to have a new flavor extra spec hw:storage_backends=qcow,rbd, meaning this flavor would prefer to select hosts with qcow storage first and rbd second, but not use spdk, e.g. because it does not use hugepages and would not work.
One problem I see with this approach is that placement does not currently allow alternatives in a request for allocation candidates. If we wanted either a qcow or rbd storage backend we would need two requests: one for DISK_GB on an RP with the trait COMPUTE_STORAGE_BACKEND_QCOW and another for DISK_GB on an RP with the trait COMPUTE_STORAGE_BACKEND_RBD. And if we included different storage backends for root, swap and ephemeral we would need a request for every combination. But besides that this could probably also work.
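To illustrate what I mean, without alternatives each acceptable backend needs its own allocation candidates request and the results have to be merged by the caller (rough sketch only, not how the scheduler actually builds its requests; the trait names are the hypothetical ones from above):
```
import urllib.parse

def allocation_candidates_query(disk_gb, backend_trait):
    # unsuffixed request group: all resources plus the required backend trait
    return urllib.parse.urlencode({
        "resources": "VCPU:2,MEMORY_MB:4096,DISK_GB:%d" % disk_gb,
        "required": backend_trait,
    })

# one GET /allocation_candidates per acceptable backend
for trait in ("COMPUTE_STORAGE_BACKEND_QCOW", "COMPUTE_STORAGE_BACKEND_RBD"):
    print("GET /allocation_candidates?" + allocation_candidates_query(50, trait))
```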
If the libvirt:image_type extra spec is specified in a flavor then the VM is scheduled on a compute with the appropriate image type in its [libvirt]/supported_image_types. We use potentially multiple DISK_GB resource providers and construct appropriate request groups to handle this scheduling.
Yeah, this is similar to what I said above; I just think image_type is a bit too generic and misleading, as it has nothing to do with the actual glance image, hence `hw:storage_backends` or just `hw:storage_backend`.
We also have a naming convention which includes not using the name of a software project in our extra specs,
so instead of cyborg we use accelerator, and we would not use libvirt: as the namespace for an extra spec.
As these affect how we virtualise the hardware presented to the VM, it would make sense to use the hw: prefix/namespace.
Using hw: would also technically allow other virt drivers to implement it in the future, in theory.
Yes, this would probably be a better name.
This spec is also used to fill the driver_info.libvirt.image_type field of this VM's BDMs with destination_type=local (driver_info is a new JSON serialized field of the BDM). Then, on the compute, if a BDM specifies this field we use its value instead of [libvirt]/images_type to decide on the imagebackend to use for it.
Do you envision allowing different disks to use different storage backends?
Nova can provision 3 types of disk today: the root disk, 0-N additional ephemeral disks and a swap disk.
Today we always use the same storage backend for all 3,
so if you have images_type=rbd the swap and ephemeral disks are also created on ceph.
The advantage of `hw:storage_backends=qcow,rbd` is that we could support the following: `hw:storage_backends=swap:qcow,ephemeral:spdk,root:rbd`.
So the root disk would be backed by an HA ceph cluster for fault tolerance of the host, enabling evacuation without data loss. The swap disk, which is rarely heavily used, would use qcow storage, providing a moderate level of performance that is appropriate for swap, with the ability to oversubscribe storage as a benefit since most of the time it won't be used. And finally you can have high-speed local scratch disk space backed by SPDK, all in one VM.
So allowing different disk types (swap, root, ephemeral) to come from different storage backends is one of the key reasons why we should eventually support multiple storage backends on a single host.
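Parsing such a value could be as simple as something like this (untested sketch, names purely illustrative):
```
def parse_storage_backends(extra_spec_value):
    """Return {disk_type: [backends in preference order]} for a
    hw:storage_backends value such as "swap:qcow,ephemeral:spdk,root:rbd"
    or a plain ordered list such as "qcow,rbd"."""
    disk_types = ("root", "ephemeral", "swap")
    per_disk = {d: [] for d in disk_types}
    for item in extra_spec_value.split(","):
        if ":" in item:
            disk, backend = item.split(":", 1)
            per_disk.setdefault(disk, []).append(backend)
        else:
            # a bare backend name applies to every disk type
            for d in disk_types:
                per_disk[d].append(item)
    return per_disk
```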
In our environment we don't really use additional ephemeral disks, and swap disks always used the same storage backend as the root disk. But with some work those changes should be able to support using different storage backends for different disks.
This method of handling multiple backends only works after administrators enable it by setting libvirt:image_type on all their flavors and running nova-manage commands that update existing VMs.
Well, we don't allow modifying the flavor via nova-manage today.
The official way to update existing instances is to resize; however, we do not currently support resizing between storage backends. I think if we were to do this work we would want to require that.
I don't think we would provide a nova-manage command for this, since nova-manage is not really intended to be used on the compute node. It can be, but the one command we added there is sort of problematic; that's a different topic though.
We did it this way since we wanted to allow reusing existing flavors, which for us were already storage-backend-restricted (but based on aggregates, not traits and BDM properties), with the new multibackend approach. So we would update flavors with the libvirt:image_type property (not using nova-manage), then run the nova-manage script which would update libvirt:image_type stored in instance.flavor if needed and then update bdm.driver_info.libvirt.image_type. This way we could "upgrade" to using multibackend without affecting customers. In general we found that instances without driver_info.libvirt.image_type set in their BDMs with destination_type=local cause some issues if we enable [libvirt]/supported_image_types. For example it would be impossible to tell the storage backend of a BDM in pre_migrate on the dest host, or instances would unshelve with an incorrect storage backend. This is why we decided that it would be better for setting driver_info.libvirt.image_type on all BDMs with destination_type=local to be a prerequisite to setting [libvirt]/supported_image_types to multiple values on any compute.
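In pseudo-code terms, the BDM part of that script is essentially the following (heavily simplified sketch; the real code goes through nova objects and handles errors, and image_type here is just the value looked up from the instance's flavor extra specs):
```
import json

def tag_local_bdms(bdms, image_type):
    """Set driver_info.libvirt.image_type on every local BDM (sketch only)."""
    for bdm in bdms:
        if bdm.destination_type != "local":
            continue
        # driver_info is stored as a JSON serialized string on the BDM
        driver_info = json.loads(bdm.driver_info or "{}")
        driver_info.setdefault("libvirt", {})["image_type"] = image_type
        bdm.driver_info = json.dumps(driver_info)
        bdm.save()
```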