Hey @smooney, I suspect you might have missed my last email in this thread. It was held in moderation for a while, or it may have got lost over the holiday break. In any case I am attaching it here again, since I'd appreciate your input on some points. And if you feel the "SPDK-based nova-provisioned storage backend" feature is already ready for a blueprint & spec proposal, please let me know. I removed some nested parts of the discussion from this email so that it does not get held up in moderation again due to its size.
On 31.12.2025 13:25, Karol Klimaszewski wrote:
On 22.12.2025 16:37, Sean Mooney wrote:
On 22/12/2025 11:39, Karol Klimaszewski wrote:
On 8.12.2025 15:35, Sean Mooney wrote:
(...) Not sure if this is exactly what you mean, but SPDK allows creating bdevs from VirtIO-Block devices as described here: https://spdk.io/doc/bdev.html#bdev_config_virtio_blk. There is also a virtio library for SPDK applications: https://spdk.io/doc/virtio.html. I can't really say much about it though, since we have not used it at CloudFerro yet.
no, the vhost-user protocol defines a client and a server. https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#introduction
we strongly prefer QEMU to be the server, i.e. the process that creates the unix socket, and DPDK (or in this case SPDK) to be the client, i.e. the process that connects to the socket.
the reason for this is that if QEMU is the vhost-user socket server and you have to restart the spdk process, it can just reconnect. if the spdk process is the server, qemu needs to be restarted to reconnect when the spdk backend is restarted.
this is potentially less of an issue for spdk than for ovs-dpdk, but it's an important point for upgrades. you need to ensure the spdk binary can be upgraded without impacting the workloads, or at least while minimising the impact.
ovs-dpdk started with ovs as the server and then we moved to having qemu be the server, so you could upgrade ovs without needing to restart all vms https://docs.openvswitch.org/en/latest/topics/dpdk/vhost-user/#vhost-user-vs...
it's covered a little here https://spdk.io/doc/vhost_processing.html
so in that parlance we prefer qemu to be the frontend and server, and spdk or dpdk to be the backend and client.
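The reconnect argument above can be illustrated with plain unix sockets. This is only an analogy under stated assumptions: it does not speak the vhost-user protocol, and the function names and socket file name are made up for the example. The long-lived listener stands in for QEMU as server, and each short-lived connector stands in for one "generation" of the backend process.

```python
import os
import socket
import tempfile


def backend_connect_and_send(path, payload):
    """Stand-in for the backend (SPDK/DPDK) process: connect, send, exit."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(path)
        client.sendall(payload)


def frontend_accept_one(listener):
    """Stand-in for the frontend (QEMU): accept on the long-lived socket."""
    conn, _ = listener.accept()
    with conn:
        return conn.recv(64)


def demo():
    """Two backend 'generations' connect to the same listening socket."""
    received = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "vhost-user.sock")  # hypothetical name
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as listener:
            listener.bind(path)
            listener.listen(1)
            for payload in (b"backend-v1", b"backend-v2"):
                # The connection is queued in the listen backlog, so this
                # single-threaded demo can connect before calling accept().
                backend_connect_and_send(path, payload)
                received.append(frontend_accept_one(listener))
    return received
```

Because the listening side never goes away, the second "backend" connects to the same socket path without the listener being recreated. With the roles reversed, the side that holds the workload would have to notice the dead socket and reconnect.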
Ok, I understand. Unfortunately it does not seem that SPDK supports this approach. The docs (https://spdk.io/doc/vhost_processing.html) only say that:
"SPDK vhost is a Vhost-user back-end server. It exposes Unix domain sockets and allows external applications to connect."
However, it seems SPDK has a solution to the restart issue you are mentioning. I found this GitHub issue: https://github.com/spdk/spdk/issues/1127. It mentions that the patchset https://review.gerrithub.io/c/spdk/spdk/+/471235 (present since SPDK 20.01) solves that problem. SPDK supports a vhost-blk live recovery feature, which means that even if the SPDK process crashes or restarts, VMs using SPDK vhost-user devices can reconnect without restarting.
We checked this on one of our environments (with SPDK 24.09) and found it to be working correctly.
We do however need to dump vhost controllers at service stop (using spdk-rpc save_subsystem_config -n vhost-blk) and recreate them after spdk has restarted (using spdk-rpc load_subsystem_config -n vhost-blk). They are (like NVMe controllers and lvstores) not persisted across restarts; only lvols are. Instructions on how to do this (or even an entire example SPDK setup) could probably be included in the documentation that would accompany this feature.
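A minimal sketch of the stop/start hooks described above. The two command lines are copied verbatim from the message. Everything else is an assumption for illustration: the helper names, the state file, and in particular that the save command prints JSON on stdout and the load command reads JSON from stdin.

```python
import subprocess
from pathlib import Path

# Commands as quoted in the message; the "spdk-rpc" name is deployment-specific.
SAVE_CMD = ["spdk-rpc", "save_subsystem_config", "-n", "vhost-blk"]
LOAD_CMD = ["spdk-rpc", "load_subsystem_config", "-n", "vhost-blk"]


def run_cli(argv, stdin_text=None):
    """Run a command, optionally feeding stdin, and return its stdout."""
    result = subprocess.run(
        argv, input=stdin_text, capture_output=True, text=True, check=True
    )
    return result.stdout


def save_vhost_config(state_file, run=run_cli):
    """At service stop: dump the vhost controllers to a state file."""
    # Assumption: the save command writes the config as JSON to stdout.
    Path(state_file).write_text(run(SAVE_CMD, None))


def restore_vhost_config(state_file, run=run_cli):
    """After SPDK has restarted: recreate the controllers, if any were saved."""
    state_file = Path(state_file)
    if state_file.exists():
        # Assumption: the load command reads the JSON config from stdin.
        run(LOAD_CMD, state_file.read_text())
```

The `run` parameter is injectable so the hooks can be exercised without an SPDK installation. In a real deployment they would more likely be wired into the service manager's stop and post-start steps.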
(...)
OK, understood. So what is really needed is for the code introduced in this feature to contain libvirt config-related code for vhost-user block devices that is generic enough to be used both by the nova image backend I will add and by potential cinder/cyborg usage of vhost-user block devices. This is definitely something that can be done.
yes, we don't necessarily need to expand the scope of your work too broadly, but just keep in mind that the config generation should be relatively generic, i.e. don't calculate the unix socket path in the config code but rather pass it in from the libvirt driver to the config classes
(...)
Right, that solves this issue. I completely forgot about the in operator, sorry. So the request for allocation candidates could include something like:
resources=VCPU:1,MEMORY_MB:4096
&resources_ROOT_DISK=DISK_GB:20
&required_ROOT_DISK=in:COMPUTE_STORAGE_BACKEND_QCOW,COMPUTE_STORAGE_BACKEND_RBD
&resources_SWAP_DISK=DISK_GB:8
&required_SWAP_DISK=COMPUTE_STORAGE_BACKEND_QCOW
&resources_EPHEMERAL_DISK=DISK_GB:40
&required_EPHEMERAL_DISK=COMPUTE_STORAGE_BACKEND_SPDK
yep something like that should work.
i say should because we added in but have not really depended on it yet for any nova feature.
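As a rough sketch, the per-disk request groups discussed above could be assembled like this. The helper names are hypothetical, and the group suffixes and COMPUTE_STORAGE_BACKEND_* traits are the ones proposed in this thread, not existing traits. The sketch writes amounts in placement's `RESOURCE_CLASS:amount` form and does not URL-encode the result.

```python
def format_resources(resources):
    """Render {"DISK_GB": 20} as "DISK_GB:20" (comma-separated)."""
    return ",".join(f"{rc}:{amount}" for rc, amount in resources.items())


def format_required(traits):
    """One trait is required as-is; several become an any-of "in:" list."""
    if len(traits) == 1:
        return traits[0]
    return "in:" + ",".join(traits)


def build_candidates_query(unsuffixed, groups):
    """Build the query string for an allocation candidates request.

    unsuffixed: resources for the unsuffixed group, e.g. VCPU and MEMORY_MB.
    groups: {suffix: (resources, acceptable_backend_traits)} per disk.
    """
    parts = [f"resources={format_resources(unsuffixed)}"]
    for suffix, (resources, traits) in groups.items():
        parts.append(f"resources_{suffix}={format_resources(resources)}")
        parts.append(f"required_{suffix}={format_required(traits)}")
    return "&".join(parts)


query = build_candidates_query(
    {"VCPU": 1, "MEMORY_MB": 4096},
    {
        "ROOT_DISK": (
            {"DISK_GB": 20},
            ["COMPUTE_STORAGE_BACKEND_QCOW", "COMPUTE_STORAGE_BACKEND_RBD"],
        ),
        "SWAP_DISK": ({"DISK_GB": 8}, ["COMPUTE_STORAGE_BACKEND_QCOW"]),
        "EPHEMERAL_DISK": ({"DISK_GB": 40}, ["COMPUTE_STORAGE_BACKEND_SPDK"]),
    },
)
```

Keeping one group per disk means each disk's DISK_GB can be satisfied by a different resource provider, which is what allows the root, swap and ephemeral disks to land on different backends.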
(...)
Understood, so I think it would be better to start without this being supported, but introduce it later in a separate blueprint & spec.
+1 you could note it as a future use-case, but yes i would put it out of scope of the initial work. we just need to ensure that we don't prevent extending the design to support it later.
(...)
This is something we are already doing and it will be included in the initial version of "Multiple nova-provisioned storage backend handling". To avoid unnecessary DISK_GB allocation migrations we kept the DISK_GB of the default storage backend in the root resource provider. This also means that if additional storage backends are not specified by the operator then the shape of the provider tree doesn't change - only the appropriate storage backend trait is added to the root resource provider. Do you feel like this is a good approach? Or would it be better to always move the DISK_GB of the default storage backend into a separate RP?
you can use it via the flavor directly, but what that really means is we don't know if there are any scaling concerns like the subtree issues you found.
there are trade-offs, i kind of feel like having a single tree topology is easier to reason about. so while we might support both topologies for a transition period, i think we would eventually want to converge on a single topology. if we intend to support multiple backends eventually, that would require moving to having storage in a nested RP, so after a release or two we would likely want to make that the topology even when you have only one storage backend.
To clear things up: There is a single, unified tree topology here, both in the case of one storage backend and many.
With one storage backend (e.g. qcow2):
* compute node RP with inventory (DISK_GB, VCPU, MEMORY_MB) and traits (COMPUTE_STORAGE_BACKEND_QCOW2 and others)
With many storage backends (e.g. qcow2, raw, rbd, spdk):
* compute node RP with inventory (DISK_GB, VCPU, MEMORY_MB) and traits (COMPUTE_STORAGE_BACKEND_QCOW2, COMPUTE_STORAGE_BACKEND_RAW and others)
* child of compute node RP with inventory (DISK_GB) and traits (COMPUTE_STORAGE_BACKEND_RBD)
* child of compute node RP with inventory (DISK_GB) and traits (COMPUTE_STORAGE_BACKEND_SPDK)
When adding new storage backends, the provider tree is only extended, not transformed. However, considering your next point, that a host could use 2 ceph clusters with separate providers, maybe it is correct that always keeping this separate from the root RP would be simpler. So:
With one storage backend (e.g. qcow2):
* compute node RP with inventory (VCPU, MEMORY_MB) and traits (...)
* child of compute node RP with inventory (DISK_GB) and traits (COMPUTE_STORAGE_BACKEND_QCOW2)
With many storage backends (e.g. qcow2, raw, rbd, spdk):
* compute node RP with inventory (VCPU, MEMORY_MB) and traits (...)
* child of compute node RP with inventory (DISK_GB) and traits (COMPUTE_STORAGE_BACKEND_QCOW2, COMPUTE_STORAGE_BACKEND_RAW)
* child of compute node RP with inventory (DISK_GB) and traits (COMPUTE_STORAGE_BACKEND_RBD)
* child of compute node RP with inventory (DISK_GB) and traits (COMPUTE_STORAGE_BACKEND_SPDK)
From the perspective of the scheduler both of those are the same - for both, the same request for allocation candidates will be made. What do you mean by supporting both topologies? Won't the nova-compute virt driver be the one that decides which topology is used? So to the operator it shouldn't matter which topology is used. Or do we want to do this to allow rollbacks to previous releases?
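The two layouts can be compared with a toy model. This is a sketch only: the `Provider` class, the provider names, the inventory sizes and the matching function are illustrative and are not placement's actual candidate logic. It shows that a disk request phrased as "DISK_GB plus a backend trait on the same provider" is satisfied in both layouts, just by a different provider.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

QCOW2 = "COMPUTE_STORAGE_BACKEND_QCOW2"
RAW = "COMPUTE_STORAGE_BACKEND_RAW"
RBD = "COMPUTE_STORAGE_BACKEND_RBD"
SPDK = "COMPUTE_STORAGE_BACKEND_SPDK"


@dataclass
class Provider:
    name: str
    inventory: Dict[str, int] = field(default_factory=dict)
    traits: Set[str] = field(default_factory=set)
    children: List["Provider"] = field(default_factory=list)


def walk(rp):
    """Yield a provider and all of its descendants."""
    yield rp
    for child in rp.children:
        yield from walk(child)


def disk_candidates(root, size_gb, backend_trait):
    """Names of providers offering enough DISK_GB with the backend trait."""
    return [
        rp.name
        for rp in walk(root)
        if rp.inventory.get("DISK_GB", 0) >= size_gb and backend_trait in rp.traits
    ]


def extra_backends():
    return [
        Provider("compute_rbd", {"DISK_GB": 100}, {RBD}),
        Provider("compute_spdk", {"DISK_GB": 100}, {SPDK}),
    ]


# Layout 1: the default (filesystem) backend keeps DISK_GB on the root RP.
default_on_root = Provider(
    "compute",
    {"VCPU": 8, "MEMORY_MB": 16384, "DISK_GB": 100},
    {QCOW2, RAW},
    extra_backends(),
)

# Layout 2: every storage pool, including the default one, is a child RP.
always_nested = Provider(
    "compute",
    {"VCPU": 8, "MEMORY_MB": 16384},
    set(),
    [Provider("compute_fs", {"DISK_GB": 100}, {QCOW2, RAW})] + extra_backends(),
)
```

In this model `disk_candidates(default_on_root, 20, QCOW2)` returns the root provider while `disk_candidates(always_nested, 20, QCOW2)` returns the filesystem child, and the request itself is identical in both cases.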
We also put the storage backend used by the disk in BDM.driver_info.libvirt.image_type, since it is impossible to determine it just by looking at placement allocations - for example, storage backends qcow2 and raw will share one resource provider since they use the same "storage pool" - the filesystem one. For that reason, RPs in this feature are built around those "storage pools" instead of just storage backends.
ya so even if it's the same backend driver it might be a separate storage cluster, i.e. 2 hosts might use rbd but use different ceph clusters
so the representation in placement may not be 1:1 with the drivers enabled, and we will have to model things appropriately in placement to make sure we don't oversubscribe/report
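One way to picture "providers follow storage pools, not drivers" is to group the configured backends by the pool they draw capacity from. This is a sketch under assumptions: the dictionary keys (`type`, `path`, `cluster`, `pool`, `name`) and the idea of keying rbd on cluster plus pool are hypothetical, not an existing nova configuration format.

```python
from collections import defaultdict


def pool_key(backend):
    """Identify the capacity pool a configured backend consumes from."""
    kind = backend["type"]
    if kind in ("qcow2", "raw"):
        # Both formats are files on the same filesystem.
        return ("filesystem", backend["path"])
    if kind == "rbd":
        # Same driver, but capacity belongs to a specific cluster and pool.
        return ("rbd", backend["cluster"], backend["pool"])
    return (kind, backend.get("name", "default"))


def providers_for(backends):
    """Map each pool to the sorted backend traits its provider would carry."""
    pools = defaultdict(set)
    for backend in backends:
        trait = "COMPUTE_STORAGE_BACKEND_" + backend["type"].upper()
        pools[pool_key(backend)].add(trait)
    return {key: sorted(traits) for key, traits in pools.items()}


example = providers_for([
    {"type": "qcow2", "path": "/var/lib/nova/instances"},
    {"type": "raw", "path": "/var/lib/nova/instances"},
    {"type": "rbd", "cluster": "ceph-a", "pool": "vms"},
    {"type": "rbd", "cluster": "ceph-b", "pool": "vms"},
])
```

Here four configured backends collapse into three providers: qcow2 and raw share the filesystem pool, while the two rbd entries stay separate because they report capacity from different clusters.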
(...) Ok, understood, thank you. In our version we made it so that for flavors that don't specify a storage backend, instances use the default storage backend of the compute (by requiring the EXTRA_STORAGE_BACKEND trait to be absent on the DISK_GB RP for candidates). But this is probably a better approach, especially since it is more obvious and works with instances retaining their storage backend on migration.
I am worried however about existing shelved instances.
For existing running instances we can ensure that they keep their original storage backend by encoding their current storage backend (the default storage backend of their compute) in their BDM objects on the first nova-compute startup after the update.
i'm not sure if we will want to use the BDMs for that, but it's one of the logical places we could store it.
I decided on placing this information in the BDM based on comments on one of the previous attempts to introduce this feature. See https://review.opendev.org/c/openstack/nova-specs/+/363547/4#message-2cb6e5a.... This also feels like the most natural place for it since:
* It should be in the database, to be accessible for example from another host during migration
* It should be part of the BDM, since it is a property of a specific disk of an instance, not the instance as a whole.
* It should be in a virt driver specific subfield (driver_info.libvirt), because not all virt drivers might support it or they might handle storage backends in a different way.
But I am open to other approaches, if there is a better place for it.
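The one-time backfill mentioned earlier could look roughly like this. It is a sketch on plain dictionaries rather than nova's BDM objects. The `driver_info.libvirt.image_type` path is the field proposed in this thread, and the function name is hypothetical.

```python
import copy


def backfill_image_type(bdm, default_backend):
    """Record the compute's default backend on a local disk that has none.

    Returns an updated copy; volume-backed BDMs and BDMs that already
    carry an image_type are returned unchanged.
    """
    updated = copy.deepcopy(bdm)
    if updated.get("destination_type") != "local":
        return updated  # not a nova-provisioned disk
    driver_info = updated.get("driver_info") or {}
    libvirt_info = driver_info.setdefault("libvirt", {})
    # setdefault keeps a value written earlier, so reruns are harmless.
    libvirt_info.setdefault("image_type", default_backend)
    updated["driver_info"] = driver_info
    return updated
```

Making the function idempotent matters because it would run on every nova-compute startup after the upgrade, not only the first one.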
But shelved instances don't belong to any compute. It is impossible to determine what storage backend they should be using, since previously it was implicitly decided by the compute they ended up on.
ya so in general shelved instances have their disk stored in glance as an image and the format does not matter. the exception to this is images_type=rbd.
shelve and unshelve is the only supported way to move between storage backends today, i.e. from qcow -> lvm -> raw -> ceph. however, once you end up on ceph you can't use shelve to move out of it again. between the qcow, lvm and raw backends you can use shelve to move between them in any order.
so in general shelved instances are not tied to any storage backend, with the caveat that ceph has some limitations today.
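The rule of thumb stated above can be written down as a small predicate. This is a sketch of the description in this thread only: the function and backend names are illustrative, and it deliberately ignores the "generally" qualifier, so a spec would need to define the real edge cases.

```python
LOCAL_BACKENDS = {"qcow2", "lvm", "raw"}


def can_move_via_shelve(source, target):
    """Whether shelve/unshelve can move a disk from source to target backend."""
    if source == target:
        return True
    if source == "rbd":
        # Per the thread: once on ceph, shelve cannot be used to move out.
        return False
    # qcow2, lvm and raw can move between each other and into rbd.
    return source in LOCAL_BACKENDS and target in LOCAL_BACKENDS | {"rbd"}
```

For example, `can_move_via_shelve("qcow2", "rbd")` is true while `can_move_via_shelve("rbd", "qcow2")` is false, which is the asymmetry a multi-backend scheduler would have to respect for shelved instances.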
But with this feature even if they end up on the same compute they were on before shelving they could use a different storage backend. This would probably have to be mentioned in upgrade impact section: "If compute hosts are updated to allow more storage backends then shelved instances which were allowed to be scheduled on those hosts could have their storage backend be different than their original one".
so this can work today, but that's because when we shelve we snapshot the root disk and upload it as a new image to glance, which means when we unshelve we can convert it to any storage backend format as if it was just booting a new vm from an image. for ceph we can't generally download the entire disk data from glance to do that, which is the reason we can generally move to ceph but can't necessarily move from ceph to something else.
these are all design constraints that will need to be discussed and captured in the spec when you get to that point, i.e. to define what is and is not expected to work.
Ok, I understand that there is a use case where shelve and unshelve should allow changing the storage backend in use. But I feel like there is another side to this. I don't know how supported and how common this approach is (I obviously have a more limited viewpoint on this than you, since I only know how things are at my workplace), but at least in our company we have the following use case: We want instances with specific flavors to only use specific storage backends. This is due to differences between storage backends in limitations (limited mobility of local storage backends, e.g. when evacuating), performance (SPDK is way faster than RBD, so it is suited to different workloads) and upkeep costs.
Now, with the introduction of the "Multiple nova-provisioned storage backend handling" feature, this will be officially supported by setting the hw:storage_backend extra spec on the flavor. But even before the introduction of this feature there was a way to do this, by limiting such flavors to specific aggregates or enforcing this using custom traits. However, with the introduction of "Multiple nova-provisioned storage backend handling", an operator might want to multipurpose hosts in those aggregates to also handle other storage backends. And this will break support for this "workaround" restriction on the storage backends of flavors.
Now, do we say that such a workaround was never supported in the first place, do we add some special transition logic (maybe a nova-manage command), or do we highlight that existing hosts shouldn't have their storage_backends field expanded (if such a "workaround" is used by the operator), since there might be some shelved instances that could be scheduled incorrectly? The last one feels very restrictive, and like it could discourage operators from using multiple storage backends at all. But maybe it is the best. What is your opinion on this?