Re: [nova] Ephemeral storage potential features

26 Jan 2026

      Hey @smooney,

I suspect you might've missed my last email in this thread.
It was held in moderation for a while, or it may have got lost over the
holiday break.
In any case I am attaching it here again, since I'd appreciate your input on
some points. And if you feel the "SPDK-based nova-provisioned storage backend"
feature is ready for a blueprint & spec proposal already, please let me know.

I removed some nested parts of the discussion from this email to make sure it
does not get held up by moderation again due to its size.

On 31.12.2025 13:25, Karol Klimaszewski wrote:
...
On 22.12.2025 16:37, Sean Mooney wrote:
...
On 22/12/2025 11:39, Karol Klimaszewski wrote:
...
On 8.12.2025 15:35, Sean Mooney wrote:
...
(...)
Not sure if this is exactly what you mean, but SPDK allows creating bdevs
from VirtIO-Block devices as described here:
https://spdk.io/doc/bdev.html#bdev_config_virtio_blk.
There is also the https://spdk.io/doc/virtio.html library for SPDK
applications.
I can't really say much about it though, since we have not used it at
CloudFerro yet.
no, the vhost-user protocol defines a client and a server.
https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#introduction
we strongly prefer for QEMU to be the server, which is the process that
creates the unix socket, and DPDK or in this case SPDK would be the client,
the process that connects to the socket.
the reason for this is that if QEMU is the vhost-user socket server and you
have to restart the spdk process, it can just reconnect.
if the spdk process is the server, qemu needs to be restarted to reconnect
if the spdk backend is restarted.
this is potentially less of an issue for spdk than for ovs-dpdk but it's an
important point for upgrades.
you need to ensure the spdk binary can be upgraded without impacting the
workloads, or at least while minimising the impact.
ovs-dpdk started with ovs as the server and then we moved to having qemu be
the server so you could upgrade ovs without needing to restart all vms
https://docs.openvswitch.org/en/latest/topics/dpdk/vhost-user/#vhost-user-vs...
it's covered a little here https://spdk.io/doc/vhost_processing.html
so in that parlance we prefer qemu to be the frontend and server and spdk or
dpdk to be the backend and client.
Ok, I understand. Unfortunately it does not seem that SPDK supports this
approach.
The docs (https://spdk.io/doc/vhost_processing.html) only say that:
"SPDK vhost is a Vhost-user back-end server. It exposes Unix domain sockets
and allows external applications to connect."
However, it seems SPDK has a solution to the restart issue you are
mentioning.
I found this github issue: https://github.com/spdk/spdk/issues/1127.
It mentions that patchset https://review.gerrithub.io/c/spdk/spdk/+/471235
(present since SPDK 20.01) solves that problem. SPDK supports a vhost-blk
live recovery feature, which means that even if the SPDK process crashes or
restarts, VMs using SPDK vhostuser devices can reconnect without restarting.
We checked this on one of our environments (with SPDK 24.09) and found it to
be working correctly.
We do however need to dump the vhost controllers at service stop (using
spdk-rpc save_subsystem_config -n vhost-blk)
and recreate them after spdk has restarted (using
spdk-rpc load_subsystem_config -n vhost-blk).
They are (like nvme controllers and lvstores) not persisted across restarts.
Only lvols are.
Instructions on how to do this (or even an entire example spdk setup) could
probably be included in the documentation that would accompany this feature.
...
...
...
(...)
OK, understood. So what is really needed is for the code introduced in this
feature to contain libvirt config-related code for vhost-user block devices
that is generic enough to be used both for the nova image backend I will add
and for potential cinder/cyborg usage of vhost-user block devices. This is
definitely something that can be done.
yes, we don't necessarily need to expand the scope of your work too broadly
but just keep in mind that the config generation should be relatively
generic, i.e. don't calculate the unix socket path in the config code but
rather pass it in from the libvirt driver to the config classes
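
A minimal sketch of what "pass the socket path in" could look like. The class
name VhostUserBlkDiskConfig and its parameters are hypothetical and are not
taken from nova; the element layout follows the libvirt domain XML for
vhostuser disks as I understand it, and the socket path in the usage lines is
only an illustrative value.

```python
import xml.etree.ElementTree as ET


class VhostUserBlkDiskConfig:
    """Hypothetical config class for a libvirt vhost-user block device.

    The unix socket path is supplied by the caller (e.g. the libvirt
    driver) and is never computed here, so the same class could serve a
    nova image backend as well as cinder or cyborg use cases.
    """

    def __init__(self, socket_path, target_dev, reconnect_timeout=10):
        self.socket_path = socket_path
        self.target_dev = target_dev
        self.reconnect_timeout = reconnect_timeout

    def to_xml(self):
        # <disk type='vhostuser' device='disk'>
        disk = ET.Element("disk", type="vhostuser", device="disk")
        ET.SubElement(disk, "driver", name="qemu", type="raw")
        # The socket that the vhost-user backend (here: SPDK) listens on.
        source = ET.SubElement(
            disk, "source", type="unix", path=self.socket_path)
        # Ask QEMU to reconnect if the backend goes away and comes back.
        ET.SubElement(source, "reconnect", enabled="yes",
                      timeout=str(self.reconnect_timeout))
        ET.SubElement(disk, "target", dev=self.target_dev, bus="virtio")
        return ET.tostring(disk, encoding="unicode")


# Usage: the caller decides where the socket lives (path is illustrative).
config = VhostUserBlkDiskConfig("/var/run/spdk/vhost.0", "vdb")
print(config.to_xml())
```

Keeping the path out of the config class means the caller's naming scheme can
change without touching the XML generation.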
(...)
Right, that solves this issue. I completely forgot about the in operator,
sorry.
So the request for allocation candidates could include something like:
resources=VCPU:1,MEMORY_MB:4096
&resources_ROOT_DISK=DISK_GB:20
&required_ROOT_DISK=in:COMPUTE_STORAGE_BACKEND_QCOW,COMPUTE_STORAGE_BACKEND_RBD
&resources_SWAP_DISK=DISK_GB:8
&required_SWAP_DISK=COMPUTE_STORAGE_BACKEND_QCOW
&resources_EPHEMERAL_DISK=DISK_GB:40
&required_EPHEMERAL_DISK=COMPUTE_STORAGE_BACKEND_SPDK
yep, something like that should work.
i say should because we added in but have not really depended on it yet for
any nova feature.
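
A minimal sketch of how such a query string could be assembled, one suffixed
request group per disk. The function name and argument shapes are
hypothetical; amounts use placement's RESOURCE_CLASS:amount form, and other
parameters a real request may need (for example group_policy) are left out.

```python
from urllib.parse import parse_qs, urlencode


def build_allocation_candidates_query(base_resources, disk_groups):
    """Build a query string for GET /allocation_candidates.

    base_resources: dict of resource class -> amount (unsuffixed group).
    disk_groups: dict of group suffix -> (disk_gb, acceptable backend traits).
    """
    params = [("resources", ",".join(
        f"{rc}:{amount}" for rc, amount in base_resources.items()))]
    for suffix, (disk_gb, backends) in disk_groups.items():
        params.append((f"resources_{suffix}", f"DISK_GB:{disk_gb}"))
        # One trait is a plain requirement; several become "in:" (any of).
        if len(backends) == 1:
            required = backends[0]
        else:
            required = "in:" + ",".join(backends)
        params.append((f"required_{suffix}", required))
    # Keep ":" and "," readable instead of percent-encoding them.
    return urlencode(params, safe=":,")


query = build_allocation_candidates_query(
    {"VCPU": 1, "MEMORY_MB": 4096},
    {
        "ROOT_DISK": (20, ["COMPUTE_STORAGE_BACKEND_QCOW",
                           "COMPUTE_STORAGE_BACKEND_RBD"]),
        "SWAP_DISK": (8, ["COMPUTE_STORAGE_BACKEND_QCOW"]),
        "EPHEMERAL_DISK": (40, ["COMPUTE_STORAGE_BACKEND_SPDK"]),
    },
)
print(query)
```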
...
...
(...)
Understood, so I think it would be better to start without this being
supported but introduce it later in a separate blueprint & spec.
+1 you could note it as a future use-case but yes, i would put it out of
scope of the initial work.
we just need to ensure that we don't prevent extending the design to support
it later.
(...)
This is something we are already doing and it will be included in the initial
version of "Multiple nova-provisioned storage backend handling".
To avoid unnecessary DISK_GB allocation migrations we kept the DISK_GB of the
default storage backend in the root resource provider.
This also means that if additional storage backends are not specified by the
operator then the shape of the provider tree doesn't change - only the
appropriate storage backend trait is added to the root resource provider.
Do you feel like this is a good approach? Or would it be better to always
move the DISK_GB of the default storage backend into a separate RP?
you can use it via the flavor directly but what that really means is we
don't know if there are any scaling concerns like the subtree issues you
found.
there are trade-offs, i kind of feel like having a single tree topology is
easier to reason about.
so while we might support both topologies for a transition period i think we
would eventually want to converge on a single topology.
if we intend to support multiple backends eventually, that would require
moving to having storage in a nested RP, so after a release or two we would
likely want to make that the topology even when you have only one storage
backend.
To clear things up: there is a single, unified tree topology here, both in
the case of one storage backend and of many.
With one storage backend (e.g. qcow2):
* compute node RP with inventory (DISK_GB, VCPU, MEMORY_MB)
  and traits (COMPUTE_STORAGE_BACKEND_QCOW2 and others)
With many storage backends (e.g. qcow2, raw, rbd, spdk):
* compute node RP with inventory (DISK_GB, VCPU, MEMORY_MB)
  and traits (COMPUTE_STORAGE_BACKEND_QCOW2, COMPUTE_STORAGE_BACKEND_RAW
  and others)
* child of compute node RP with inventory (DISK_GB)
  and traits (COMPUTE_STORAGE_BACKEND_RBD)
* child of compute node RP with inventory (DISK_GB)
  and traits (COMPUTE_STORAGE_BACKEND_SPDK)
When adding new storage backends, the provider tree is only extended, not
transformed. However, considering your next point, that a host could use
2 ceph clusters with separate providers, maybe it is correct that always
keeping this separate from the root RP would be simpler.
So:
With one storage backend (e.g. qcow2):
* compute node RP with inventory (VCPU, MEMORY_MB) and traits (...)
* child of compute node RP with inventory (DISK_GB)
  and traits (COMPUTE_STORAGE_BACKEND_QCOW2)
With many storage backends (e.g. qcow2, raw, rbd, spdk):
* compute node RP with inventory (VCPU, MEMORY_MB) and traits (...)
* child of compute node RP with inventory (DISK_GB)
  and traits (COMPUTE_STORAGE_BACKEND_QCOW2, COMPUTE_STORAGE_BACKEND_RAW)
* child of compute node RP with inventory (DISK_GB)
  and traits (COMPUTE_STORAGE_BACKEND_RBD)
* child of compute node RP with inventory (DISK_GB)
  and traits (COMPUTE_STORAGE_BACKEND_SPDK)
From the perspective of the scheduler both of those are the same - the same
request for allocation candidates will be made for both.
What do you mean by supporting both topologies? Won't the nova-compute virt
driver be the one that decides which topology is used?
So to the operator it shouldn't matter which topology is used. Or do we want
to do this to allow rollbacks to previous releases?
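
A minimal sketch of the always-nested layout, assuming a simplified
ResourceProvider record. The class, the build_tree helper and the pool names
are hypothetical and only illustrate the shape of the tree; this is not
placement or nova code.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceProvider:
    name: str
    inventory: set
    traits: set = field(default_factory=set)
    children: list = field(default_factory=list)


def build_tree(host, storage_pools):
    """Always-nested layout: DISK_GB lives only on child providers.

    storage_pools: dict of pool name -> set of backend traits sharing it.
    """
    root = ResourceProvider(host, {"VCPU", "MEMORY_MB"})
    for pool, traits in storage_pools.items():
        root.children.append(
            ResourceProvider(f"{host}_{pool}", {"DISK_GB"}, set(traits)))
    return root


# One backend and many backends produce the same kind of tree; only the
# number of children differs.
one = build_tree("cn1", {"fs": {"COMPUTE_STORAGE_BACKEND_QCOW2"}})
many = build_tree("cn1", {
    "fs": {"COMPUTE_STORAGE_BACKEND_QCOW2", "COMPUTE_STORAGE_BACKEND_RAW"},
    "rbd": {"COMPUTE_STORAGE_BACKEND_RBD"},
    "spdk": {"COMPUTE_STORAGE_BACKEND_SPDK"},
})
print(len(one.children), len(many.children))
```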
...
...
We also put the storage backend used by the disk in
BDM.driver_info.libvirt.image_type, since it is impossible to determine it
just by looking at placement allocations - for example, the qcow2 and raw
storage backends will share one resource provider since they use the same
"storage pool" - the filesystem one. For that reason, RPs in this feature are
built around those "storage pools" instead of just storage backends.
ya so even if it's the same backend driver it might be a separate storage
cluster, i.e. 2 hosts might use rbd but use different ceph clusters,
so the representation in placement may not be 1:1 with the drivers enabled
and we will have to model things appropriately in placement to make sure
we don't oversubscribe/report
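
A minimal sketch of grouping enabled backends by the pool they draw capacity
from, so that each pool maps to one DISK_GB provider. The pool identifiers
(a kind plus a path or cluster name) and the function name are hypothetical,
chosen only to illustrate the idea.

```python
from collections import defaultdict


def group_backends_by_pool(backends):
    """Group enabled backends by the storage pool that backs them.

    backends: iterable of (backend_name, pool_id) pairs, where pool_id is a
    hypothetical identifier such as ("filesystem", path) or ("rbd", cluster).
    Returns a dict of pool_id -> sorted backend names; each key would
    correspond to one DISK_GB resource provider.
    """
    pools = defaultdict(list)
    for name, pool_id in backends:
        pools[pool_id].append(name)
    return {pool: sorted(names) for pool, names in pools.items()}


pools = group_backends_by_pool([
    ("qcow2", ("filesystem", "/var/lib/nova/instances")),
    ("raw", ("filesystem", "/var/lib/nova/instances")),
    ("rbd", ("rbd", "ceph-a")),
])
print(pools)
```

Two hosts that both enable rbd but point at different clusters would get
different pool identifiers, and therefore separate capacity accounting.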
...
...
(...)
Ok, understood, thank you. In our version we made it so that for flavors that
don't specify a storage backend, instances use the default storage backend of
the compute (by requiring the EXTRA_STORAGE_BACKEND trait to be absent on the
DISK_GB RP for candidates).
But this is probably a better approach, especially since it is more obvious
and works with instances retaining their storage backend on migration.
I am worried, however, about existing shelved instances.
For existing running instances we can ensure that they keep their original
storage backend by encoding their current storage backend (the default
storage backend of their compute) in their BDM objects on the first
nova-compute startup after the update.
i'm not sure if we will want to use the BDMs for that but it's one of the
logical places we could store it.
I decided to place this information in the BDM based on comments on one of
the previous attempts to introduce this feature.
See
https://review.opendev.org/c/openstack/nova-specs/+/363547/4#message-2cb6e5a....
This also feels like the most natural place for it since:
* It should be in the database so it is accessible, for example, from another
  host during migration.
* It should be part of the BDM since it is a property of a specific disk of
  an instance, not of the instance as a whole.
* It should be in a virt driver specific subfield (driver_info.libvirt)
  because not all virt drivers might support it, or they might handle
  storage backends in a different way.
But I am open to other approaches, if there is a better place for it.
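
A minimal sketch of the backfill for existing running instances mentioned
earlier, using a plain dict in place of a real BDM object. The function name
is hypothetical; the driver_info -> libvirt -> image_type nesting follows the
field path named in this thread.

```python
def backfill_image_type(bdm, default_backend):
    """Record the storage backend on a BDM-like dict if it is not set yet.

    An existing value is never overwritten, so running the backfill again
    after the compute's default backend changes keeps the original choice.
    Returns the backend that is recorded after the call.
    """
    libvirt_info = bdm.setdefault("driver_info", {}).setdefault("libvirt", {})
    return libvirt_info.setdefault("image_type", default_backend)


legacy_bdm = {"device_name": "/dev/vda"}
first = backfill_image_type(legacy_bdm, "qcow2")
# A later run with a different default leaves the stored value untouched.
second = backfill_image_type(legacy_bdm, "rbd")
print(first, second)
```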
...
...
But shelved instances don't belong to any compute. It is impossible to
determine what storage backend they should be using, since previously it was
decided implicitly by where they ended up.
ya so in general shelved instances have their disk stored in glance as an
image and the format does not matter.
the exception to this is images_type=rbd.
shelve and unshelve is the only supported way to move between storage
backends today, i.e. from qcow -> lvm -> raw -> ceph.
today however once you end up on ceph you can't use shelve to move out of it
again.
between the qcow, lvm and raw backends you can use shelve to move between
them in any order.
so in general shelved instances are not tied to any storage backend,
with the caveat that ceph has some limitations today.
...
But with this feature, even if they end up on the same compute they were on
before shelving, they could use a different storage backend. This would
probably have to be mentioned in the upgrade impact section:
"If compute hosts are updated to allow more storage backends then shelved
instances which were allowed to be scheduled on those hosts could have their
storage backend be different than their original one".
so this can work today but that's because when we shelve we snapshot the root
disk and upload it as a new image to glance,
which means when we unshelve we can convert it to any storage backend format
as if it was just booting a new vm from an image.
for ceph we can't generally download the entire disk data from glance to do
that, which is the reason we can generally move to ceph but can't necessarily
move from ceph to something else.
these are all design constraints that will need to be discussed and captured
in the spec when you get to that point,
i.e. to define what is and is not expected to work.
Ok, I understand that there is a use case in which shelve and unshelve should
allow changing the storage backend in use.
But I feel like there is another side to this. I don't know how supported and
how common this approach is (I obviously have a more limited viewpoint on
this than you, since I only know how things are at my workplace), but at
least in our company we have the following use case:
We want instances with specific flavors to only use specific storage
backends.
This is due to differences between storage backends in limitations (limited
mobility of local storage backends, e.g. when evacuating), performance (SPDK
is way faster than RBD, so it suits different workloads) and upkeep costs.
Now, with the introduction of the "Multiple nova-provisioned storage backend
handling" feature, this will be officially supported by setting the
hw:storage_backend extra spec on the flavor. But even before the introduction
of this feature there was a way to do this, by limiting such flavors to
specific aggregates or enforcing this using custom traits.
However, with the introduction of "Multiple nova-provisioned storage backend
handling" an operator might want to multipurpose hosts in those aggregates to
also handle other storage backends. And this will break support for this
"workaround" restriction on the storage backends of flavors.
Now, do we say that such a workaround was never supported in the first place,
do we add some special transition logic (maybe a nova-manage command), or do
we highlight that existing hosts shouldn't have their storage_backends field
expanded (if such a "workaround" is used by the operator), since there might
be some shelved instances that could be scheduled incorrectly?
The last one feels very restrictive and like it could discourage operators
from using multiple storage backends at all.
But maybe it is the best. What is your opinion on this?