Thank you for the response, Sean. Sorry for the late reply. On 28.11.2025 18:16, Sean Mooney wrote:
On 28/11/2025 12:44, Karol Klimaszewski wrote:
Hello,
In my company (CloudFerro) we have developed two features related to nova ephemeral storage (with the libvirt virt driver): "SPDK-based ephemeral storage backend" and "Multiple ephemeral storage backend handling" (both described below). Would these features be appropriate to be added upstream?
We would need to know a little more about them to say, but in principle both could be. One note on wording:
Ephemeral storage in nova refers primarily to the additional non-root disks allocated by nova for a VM, i.e. flavor.ephemeral_gb.
The general term for storage that is provided by nova is "nova provisioned storage";
I avoid using the term ephemeral storage to refer to the nova VM root disk.
OK, thanks for the clarification!
If yes, then should I start by creating a blueprint and proposing a spec for each of them, so it is clearer what we want to introduce?
Of course not both at once, since some of the code for one depends on the other. For us it would probably be best to upstream "SPDK-based ephemeral storage backend" first and then upstream "Multiple ephemeral storage backend handling".
Ack, so you would first like to add a new images_type backend for using vhost-user with a virtio-blk device backed by SPDK, correct? That would allow you to have some hosts with SPDK-backed storage and others with rbd or qcow, for example.
Then, once that capability is available, work on allowing a single host to have multiple storage backends enabled at the same time, so you no longer need to partition your cloud into different sets of hosts with different storage backends, correct?
Yes, that is correct.
Description of SPDK-based ephemeral storage backend:
We add a new possible value for [libvirt]/images_type in nova.conf: spdk_lvol. If this value is set then local disks of instances are handled as logical volumes (lvols) of a locally running SPDK instance (on the same compute host as the VM). See https://spdk.io/doc/logical_volumes.html#lvol for docs on this subject. These are essentially parts of a local NVMe disk managed by SPDK.
SPDK was created as a spin-off from DPDK, so while it's been a few years, I was quite familiar with it when I used to work at Intel.
It can operate over NVMe SSDs, but it's also possible to use it with hard drives. Importantly, that also means you can actually deploy it in a VM in the CI and it can use a loopback block device or similar as a backing store for testing.
That would obviously not be appropriate for production usage, but it means there is no specific hardware requirement to deploy this open source storage solution with devstack or in our CI.
We didn't test such an approach, but it should work, especially since our solution works by creating lvols on top of lvstores that were created prior to nova-compute startup; we do not care whether they sit on top of an NVMe disk, an aio bdev running on a hard drive, or a loopback bdev. Some work would probably be needed to set up an SPDK vhost_tgt instance running in devstack, however.
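For reference, getting a throwaway lvstore for devstack/CI testing could look roughly like this (untested sketch; the RPC method and parameter names are taken from the SPDK JSON-RPC docs and may differ between SPDK releases, and the JSONRPCClient import assumes SPDK's python package is available):
```
from spdk.rpc.client import JSONRPCClient

# talk to the locally running vhost_tgt over its JSON-RPC unix socket
client = JSONRPCClient("/var/tmp/spdk.sock")

# expose a loop device (or a plain file) as an aio bdev ...
client.call("bdev_aio_create", {"filename": "/dev/loop0",
                                "name": "aio0",
                                "block_size": 512})

# ... and create an lvstore on top of it, before nova-compute starts
client.call("bdev_lvol_create_lvstore", {"bdev_name": "aio0",
                                         "lvs_name": "nova-lvs"})
```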
We create and manage those lvols by making calls from nova-compute to the SPDK instance with RPC (https://spdk.io/doc/jsonrpc.html) using the provided python library (https://github.com/spdk/spdk/blob/master/python/README.md). We attach those lvols to instances by exposing them as vhost-blk devices (see https://spdk.io/doc/vhost.html) and specifying them as disks with source_type='vhostuser'.
Yes, this is using the same transport as we use for virtio-net devices with DPDK and OVS.
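To make that flow a bit more concrete, the per-instance calls from nova-compute are roughly the following (simplified sketch; the lvstore, lvol and controller names are just illustrative, error handling is omitted, and older SPDK releases take a byte-sized "size" instead of "size_in_mib"):
```
from spdk.rpc.client import JSONRPCClient

client = JSONRPCClient("/var/tmp/spdk.sock")

# carve a logical volume for the instance's root disk out of the lvstore
client.call("bdev_lvol_create", {"lvs_name": "nova-lvs",
                                 "lvol_name": "instance-0001-root",
                                 "size_in_mib": 20 * 1024})

# expose it over vhost-user; the vhost target creates a socket named after
# the controller in its socket directory, and libvirt points a
# type='vhostuser' disk at that socket
client.call("vhost_create_blk_controller",
            {"ctrlr": "nova.instance-0001-root",
             "dev_name": "nova-lvs/instance-0001-root"})
```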
One of the requirements for this to work is that you have shared memory access, which means using file-backed memory, huge pages or memfd. The former two are already supported in nova, but https://review.opendev.org/c/openstack/nova-specs/+/951689 would make it much more user friendly for operators and end users.
On our end we used huge-pages-based shared memory and it's true that it was a bit problematic to use. Thanks for the heads-up on memfd; that will for sure be useful for SPDK-based instances and we will definitely look into it on our side.
Without using one of those alternative memory modes the VM will start, but the SPDK application will not be able to map the guest memory regions to implement its part of the vhost-user protocol.
From our experience, trying to run vhostuser devices without shared memory is prevented by libvirt; the following error is thrown: libvirt.libvirtError: unsupported configuration: 'vhostuser' requires shared memory
This method of exposing local NVMe storage allows for much better I/O performance than exposing it via the local filesystem.
If we were not to enable this natively in nova, another option may be to do this with cyborg. Cyborg has an SPDK driver for this reason today, not that it's maintained:
https://github.com/openstack/cyborg/tree/master/cyborg/accelerator/drivers/s...
But the nova enablement of using vhost-user for block devices was never completed, so perhaps we could enable both use cases at the same time or collaborate on doing this via cyborg.
One of the use cases I want to bring to cyborg going forward is the ability to attach additional local storage resources to a VM; I was initially thinking of an LVM driver and an NVMe namespace driver to supplement the SPDK and SSD drivers that already exist.
I am not very familiar with the cyborg component, but from what I understand this would entirely separate SPDK disks from nova's imagebackend: SPDK-based disks would be specified using something like accel:device_profile and it would only work for additional disks, not the root disk. Is this correct? Would doing it via cyborg be preferred? To be honest it would be easier for me to introduce the nova-based approach since we already have a version of it working in our environment, but if necessary I can look into how it would be done with cyborg.
We currently have the following working with this storage backend: creating, deleting, cold-migrating, shelving, snapshotting and unshelving VMs. We don't have live migration working yet but have it in our plans.
Live migration, I think, would only require pre-creating the relevant SPDK resources in pre_live_migration on the destination host, correct? It should be possible to generate stable vhost-user socket paths, but if not you would also need to update the live migration data object returned by pre_live_migration and extend the XML update logic to modify the paths when generating the migration XML.
As far as I am aware, QEMU/libvirt fully support live migration with SPDK and have for a long time.
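That XML rewrite step could be something along these lines (rough, untested sketch; the helper name and the per-disk path mapping are made up for illustration):
```
from lxml import etree

def rewrite_vhostuser_paths(domain_xml, new_socket_paths):
    """Replace vhost-user disk socket paths, keyed by target dev (e.g. 'vda')."""
    tree = etree.fromstring(domain_xml.encode())
    for disk in tree.findall("./devices/disk[@type='vhostuser']"):
        dev = disk.find("target").get("dev")
        if dev in new_socket_paths:
            disk.find("source").set("path", new_socket_paths[dev])
    return etree.tostring(tree).decode()
```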
I will look into this, thank you. With this approach, will data on the SPDK disk be the same as before the migration? Also, do you feel that live migration support is needed in the initial version of this feature?
This feature includes changes only to nova.
Description of Multiple ephemeral storage backend handling:
We add a new configuration option [libvirt]/supported_image_types to the nova.conf of nova-compute and change the meaning of [libvirt]/images_type to be the default image type.
We can rathole on the name, but we would probably not use supported_image_types as that can easily be confused with the format of the image.
I would generally propose deprecating [libvirt]/images_type
and introducing a preferentially ordered list of storage backends:
```
[libvirt]
storage_backend=spdk,rbd,qcow
```
i.e. spdk is the default, followed by rbd, followed by qcow.
I agree that this approach looks better. We kept images_type in our version since we didn't want to stray too far from upstream in our changes.
That would allow you to have a new flavor extra spec hw:storage_backends=qcow,rbd, meaning this flavor would prefer to select hosts with qcow storage first and rbd second, but not use spdk, e.g. because it does not use hugepages and would not work.
One problem I see with this approach is that placement does not currently allow alternatives in a request for allocation candidates. If we wanted either a qcow or rbd storage backend we would need two requests: one for DISK_GB on an RP with the trait COMPUTE_STORAGE_BACKEND_QCOW and another for DISK_GB on an RP with the trait COMPUTE_STORAGE_BACKEND_RBD. And if we included different storage backends for root, swap and ephemeral we would need a request for every combination. But besides that this could probably also work.
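To illustrate what I mean, without alternatives each acceptable backend needs its own allocation candidates request and the results have to be merged by the caller (rough sketch only, not how the scheduler actually builds its requests; the trait names are the hypothetical ones from above):
```
import urllib.parse

def allocation_candidates_query(disk_gb, backend_trait):
    # unsuffixed request group: all resources plus the required backend trait
    return urllib.parse.urlencode({
        "resources": "VCPU:2,MEMORY_MB:4096,DISK_GB:%d" % disk_gb,
        "required": backend_trait,
    })

# one GET /allocation_candidates per acceptable backend
for trait in ("COMPUTE_STORAGE_BACKEND_QCOW", "COMPUTE_STORAGE_BACKEND_RBD"):
    print("GET /allocation_candidates?" + allocation_candidates_query(50, trait))
```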
If the libvirt:image_type extra spec is specified in a flavor then the VM is scheduled on a compute with the appropriate image type in its [libvirt]/supported_image_types. We use potentially multiple DISK_GB resource providers and construct appropriate request groups to handle this scheduling.
Yeah, this is similar to what I said above; I just think image_type is a bit too generic and misleading, as it has nothing to do with the actual glance image, hence `hw:storage_backends` or just `hw:storage_backend`.
We also have a naming convention which includes not using the name of a software project in our extra specs,
so instead of cyborg we use accelerator, and we would not use libvirt: as the namespace for an extra spec.
As these affect how we virtualise the hardware presented to the VM, it would make sense to use the hw: prefix/namespace.
Using hw: would also technically allow other virt drivers to implement it in the future, in theory.
Yes, this would probably be a better name.
This spec is also used to fill the driver_info.libvirt.image_type field of this VM's BDMs with destination_type=local (driver_info is a new JSON serialized field of the BDM). Then, on the compute, if a BDM specifies this field we use its value instead of [libvirt]/images_type to decide on the imagebackend to use for it.
Do you envision allowing different disks to use different storage backends?
Nova can provision 3 types of disk today: the root disk, 0-N additional ephemeral disks and a swap disk.
Today we always use the same storage backend for all 3,
so if you have images_type=rbd the swap and ephemeral disks are also created on ceph.
The advantage of `hw:storage_backends=qcow,rbd` is that we could support the following: `hw:storage_backends=swap:qcow,ephemeral:spdk,root:rbd`.
So the root disk would be backed by an HA ceph cluster for fault tolerance of the host, enabling evacuation without data loss. The swap disk, which is rarely heavily used, would use qcow storage, providing a moderate level of performance that is appropriate for swap, with the ability to oversubscribe storage as a benefit since most of the time it won't be used. And finally you can have high-speed local scratch disk space backed by SPDK, all in one VM.
So allowing different disk types (swap, root, ephemeral) to come from different storage backends is one of the key reasons why we should eventually support multiple storage backends on a single host.
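Parsing such a value could be as simple as something like this (untested sketch, names purely illustrative):
```
def parse_storage_backends(extra_spec_value):
    """Return {disk_type: [backends in preference order]} for a
    hw:storage_backends value such as "swap:qcow,ephemeral:spdk,root:rbd"
    or a plain ordered list such as "qcow,rbd"."""
    disk_types = ("root", "ephemeral", "swap")
    per_disk = {d: [] for d in disk_types}
    for item in extra_spec_value.split(","):
        if ":" in item:
            disk, backend = item.split(":", 1)
            per_disk.setdefault(disk, []).append(backend)
        else:
            # a bare backend name applies to every disk type
            for d in disk_types:
                per_disk[d].append(item)
    return per_disk
```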
In our environment we don't really use additional ephemeral disks, and swap disks always used the same storage backend as the root disk. But with some work those changes should be able to support using different storage backends for different disks.
This method of handling multiple backends only works after administrators enable it by setting libvirt:image_type on all their flavors and running nova-manage commands that update existing VMs.
Well, we don't allow modifying the flavor via nova-manage today.
The official way to update existing instances is to resize; however, we do not currently support resizing between storage backends. I think if we were to do this work we would want to require that.
I don't think we would provide a nova-manage command for this, since nova-manage is not really intended to be used on the compute node. It can be, but the one command we added there is sort of problematic; that's a different topic though.
We did it this way since we wanted to allow reusing existing flavors, which for us were already storage-backend-restricted (but based on aggregates, not traits and BDM properties), with the new multibackend approach. So we would update flavors with the libvirt:image_type property (not using nova-manage), then run the nova-manage script which would update libvirt:image_type stored in instance.flavor if needed and then update bdm.driver_info.libvirt.image_type. This way we could "upgrade" to using multibackend without affecting customers. In general we found that instances without driver_info.libvirt.image_type set in their BDMs with destination_type=local cause some issues if we enable [libvirt]/supported_image_types. For example it would be impossible to tell the storage backend of a BDM in pre_migrate on the dest host, or instances would unshelve with an incorrect storage backend. This is why we decided that it would be better for setting driver_info.libvirt.image_type on all BDMs with destination_type=local to be a prerequisite to setting [libvirt]/supported_image_types to multiple values on any compute.
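In pseudo-code terms, the BDM part of that script is essentially the following (heavily simplified sketch; the real code goes through nova objects and handles errors, and image_type here is just the value looked up from the instance's flavor extra specs):
```
import json

def tag_local_bdms(bdms, image_type):
    """Set driver_info.libvirt.image_type on every local BDM (sketch only)."""
    for bdm in bdms:
        if bdm.destination_type != "local":
            continue
        # driver_info is stored as a JSON serialized string on the BDM
        driver_info = json.loads(bdm.driver_info or "{}")
        driver_info.setdefault("libvirt", {})["image_type"] = image_type
        bdm.driver_info = json.dumps(driver_info)
        bdm.save()
```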