[nova][cinder] Providing ephemeral storage to instances - Cinder or Nova

Sean Mooney smooney at redhat.com
Mon Mar 27 12:20:04 UTC 2023


On Mon, 2023-03-27 at 10:47 +0200, Christian Rohmann wrote:
> Thanks for your extensive reply Sean!
> 
> I also replied inline and would love to continue the conversation with
> you and others with this use case
> to find the best / most suitable approach.
> 
> 
> On 24/03/2023 19:50, Sean Mooney wrote:
> > I responded inline, but just a warning: this is a use case we have heard before.
> > there is no simple option, I'm afraid, and there are many, many sharp edges
> > and several little-known features/limitations that your question puts you right in the
> > middle of.
> > 
> > On Fri, 2023-03-24 at 16:28 +0100, Christian Rohmann wrote:
> > > 1) *Via Nova* instance storage and the configurable "ephemeral" volume
> > > for a flavor
> > > 
> > > a) We currently use Ceph RBD as the image_type
> > > (https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.images_type),
> > > so instance images are stored in Ceph, not locally on disk. I believe
> > > this setting will also cause ephemeral volumes (destination_local) to be
> > > placed on an RBD and not in /var/lib/nova/instances?
> > it should be in Ceph, yes. we do not support having the root/swap/ephemeral
> > disks use different storage locations.
> > > Or is there a setting to set a different backend for local block devices
> > > providing "ephemeral" storage? So RBD for the root disk and a local LVM
> > > VG for ephemeral?
> > no, that would be a new feature, and not a trivial one, as you would have to make
> > sure it works for live migration and cold migration.
> 
> While having the root disk on resilient storage, using local storage for
> swap / ephemeral actually seems quite obvious.
> Do you happen to know if there ever was a spec / push to implement this?
As far as I am aware, no.

But if we were to have one, I would do it as an API option, basically the inverse of
https://specs.openstack.org/openstack/nova-specs/specs/xena/approved/allow-migrate-pmem-data.html

For PMEM instances we default to not copying the possibly multiple TB of PMEM over the network on cold migrate;
later we added that option as an API parameter.

Swap is not copied for cold migration today, but it is for live migration, for obvious reasons.

Like the “copy_pmem_devices”: “true” option, I would be fine with adding
“copy_ephemeral_devices”: “true|false”.

We would probably need to default to copying the data, but we could discuss that in the spec.
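Purely as a sketch of what such a parameter could look like on the existing cold-migrate action (to be clear, this does not exist today, and the exact name and placement would be for the spec to decide):

POST /servers/{server_id}/action
{
    "migrate": {
        "copy_ephemeral_devices": false
    }
}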


> 
> 
> > > b) Will an ephemeral volume also be migrated when the instance is
> > > shutoff as with live-migration?
> > it should be. it's not included in snapshots, so it's not preserved
> > when shelving. that means cross-cell cold migration will not preserve the disk.
> > 
> > but for a normal cold migration it should be scp'd or rsynced along with the root disk
> > if you are using the raw/qcow/flat images_type, if I remember correctly.
> > with RBD or other shared storage like NFS it really should be preserved.
> > 
> > one other thing to note is that ironic, and only ironic, supports the
> > preserve_ephemeral option in the rebuild API.
> > 
> > libvirt will wipe the ephemeral disk if you rebuild or evacuate.
> 
> Could I somehow configure a flavor to "require" a rebuild / evacuate or  
> to disable live migration for it?
Rebuild is not a move operation, so that won't help you move the instance, and evacuate is admin-only and requires
you to ensure the instance is not running before it is used.
Disabling live migration is something you can do via custom policy, but it's admin-only by default as well.
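For reference, the relevant rule in policy.yaml would be something along these lines (the exact default rule string varies by release, and note this is global, not per-flavor):

# live migration is already restricted to admins by default
"os_compute_api:os-migrate-server:migrate_live": "rule:context_is_admin"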
> 
> 
> > > Or will there be a new volume created on the target host? I am asking
> > > because I want to avoid syncing 500G or 1T when it's only "ephemeral"
> > > and the instance will not expect any data on it on the next boot.
> > I would personally consider it a bug if it was not transferred,
> > though that does not mean that could not change in the future.
> > this is a very virt-driver-specific behaviour, by the way, and not one that is particularly well documented.
> > the ephemeral disk should mostly exist for the lifetime of an instance, not the lifetime of a VM.
> > 
> > for example, it should not get recreated via a simple reboot or live migration,
> > and it should not get recreated for cold migration or resize,
> > but it will get wiped for shelve_offload, cross-cell resize and evacuate.
> So even for cold migration it would be preserved then? So my only option 
> would be to shelve such instances when trying to
> "move" instances off a certain hypervisor while NOT syncing ephemeral 
> storage?
Yes, shelve would be your only option today.
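i.e. roughly:

openstack server shelve <server>
# later, the unshelve will land wherever the scheduler picks
openstack server unshelve <server>

keeping in mind the ephemeral disk contents are not preserved across shelve/unshelve, which sounds like what you want anyway.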
> 
> 
> > > c) Is the size of the ephemeral storage for flavors a fixed size or just
> > > the upper bound for users? So if I limit this to 1T, will such a flavor
> > > always provision a block device with this size?
> > flavor.ephemeral_gb is an upper bound, and end users can divide that between multiple ephemeral disks
> > on the same instance.  so if it's 100G you can ask for 2 50G ephemeral disks.
> > 
> > you specify the topology of the ephemeral disks using the block_device_mapping_v2 parameter on the server
> > create.
> > this has been automated in recent versions of the openstack client
> > 
> > so you can do
> > 
> > openstack server create --ephemeral size=50,format=ext4 --ephemeral size=50,format=vfat ...
> > 
> > https://docs.openstack.org/python-openstackclient/latest/cli/command-objects/server.html#cmdoption-openstack-server-create-ephemeral
> > this is limited by
> > https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.max_local_block_devices
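(As an aside, at the API level those --ephemeral flags just become "blank" block_device_mapping_v2 entries, roughly like this, if I recall the field names correctly:

"block_device_mapping_v2": [
    {"source_type": "blank", "destination_type": "local",
     "guest_format": "ext4", "volume_size": 50, "boot_index": -1},
    {"source_type": "blank", "destination_type": "local",
     "guest_format": "vfat", "volume_size": 50, "boot_index": -1}
]
)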
> > 
> > > I suppose using LVM this will be thin provisioned anyways?
> > to use the lvm backend with libvirt you set
> > https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.images_volume_group
> > to identify which lvm VG to use.
> > 
> > https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.sparse_logical_volumes might enable thin provisioning, or it might
> > work without it, but see the note
> > 
> > """
> > Warning
> > 
> > This option is deprecated for removal since 18.0.0. Its value may be silently ignored in the future.
> > 
> > Reason
> > 
> >      Sparse logical volumes is a feature that is not tested hence not supported. LVM logical volumes are preallocated by default. If you want thin
> > provisioning, use Cinder thin-provisioned volumes.
> > """
> > 
> > the nova LVM support has been in maintenance mode for many years.
> > 
> > I'm not opposed to improving it, just calling out that it has bugs and no one has really
> > worked on addressing them in 4 or 5 years, which is sad because it outperforms raw for local
> > storage performance, and if thin provisioning still works it should outperform qcow too for a similar use case.
> > 
> > you are well into undefined-behaviour land at this point, however.
> > 
> > we do not test it, so we assume until told otherwise that it's broken.
> 
> Thanks for the heads up. I looked at LVM for cinder and there LVM 
> volumes are thin provisioned,
> so I figured this might be the case for Nova as well.
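For reference, selecting the nova-side LVM backend is just the two options below (the VG name here is illustrative, and again: this backend is not well tested):

[libvirt]
images_type = lvm
images_volume_group = nova-ephemeral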
> 
> 
> > > 2) *Via Cinder*, running cinder-volume on each compute node to provide a
> > > volume type "ephemeral", using e.g. the LVM driver
> > > 
> > > a) While not really "ephemeral" and bound to the instance lifecycle,
> > > this would allow users to provision ephemeral volume just as they need them.
> > > I suppose I could use backend specific quotas
> > > (https://docs.openstack.org/cinder/latest/cli/cli-cinder-quotas.html#view-block-storage-quotas)
> > > to
> > > limit the number or size of such volumes?
> > > 
> > > b) Do I need to use the instance locality filter
> > > (https://docs.openstack.org/cinder/latest/contributor/api/cinder.scheduler.filters.instance_locality_filter.html)
> > > then?
> > That is an option, but not ideal, since it still means connecting to the volume via iSCSI or NVMe-oF even if it's effectively via localhost,
> > so you still have the network layer overhead.
> 
> Thanks for the hint - one can easily be confused when reading "LVM" ...
> I actually thought there was a way to have "host-only" style volumes which
> are simply local block devices with no iSCSI / NVMe in between, which are
> then used by Nova.
> 
> These kinds of volumes could maybe be built into cinder as a
> "target_protocol: local" together with the instance_locality_filter -
> but apparently for now the only way is through iSCSI or NVMe.
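Right, there is no "local" target protocol today. For completeness, the closest you can get is the LVM backend plus the locality filter, roughly like this (backend and VG names are illustrative):

[DEFAULT]
enabled_backends = lvm-local
scheduler_default_filters = AvailabilityZoneFilter,CapacityFilter,CapabilitiesFilter,InstanceLocalityFilter

[lvm-local]
volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver
volume_group = cinder-volumes
target_protocol = iscsi
target_helper = lioadm

and then pass the local_to_instance=<server uuid> scheduler hint when creating the volume. the data path is still iSCSI, just over localhost.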
> 
> 
> > when I last brought up this topic in a different context, the alternative to cinder and nova was to add an LVM cyborg driver
> > so that it could partition local NVMe devices and expose them to a guest. but I never wrote that, and I don't think anyone else has.
> > if you had a slightly different use case, such as providing an entire NVMe or SATA device to a guest, then cyborg would be how you would do
> > that. nova PCI passthrough is not an option as it is not multi-tenant safe: it's exclusively for stateless devices, not disks, so we do not
> > have a way to erase the data when done. cyborg, with its driver model, can fulfil the multi-tenancy requirement.
> > we have previously rejected adding this capability into nova, so I don't expect us to add it any time in the near to medium term.
> 
> This sounds like a "3rd" approach: Using Cyborg to provide local storage 
> (via LVM).

Yes, cyborg would be a third approach.
I was going to enable this in a new project I was calling Arbiterd, but that proposal was rejected at the last PTG, so
I currently have no plans to enable local block device management.
> 
> 
> > > c)  Since a volume will always be bound to a certain host, I suppose
> > > this will cause side-effects to instance scheduling?
> > > With the volume remaining after an instance has been destroyed (beating
> > > the purpose of it being "ephemeral") I suppose any other instance
> > > attaching this volume will
> > > be scheduled on this very machine?
> > > 
> > no, nova would have no knowledge of the volume's locality out of the box
> > >   Is there any way around this? Maybe
> > > a driver setting to have such volumes "self-destroy" if they are not
> > > attached anymore?
> > we hate those kinds of config options. nova would not know that it's bound to the host at the scheduler level, and
> > we would not really want to add orchestration logic along the lines of "it's OK to delete our tenants' data" for something like this.
> > by default today, if you cold/live migrated, the VM would move but the volume would not, and you would end up accessing it remotely.
> > 
> > you would have to then do a volume migration separately in cinder, I think.
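(if memory serves, that is something like:

openstack volume migrate --host <target-host>@<backend>#<pool> <volume>

but check the cinder docs for the exact host string format.)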
> > > d) Same question as with Nova: What happens when an instance is
> > > live-migrated?
> > > 
> > I think I answered this above?
> 
> Yes, these questions were all due to my misconception that the
> cinder-volume "LVM" backend did not have any networking layer
> and was host-local.
> 
> 
> > > 
> > > Maybe others also have this use case and you can share your solution(s)?
> > adding a cyborg driver for LVM storage and integrating that with nova would likely be the simplest option
> > 
> > you could extend nova, but as I said, we have rejected that in the past.
> > that said, the resource table we added for PMEM was made generic so that future resources like local block
> > devices could be tracked there without DB changes.
> > 
> > supporting different image_type backends for root, swap and ephemeral would be possible.
> > it's an invasive change, but it might be more natural than the resource table approach.
> > you could reuse more of the code and inherit much of the existing functionality, but making sure you don't break
> > anything in the process would take a lot of testing.
> 
> Thanks for the sum up!
I think your two best options are: first, add the parameter to the migrate/resize APIs to skip copying the ephemeral disks;
and second, propose a replacement for
https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.images_type
These should be separate specs.

The latter would work like how we support generic mdevs, using dynamic config sections,

i.e.:

[libvirt]
storage_profiles=swap:swap_storage,ephemeral:ephemeral_storage,root:root_storage

[swap_storage]
driver=raw
driver_data=/mnt/nvme-swap/nova/

[ephemeral_storage]
driver=lvm
driver_data=vg_ephemeral

[root_storage]
driver=rbd
driver_data=vms

We would have to work this out in a spec, but if nova was ever to support something like this in the future,
I think we would model it along those lines.

I'm not sure how popular this would be, however, so we would need to get input from the wider nova team.
I do see value in being able to have different storage profiles for root_gb, ephemeral_gb and swap_gb in the flavor.

But the last time something like this was discussed was around the creation of a cinder images_type backend
to allow for automatic boot-from-volume.
I actually think that would be a nice feature too, but it's complex, and because both were discussed around the same
time, neither got done.

> 
> 
> 
> 
> Regards
> 
> 
> Christian
> 



