4 Apr 2023, 10:39 a.m.
On Sat, Mar 25, 2023 at 12:27 AM Sean Mooney <smooney@redhat.com> wrote:
> I responded inline, but just a warning: this is a use case we have heard
> before. There is no simple option, I'm afraid, and there are many sharp
> edges and several little-known features/limitations that your question
> puts you right in the middle of.
>
> On Fri, 2023-03-24 at 16:28 +0100, Christian Rohmann wrote:
> > Hello OpenStack-discuss,
> >
> > I am currently looking into how one can provide fast ephemeral storage
> > (backed by local NVMe drives) to instances.
> >
> > There seem to be two approaches and I would love to double-check my
> > thoughts and assumptions.
> >
> > 1) *Via Nova* instance storage and the configurable "ephemeral" volume
> > for a flavor
> >
> > a) We currently use Ceph RBD as image_type
> > (https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.images_type),
> > so instance images are stored in Ceph, not locally on disk. I believe
> > this setting will also cause ephemeral volumes (destination_local) to be
> > placed on an RBD and not in /var/lib/nova/instances?
> It should be in Ceph, yes. We do not support having the root/swap/ephemeral
> disks use different storage locations.
> > Or is there a setting to set a different backend for local block devices
> > providing "ephemeral" storage? So RBD for the root disk and a local LVM
> > VG for ephemeral?
> No, that would be a new feature, and not a trivial one, as you would have
> to make sure it works for live migration and cold migration.
> >
> > b) Will an ephemeral volume also be migrated when the instance is
> > shut off, as with live migration?
> It should be. It's not included in snapshots, so it's not preserved when
> shelving; that means cross-cell cold migration will not preserve the disk.
>
> But for a normal cold migration it should be scp'd or rsynced along with
> the root disk if you are using the raw/qcow/flat image types, if I
> remember correctly. With RBD or other shared storage like NFS it really
> should be preserved.
>
> One other thing to note is that Ironic, and only Ironic, supports the
> preserve_ephemeral option in the rebuild API.
>
> libvirt will wipe the ephemeral disk if you rebuild or evacuate.
> > Or will there be a new volume created on the target host? I am asking
> > because I want to avoid syncing 500G or 1T when it's only "ephemeral"
> > and the instance will not expect any data on it on the next boot.
> I would personally consider it a bug if it was not transferred, though
> that does not mean it could not change in the future.
> This is very virt-driver-specific behaviour, by the way, and not one that
> is particularly well documented.
> The ephemeral disk should mostly exist for the lifetime of an instance,
> not the lifetime of a VM.
>
> For example, it should not get recreated via a simple reboot or live
> migration, and it should not get recreated for cold migration or resize,
> but it will get wiped for shelve_offload, cross-cell resize and evacuate.
>
> > c) Is the size of the ephemeral storage for flavors a fixed size or just
> > the upper bound for users? So if I limit this to 1T, will such a flavor
> > always provision a block device with this size?
> flavor.ephemeral_gb is an upper bound, and end users can divide that
> between multiple ephemeral disks on the same instance. So if it's 100G you
> can ask for two 50G ephemeral disks.
>
> You specify the topology of the ephemeral disks using the
> block_device_mapping_v2 parameter on the server create.
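For illustration, here is a rough (untested) sketch of what such a raw
block_device_mapping_v2 request could look like via openstacksdk; the cloud,
flavor, image and network names are placeholders, and the flavor is assumed
to have ephemeral_gb=100:

    # Untested sketch: ask for two 50G blank local (ephemeral) disks instead
    # of one 100G disk. All names/IDs below are placeholders.
    import openstack

    conn = openstack.connect(cloud="mycloud")              # entry from clouds.yaml

    flavor = conn.compute.find_flavor("m1.ephemeral100")   # flavor with ephemeral_gb=100
    image = conn.image.find_image("ubuntu-22.04")
    network = conn.network.find_network("private")

    server = conn.compute.create_server(
        name="ephemeral-test",
        flavor_id=flavor.id,
        image_id=image.id,
        networks=[{"uuid": network.id}],
        # openstacksdk sends this as block_device_mapping_v2 in the API request
        block_device_mapping=[
            {"source_type": "blank", "destination_type": "local",
             "guest_format": "ext4", "volume_size": 50, "boot_index": -1},
            {"source_type": "blank", "destination_type": "local",
             "guest_format": "vfat", "volume_size": 50, "boot_index": -1},
        ],
    )
    conn.compute.wait_for_server(server)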
> This has been automated in recent versions of the openstack client, so
> you can do:
>
>   openstack server create --ephemeral size=50,format=ext4 --ephemeral size=50,format=vfat ...
>
> https://docs.openstack.org/python-openstackclient/latest/cli/command-objects/server.html#cmdoption-openstack-server-create-ephemeral
>
> This is limited by
> https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.max_local_block_devices
> >
> > I suppose using LVM this will be thin provisioned anyways?
> To use the LVM backend with libvirt you set
> https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.images_volume_group
> to identify which LVM VG to use.
>
> https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.sparse_logical_volumes
> might enable thin provisioning, or it might work without it, but see the
> note:
>
> """
> Warning
>
> This option is deprecated for removal since 18.0.0. Its value may be
> silently ignored in the future.
>
> Reason
>
> Sparse logical volumes is a feature that is not tested hence not
> supported. LVM logical volumes are preallocated by default. If you want
> thin provisioning, use Cinder thin-provisioned volumes.
> """
>
> The Nova LVM support has been in maintenance mode for many years.
>
> I'm not opposed to improving it, just calling out that it has bugs and no
> one has really worked on addressing them in 4 or 5 years, which is sad
> because it outperforms raw for local storage, and if thin provisioning
> still works it should outperform qcow too for a similar use case.
>
> You are well into undefined-behaviour land at this point, however: we do
> not test it, so we assume until told otherwise that it is broken.
> >
> > 2) *Via Cinder*, running cinder-volume on each compute node to provide a
> > volume type "ephemeral", using e.g. the LVM driver
> >
> > a) While not really "ephemeral" and bound to the instance lifecycle,
> > this would allow users to provision ephemeral volumes just as they need
> > them.
> > I suppose I could use backend-specific quotas
> > (https://docs.openstack.org/cinder/latest/cli/cli-cinder-quotas.html#view-block-storage-quotas)
> > to limit the number and size of such volumes?
> >
> > b) Do I need to use the instance locality filter
> > (https://docs.openstack.org/cinder/latest/contributor/api/cinder.scheduler.filters.instance_locality_filter.html)
> > then?
> That is an option, but not ideal, since it still means connecting to the
> volume via iSCSI or NVMe-oF even if it's effectively via localhost, so you
> still have the network-layer overhead.

I haven't tried it so I'm not 100% sure it works, but we do support local
attach with the RBD connector. While creating the connector object, we can
pass "do_local_attach"=True [1] and that should do a local attach when we
call connect_volume for the RBD connector [2].

From a quick search, I can see all the consumers of this code are:
1) cinderlib [3]
2) nova hyperv driver [4]
3) python-brick-cinderclient-ext [5]
4) freezer [6]
5) zun [7]

It's interesting to see a project called compute-hyperv [8] (similar to
nova's hyperv driver) using it as well. Not sure why it was created
separately though.
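For anyone who wants to experiment, here is a rough (untested) sketch of
driving that local-attach path directly from os-brick, modelled on the
consumers listed above; the root helper and all connection-property values
are placeholders (in practice they come from Cinder's initialize_connection
response for the volume):

    # Untested sketch: map an RBD volume locally (via `rbd map`) instead of
    # attaching it over the network path. Placeholder values throughout.
    from os_brick.initiator import connector

    conn = connector.InitiatorConnector.factory(
        "rbd",                 # protocol
        None,                  # root_helper, e.g. "sudo cinder-rootwrap /etc/cinder/rootwrap.conf"
        do_local_attach=True,  # the flag discussed above [1]
    )

    connection_properties = {
        "name": "volumes/volume-<uuid>",  # "<pool>/<image>" as reported by the Ceph backend
        "hosts": ["192.0.2.10"],          # monitor addresses (placeholders)
        "ports": ["6789"],
        "auth_enabled": True,
        "auth_username": "cinder",
        "keyring": None,
    }

    device_info = conn.connect_volume(connection_properties)
    # With do_local_attach set, device_info should describe a local /dev/rbd*
    # block device rather than an in-process librbd attachment.
    print(device_info)

    conn.disconnect_volume(connection_properties, device_info)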
[1] https://opendev.org/openstack/os-brick/src/commit/28ffcdbfa138859859beca2f80627c076269be56/os_brick/initiator/connectors/rbd.py
[2] https://opendev.org/openstack/os-brick/src/commit/28ffcdbfa138859859beca2f80627c076269be56/os_brick/initiator/connectors/rbd.py#L263-L267
[3] https://opendev.org/openstack/cinderlib/src/commit/9c37686f358e9228446cd85e19db26a56b2f9cbe/cinderlib/objects.py#L779
[4] https://opendev.org/openstack/nova/src/commit/29de62bf3b3bf5eda8986bc94babf1c94d67bd4e/nova/virt/hyperv/volumeops.py#L378
[5] https://opendev.org/openstack/python-brick-cinderclient-ext/src/branch/master/brick_cinderclient_ext/client.py
[6] https://opendev.org/openstack/freezer/src/commit/5effc1382833ad111249bcd279b11fbe95e10a6b/freezer/engine/osbrick/client.py#L78
[7] https://opendev.org/openstack/zun/src/commit/0288a4517846d07ee5724f86ebe34e364dc2bbe9/zun/volume/cinder_workflow.py#L60-L61
[8] https://opendev.org/openstack/compute-hyperv/src/commit/4393891fa8356aa31b13bd57cf96cb5109acc7c3/compute_hyperv/nova/volumeops.py#L780

> When I last brought up this topic in a different context, the alternative
> to Cinder and Nova was to add an LVM Cyborg driver so that it could
> partition local NVMe devices and expose them to a guest. But I never wrote
> that, and I don't think anyone else has.
>
> If you had a slightly different use case, such as providing an entire NVMe
> or SATA device to a guest, then Cyborg would be how you would do that.
> Nova PCI passthrough is not an option, as it is not multi-tenant safe: it
> is exclusively for stateless devices, not disks, so we do not have a way
> to erase the data when done. Cyborg, with its driver model, can fulfil the
> multi-tenancy requirement. We have previously rejected adding this
> capability into Nova, so I don't expect us to add it any time in the near
> to medium term.
>
> We are trying to keep Nova device management stateless-only. That said, we
> added Intel PMEM/NVDIMM support to Nova and did handle both optional data
> transfer and multi-tenancy, but that was a non-trivial amount of work.
> >
> > c) Since a volume will always be bound to a certain host, I suppose
> > this will cause side effects to instance scheduling?
> > With the volume remaining after an instance has been destroyed
> > (defeating the purpose of it being "ephemeral"), I suppose any other
> > instance attaching this volume will be scheduled on this very machine?
> No, Nova would have no knowledge about the volume locality out of the box.
> > Is there any way around this? Maybe
> > a driver setting to have such volumes "self-destroy" if they are not
> > attached anymore?
> We hate those kinds of config options. Nova would not know that the volume
> is bound to the host at the scheduler level, and we would not really want
> to add orchestration logic like that for "something it's OK to delete our
> tenants' data".
> By default today, if you cold/live migrated, the VM would move but the
> volume would not, and you would end up accessing it remotely.
>
> You would then have to do a volume migration separately in Cinder, I think.
> > d) Same question as with Nova: What happens when an instance is
> > live-migrated?
> I think I answered this above?
> >
> > Maybe others also have this use case and you can share your solution(s)?
> Adding a Cyborg driver for LVM storage and integrating that with Nova
> would likely be the simplest option.
>
> You could extend Nova, but as I said, we have rejected that in the past.
> That said, the generic resource table we added for PMEM was made generic
> so that future resources like local block devices could be tracked there
> without DB changes.
>
> Supporting different image_type backends for root, swap and ephemeral
> would be possible. It's an invasive change, but it might be more natural
> than the resource-table approach. You could reuse more of the code and
> inherit much of the existing functionality, but making sure you don't
> break anything in the process would take a lot of testing.
> > Thanks and with regards,
> >
> > Christian
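One more note on 2 b)/c) above: if you do go the Cinder route with
cinder-volume on each compute node, the InstanceLocalityFilter is driven by
a scheduler hint on volume create. A rough (untested) sketch with
python-cinderclient follows; the auth details, volume type and instance UUID
are placeholders, and the filter has to be enabled in the cinder-scheduler
filter list for the hint to have any effect:

    # Untested sketch: create a volume co-located with a given instance by
    # passing the local_to_instance scheduler hint (see the
    # InstanceLocalityFilter docs linked above). Credentials/IDs are placeholders.
    from cinderclient import client as cinder_client
    from keystoneauth1 import loading, session

    loader = loading.get_plugin_loader("password")
    auth = loader.load_from_options(
        auth_url="https://keystone.example:5000/v3",
        username="demo", password="secret",
        project_name="demo",
        user_domain_id="default", project_domain_id="default",
    )
    cinder = cinder_client.Client("3", session=session.Session(auth=auth))

    volume = cinder.volumes.create(
        size=500,
        name="scratch-for-my-instance",
        volume_type="ephemeral",  # the LVM-backed type from approach 2)
        scheduler_hints={"local_to_instance": "<instance-uuid>"},
    )

As Sean points out above, the result is still an ordinary Cinder volume, so
it will not follow the instance on migration and is not cleaned up when the
instance is deleted; that part you would have to handle yourself.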