<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Mar 25, 2023 at 12:27 AM Sean Mooney <<a href="mailto:smooney@redhat.com">smooney@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">i responed in line but just a waring this is a usecase we ahve heard before.<br>
there is no simple option im afraid and there are many many sharp edges<br>
and severl littel know features/limitatiosn that your question puts you right in the<br>
middel of.<br>
<br>
On Fri, 2023-03-24 at 16:28 +0100, Christian Rohmann wrote:<br>
> Hello OpenStack-discuss,<br>
> <br>
> I am currently looking into how one can provide fast ephemeral storage <br>
> (backed by local NVME drives) to instances.<br>
> <br>
> <br>
> There seem to be two approaches and I would love to double-check my <br>
> thoughts and assumptions.<br>
> <br>
> 1) *Via Nova* instance storage and the configurable "ephemeral" volume <br>
> for a flavor<br>
> <br>
> a) We currently use Ceph RBD as image_type <br>
> (<a href="https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.images_type" rel="noreferrer" target="_blank">https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.images_type</a>), <br>
> so instance images are stored in Ceph, not locally on disk. I believe <br>
> this setting will also cause ephemeral volumes (destination_local) to be <br>
> placed on a RBD and not /var/lib/nova/instances?<br>
It should be in Ceph, yes. We do not support having the root/swap/ephemeral<br>
disks use different storage locations.<br>
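For reference, the relevant nova.conf section for the RBD-backed image_type looks roughly like this (the pool name, cephx user and secret UUID below are just placeholders, not values from any particular deployment):<br>
<br>
[libvirt]<br>
images_type = rbd<br>
images_rbd_pool = vms                      # placeholder pool name<br>
images_rbd_ceph_conf = /etc/ceph/ceph.conf<br>
rbd_user = cinder                          # placeholder cephx user<br>
rbd_secret_uuid = 00000000-0000-0000-0000-000000000000   # placeholder libvirt secret<br>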
> Or is there a setting to set a different backend for local block devices <br>
> providing "ephemeral" storage? So RBD for the root disk and a local LVM <br>
> VG for ephemeral?<br>
No, that would be a new feature, and not a trivial one, as you would have to make<br>
sure it works for live migration and cold migration.<br>
<br>
> <br>
> b) Will an ephemeral volume also be migrated when the instance is <br>
> shutoff as with live-migration?<br>
It should be. It is not included in snapshots, so it is not preserved<br>
when shelving; that means cross-cell cold migration will not preserve the disk.<br>
<br>
But for a normal cold migration it should be scp'd or rsynced along with the root disk<br>
if you are using the raw/qcow/flat images_type, if I remember correctly.<br>
With RBD or other shared storage like NFS it really should be preserved.<br>
<br>
One other thing to note is that ironic, and only ironic, supports the<br>
preserve_ephemeral option in the rebuild API.<br>
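For completeness, preserve_ephemeral is just a boolean in the rebuild action body of the compute API, roughly like the sketch below (the server and image UUIDs are placeholders); as noted, only the ironic driver honours it:<br>
<br>
POST /v2.1/servers/SERVER_UUID/action<br>
{"rebuild": {"imageRef": "IMAGE_UUID", "preserve_ephemeral": true}}<br>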
<br>
The libvirt driver will wipe the ephemeral disk if you rebuild or evacuate.<br>
> Or will there be an new volume created on the target host? I am asking <br>
> because I want to avoid syncing 500G or 1T when it's only "ephemeral" <br>
> and the instance will not expect any data on it on the next boot.<br>
I would personally consider it a bug if it was not transferred,<br>
though that does not mean it could not change in the future.<br>
This is very virt-driver-specific behaviour, by the way, and not one that is particularly well documented.<br>
The ephemeral disk should mostly exist for the lifetime of the instance, not the lifetime of a VM.<br>
<br>
For example, it should not get recreated via a simple reboot or live migration,<br>
and it should not get recreated for cold migration or resize,<br>
but it will get wiped for shelve_offload, cross-cell resize and evacuate.<br>
> <br>
> c) Is the size of the ephemeral storage for flavors a fixed size or just <br>
> the upper bound for users? So if I limit this to 1T, will such a flavor <br>
> always provision a block device with this size?<br>
flavor.ephemeral_gb is an upper bound, and end users can divide that between multiple ephemeral disks<br>
on the same instance. So if it is 100G you can ask for two 50G ephemeral disks.<br>
<br>
You specify the topology of the ephemeral disks using the block_device_mapping_v2 parameter on the server<br>
create call.<br>
This has been automated in recent versions of the openstack client,<br>
<br>
so you can do<br>
<br>
openstack server create --ephemeral size=50,format=ext4 --ephemeral size=50,format=vfat ...<br>
<br>
<a href="https://docs.openstack.org/python-openstackclient/latest/cli/command-objects/server.html#cmdoption-openstack-server-create-ephemeral" rel="noreferrer" target="_blank">https://docs.openstack.org/python-openstackclient/latest/cli/command-objects/server.html#cmdoption-openstack-server-create-ephemeral</a><br>
This is limited by<br>
<a href="https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.max_local_block_devices" rel="noreferrer" target="_blank">https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.max_local_block_devices</a><br>
<br>
> <br>
> I suppose using LVM this will be thin provisioned anyways?<br>
To use the LVM backend with libvirt you set<br>
<a href="https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.images_volume_group" rel="noreferrer" target="_blank">https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.images_volume_group</a><br>
to identify which LVM VG to use.<br>
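So a minimal nova.conf sketch for a local-LVM ephemeral backend on the compute node would be something like this (the VG name is a placeholder for a volume group you pre-create on the local NVMe drives):<br>
<br>
[libvirt]<br>
images_type = lvm<br>
images_volume_group = nova-local<br>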
<br>
<a href="https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.sparse_logical_volumes" rel="noreferrer" target="_blank">https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.sparse_logical_volumes</a> might enable thin provsion or it might<br>
work without it but see the note<br>
<br>
""" <br>
Warning<br>
<br>
This option is deprecated for removal since 18.0.0. Its value may be silently ignored in the future.<br>
<br>
Reason<br>
<br>
Sparse logical volumes is a feature that is not tested hence not supported. LVM logical volumes are preallocated by default. If you want thin<br>
provisioning, use Cinder thin-provisioned volumes.<br>
"""<br>
<br>
The nova LVM support has been in maintenance mode for many years.<br>
<br>
I'm not opposed to improving it, just calling out that it has bugs and no one has really<br>
worked on addressing them in 4 or 5 years, which is sad because it outperforms raw for local<br>
storage performance, and if thin provisioning still works it should outperform qcow too for a similar use case.<br>
<br>
You are well into undefined-behaviour land at this point, however.<br>
<br>
We do not test it, so we assume, until told otherwise, that it is broken.<br>
<br>
<br>
> <br>
> <br>
> 2) *Via Cinder*, running cinder-volume on each compute node to provide a <br>
> volume type "ephemeral", using e.g. the LVM driver<br>
> <br>
> a) While not really "ephemeral" and bound to the instance lifecycle, <br>
> this would allow users to provision ephemeral volumes just as they need them.<br>
> I suppose I could use backend specific quotas <br>
> (<a href="https://docs.openstack.org/cinder/latest/cli/cli-cinder-quotas.html#view-block-storage-quotas" rel="noreferrer" target="_blank">https://docs.openstack.org/cinder/latest/cli/cli-cinder-quotas.html#view-block-storage-quotas</a>) <br>
> to<br>
> limit the number or size of such volumes?<br>
> <br>
> b) Do I need to use the instance locality filter <br>
> (<a href="https://docs.openstack.org/cinder/latest/contributor/api/cinder.scheduler.filters.instance_locality_filter.html" rel="noreferrer" target="_blank">https://docs.openstack.org/cinder/latest/contributor/api/cinder.scheduler.filters.instance_locality_filter.html</a>) <br>
> then?<br>
<br>
That is an option, but not ideal, since it still means connecting to the volume via iSCSI or NVMe-oF, even if it is effectively via localhost,<br>
so you still have the network-layer overhead.<br>
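Roughly, the cinder side of that setup would be a per-compute LVM backend plus the local_to_instance scheduler hint, assuming the InstanceLocalityFilter is enabled in the cinder scheduler (the backend, VG and volume-type names below are placeholders):<br>
<br>
# cinder.conf on each compute node<br>
[DEFAULT]<br>
enabled_backends = ephemeral-lvm<br>
<br>
[ephemeral-lvm]<br>
volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver<br>
volume_group = cinder-ephemeral<br>
volume_backend_name = ephemeral<br>
target_protocol = iscsi<br>
target_helper = lioadm<br>
<br>
# request a volume on the same host as an existing instance<br>
openstack volume create --size 500 --type ephemeral --hint local_to_instance=INSTANCE_UUID scratch<br>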
<br></blockquote><div><br></div><div>I haven't tried it so I'm not 100% sure if it works but we do support local attach with the RBD connector.</div><div>While creating the connector object, we can pass "do_local_attach"=True[1] and that should do local attach when we call</div><div>connect volume for the RBD connector.</div><div>From a quick search, I can see all the consumers of this code are:</div><div>1) cinderlib[3]</div><div>2) nova hyperv driver[4]</div><div>3) python-brick-cinderclient-ext[5]</div><div>4) freezer[6]</div><div>5) zun[7]</div><div><br></div><div>It's interesting to see a project called compute-hyperv[8] (similar to nova's hyperv driver) using it as well. Not sure why it's created separately though.</div><div><br></div><div>[1] <a href="https://opendev.org/openstack/os-brick/src/commit/28ffcdbfa138859859beca2f80627c076269be56/os_brick/initiator/connectors/rbd.py">https://opendev.org/openstack/os-brick/src/commit/28ffcdbfa138859859beca2f80627c076269be56/os_brick/initiator/connectors/rbd.py</a></div><div>[2] <a href="https://opendev.org/openstack/os-brick/src/commit/28ffcdbfa138859859beca2f80627c076269be56/os_brick/initiator/connectors/rbd.py#L263-L267">https://opendev.org/openstack/os-brick/src/commit/28ffcdbfa138859859beca2f80627c076269be56/os_brick/initiator/connectors/rbd.py#L263-L267</a></div><div>[3] <a href="https://opendev.org/openstack/cinderlib/src/commit/9c37686f358e9228446cd85e19db26a56b2f9cbe/cinderlib/objects.py#L779">https://opendev.org/openstack/cinderlib/src/commit/9c37686f358e9228446cd85e19db26a56b2f9cbe/cinderlib/objects.py#L779</a></div><div>[4] <a href="https://opendev.org/openstack/nova/src/commit/29de62bf3b3bf5eda8986bc94babf1c94d67bd4e/nova/virt/hyperv/volumeops.py#L378">https://opendev.org/openstack/nova/src/commit/29de62bf3b3bf5eda8986bc94babf1c94d67bd4e/nova/virt/hyperv/volumeops.py#L378</a></div><div>[5] <a href="https://opendev.org/openstack/python-brick-cinderclient-ext/src/branch/master/brick_cinderclient_ext/client.py">https://opendev.org/openstack/python-brick-cinderclient-ext/src/branch/master/brick_cinderclient_ext/client.py</a></div><div>[6] <a href="https://opendev.org/openstack/freezer/src/commit/5effc1382833ad111249bcd279b11fbe95e10a6b/freezer/engine/osbrick/client.py#L78">https://opendev.org/openstack/freezer/src/commit/5effc1382833ad111249bcd279b11fbe95e10a6b/freezer/engine/osbrick/client.py#L78</a></div><div>[7] <a href="https://opendev.org/openstack/zun/src/commit/0288a4517846d07ee5724f86ebe34e364dc2bbe9/zun/volume/cinder_workflow.py#L60-L61">https://opendev.org/openstack/zun/src/commit/0288a4517846d07ee5724f86ebe34e364dc2bbe9/zun/volume/cinder_workflow.py#L60-L61</a></div><div>[8] <a href="https://opendev.org/openstack/compute-hyperv/src/commit/4393891fa8356aa31b13bd57cf96cb5109acc7c3/compute_hyperv/nova/volumeops.py#L780">https://opendev.org/openstack/compute-hyperv/src/commit/4393891fa8356aa31b13bd57cf96cb5109acc7c3/compute_hyperv/nova/volumeops.py#L780</a></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
When I last brought up this topic in a different context, the alternative to cinder and nova was to add an LVM cyborg driver<br>
so that it could partition local NVMe devices and expose them to a guest, but I never wrote that and I don't think anyone else has.<br>
If you had a slightly different use case, such as providing an entire NVMe or SATA device to a guest, then cyborg would be how you would do<br>
that. Nova PCI passthrough is not an option as it is not multi-tenant safe: it is exclusively for stateless devices, not disks, so we do not<br>
have a way to erase the data when done. Cyborg, with its driver model, can fulfil the multi-tenancy requirement.<br>
We have previously rejected adding this capability into nova, so I don't expect us to add it any time in the near to medium term.<br>
<br>
We are trying to keep nova device management to stateless devices only.<br>
That said, we added Intel PMEM/NVDIMM support to nova and did handle both optional data transfer and multi-tenancy, but that was a non-trivial amount of<br>
work.<br>
<br>
<br>
> <br>
> c) Since a volume will always be bound to a certain host, I suppose <br>
> this will cause side-effects to instance scheduling?<br>
> With the volume remaining after an instance has been destroyed (defeating <br>
> the purpose of it being "ephemeral") I suppose any other instance <br>
> attaching this volume will<br>
> be scheduled on this very machine?<br>
> <br>
No, out of the box nova would have no knowledge of the volume's locality.<br>
> Is there any way around this? Maybe <br>
> a driver setting to have such volumes "self-destroy" if they are not <br>
> attached anymore?<br>
We hate those kinds of config options. Nova would not know that the volume is bound to the host at the scheduler level, and<br>
we would not really want to add orchestration logic like that for "sometimes it's OK to delete our tenants' data".<br>
By default today, if you cold/live migrated, the VM would move but the volume would not, and you would end up accessing it remotely.<br>
<br>
You would then have to do a volume migration separately in cinder, I think.<br>
> <br>
> d) Same question as with Nova: What happens when an instance is <br>
> live-migrated?<br>
> <br>
I think I answered this above?<br>
> <br>
> <br>
> Maybe others also have this use case and you can share your solution(s)?<br>
Adding a cyborg driver for LVM storage and integrating that with nova would likely be the simplest option.<br>
<br>
You could extend nova, but as I said we have rejected that in the past.<br>
That said, the generic resources table we added for PMEM was made generic so that future resources like local block<br>
devices could be tracked there without DB changes.<br>
<br>
Supporting different image_type backends for root, swap and ephemeral would be possible.<br>
It is an invasive change but might be more natural than the resources-table approach:<br>
you could reuse more of the code and inherit much of the existing functionality, but making sure you don't break<br>
anything in the process would take a lot of testing.<br>
<br>
> Thanks and with regards<br>
> <br>
> <br>
> Christian<br>
> <br>
> <br>
<br>
<br>
</blockquote></div></div>