<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Mar 25, 2023 at 12:27 AM Sean Mooney <<a href="mailto:smooney@redhat.com">smooney@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">i responed in line but just a waring this is a usecase we ahve heard before.<br>
there is no simple option im afraid and there are many many sharp edges<br>
and severl littel know features/limitatiosn that your question puts you right in the<br>
middel of.<br>
<br>
On Fri, 2023-03-24 at 16:28 +0100, Christian Rohmann wrote:<br>
> Hello OpenStack-discuss,<br>
> <br>
> I am currently looking into how one can provide fast ephemeral storage <br>
> (backed by local NVME drives) to instances.<br>
> <br>
> <br>
> There seem to be two approaches and I would love to double-check my <br>
> thoughts and assumptions.<br>
> <br>
> 1) *Via Nova* instance storage and the configurable "ephemeral" volume <br>
> for a flavor<br>
> <br>
> a) We currently use Ceph RBD as image_type <br>
> (<a href="https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.images_type" rel="noreferrer" target="_blank">https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.images_type</a>), <br>
> so instance images are stored in Ceph, not locally on disk. I believe <br>
> this setting will also cause ephemeral volumes (destination_local) to be <br>
> placed on a RBD and not /var/lib/nova/instances?<br>
It should be in Ceph, yes. We do not support having the root/swap/ephemeral<br>
disks use different storage locations.<br>
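For reference, the relevant nova.conf section for the RBD-backed image_type looks roughly like this (the pool name, cephx user and secret UUID below are just placeholders, not values from any particular deployment):<br>
<br>
[libvirt]<br>
images_type = rbd<br>
images_rbd_pool = vms                      # placeholder pool name<br>
images_rbd_ceph_conf = /etc/ceph/ceph.conf<br>
rbd_user = cinder                          # placeholder cephx user<br>
rbd_secret_uuid = 00000000-0000-0000-0000-000000000000   # placeholder libvirt secret<br>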
> Or is there a setting to set a different backend for local block devices <br>
> providing "ephemeral" storage? So RBD for the root disk and a local LVM <br>
> VG for ephemeral?<br>
No, that would be a new feature, and not a trivial one, as you would have to make<br>
sure it works for live migration and cold migration.<br>
<br>
> <br>
> b) Will an ephemeral volume also be migrated when the instance is <br>
> shutoff as with live-migration?<br>
It should be. It is not included in snapshots, so it is not preserved<br>
when shelving; that means cross-cell cold migration will not preserve the disk.<br>
<br>
But for a normal cold migration it should be scp'd or rsynced along with the root disk<br>
if you are using the raw/qcow/flat images_type, if I remember correctly.<br>
With RBD or other shared storage like NFS it really should be preserved.<br>
<br>
One other thing to note is that ironic, and only ironic, supports the<br>
preserve_ephemeral option in the rebuild API.<br>
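For completeness, preserve_ephemeral is just a boolean in the rebuild action body of the compute API, roughly like the sketch below (the server and image UUIDs are placeholders); as noted, only the ironic driver honours it:<br>
<br>
POST /v2.1/servers/SERVER_UUID/action<br>
{"rebuild": {"imageRef": "IMAGE_UUID", "preserve_ephemeral": true}}<br>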
<br>
The libvirt driver will wipe the ephemeral disk if you rebuild or evacuate.<br>
> Or will there be an new volume created on the target host? I am asking <br>
> because I want to avoid syncing 500G or 1T when it's only "ephemeral" <br>
> and the instance will not expect any data on it on the next boot.<br>
I would personally consider it a bug if it was not transferred,<br>
though that does not mean it could not change in the future.<br>
This is very virt-driver-specific behaviour, by the way, and not one that is particularly well documented.<br>
The ephemeral disk should mostly exist for the lifetime of the instance, not the lifetime of a VM.<br>
<br>
For example, it should not get recreated via a simple reboot or live migration,<br>
and it should not get recreated for cold migration or resize,<br>
but it will get wiped for shelve_offload, cross-cell resize and evacuate.<br>
> <br>
> c) Is the size of the ephemeral storage for flavors a fixed size or just <br>
> the upper bound for users? So if I limit this to 1T, will such a flavor <br>
> always provision a block device with this size?<br>
flavor.ephemeral_gb is an upper bound, and end users can divide that between multiple ephemeral disks<br>
on the same instance. So if it is 100G you can ask for two 50G ephemeral disks.<br>
<br>
You specify the topology of the ephemeral disks using the block_device_mapping_v2 parameter on the server<br>
create call.<br>
This has been automated in recent versions of the openstack client,<br>
<br>
so you can do<br>
<br>
openstack server create --ephemeral size=50,format=ext4 --ephemeral size=50,format=vfat ...<br>
<br>
<a href="https://docs.openstack.org/python-openstackclient/latest/cli/command-objects/server.html#cmdoption-openstack-server-create-ephemeral" rel="noreferrer" target="_blank">https://docs.openstack.org/python-openstackclient/latest/cli/command-objects/server.html#cmdoption-openstack-server-create-ephemeral</a><br>
This is limited by<br>
<a href="https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.max_local_block_devices" rel="noreferrer" target="_blank">https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.max_local_block_devices</a><br>
<br>
> <br>
> I suppose using LVM this will be thin provisioned anyways?<br>
To use the LVM backend with libvirt you set<br>
<a href="https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.images_volume_group" rel="noreferrer" target="_blank">https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.images_volume_group</a><br>
to identify which LVM VG to use.<br>
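So a minimal nova.conf sketch for a local-LVM ephemeral backend on the compute node would be something like this (the VG name is a placeholder for a volume group you pre-create on the local NVMe drives):<br>
<br>
[libvirt]<br>
images_type = lvm<br>
images_volume_group = nova-local<br>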
<br>
<a href="https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.sparse_logical_volumes" rel="noreferrer" target="_blank">https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.sparse_logical_volumes</a> might enable thin provsion or it might<br>
work without it but see the note<br>
<br>
""" <br>
Warning<br>
<br>
This option is deprecated for removal since 18.0.0. Its value may be silently ignored in the future.<br>
<br>
Reason<br>
<br>
Sparse logical volumes is a feature that is not tested hence not supported. LVM logical volumes are preallocated by default. If you want thin<br>
provisioning, use Cinder thin-provisioned volumes.<br>
"""<br>
<br>
The nova LVM support has been in maintenance mode for many years.<br>
<br>
I'm not opposed to improving it, just calling out that it has bugs and no one has really<br>
worked on addressing them in 4 or 5 years, which is sad because it outperforms raw for local<br>
storage performance, and if thin provisioning still works it should outperform qcow too for a similar use case.<br>
<br>
You are well into undefined-behaviour land at this point, however.<br>
<br>
We do not test it, so we assume, until told otherwise, that it is broken.<br>
<br>
<br>
> <br>
> <br>
> 2) *Via Cinder*, running cinder-volume on each compute node to provide a <br>
> volume type "ephemeral", using e.g. the LVM driver<br>
> <br>
> a) While not really "ephemeral" and bound to the instance lifecycle, <br>
> this would allow users to provision ephemeral volumes just as they need them.<br>
> I suppose I could use backend specific quotas <br>
> (<a href="https://docs.openstack.org/cinder/latest/cli/cli-cinder-quotas.html#view-block-storage-quotas" rel="noreferrer" target="_blank">https://docs.openstack.org/cinder/latest/cli/cli-cinder-quotas.html#view-block-storage-quotas</a>) <br>
> to<br>
> limit the number or size of such volumes?<br>
> <br>
> b) Do I need to use the instance locality filter <br>
> (<a href="https://docs.openstack.org/cinder/latest/contributor/api/cinder.scheduler.filters.instance_locality_filter.html" rel="noreferrer" target="_blank">https://docs.openstack.org/cinder/latest/contributor/api/cinder.scheduler.filters.instance_locality_filter.html</a>) <br>
> then?<br>
<br>
That is an option, but not ideal, since it still means connecting to the volume via iSCSI or NVMe-oF, even if it is effectively via localhost,<br>
so you still have the network-layer overhead.<br>
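Roughly, the cinder side of that setup would be a per-compute LVM backend plus the local_to_instance scheduler hint, assuming the InstanceLocalityFilter is enabled in the cinder scheduler (the backend, VG and volume-type names below are placeholders):<br>
<br>
# cinder.conf on each compute node<br>
[DEFAULT]<br>
enabled_backends = ephemeral-lvm<br>
<br>
[ephemeral-lvm]<br>
volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver<br>
volume_group = cinder-ephemeral<br>
volume_backend_name = ephemeral<br>
target_protocol = iscsi<br>
target_helper = lioadm<br>
<br>
# request a volume on the same host as an existing instance<br>
openstack volume create --size 500 --type ephemeral --hint local_to_instance=INSTANCE_UUID scratch<br>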
<br></blockquote><div><br></div><div>I haven't tried it so I'm not 100% sure if it works but we do support local attach with the RBD connector.</div><div>While creating the connector object, we can pass "do_local_attach"=True[1] and that should do local attach when we call</div><div>connect volume for the RBD connector.</div><div>From a quick search, I can see all the consumers of this code are:</div><div>1) cinderlib[3]</div><div>2) nova hyperv driver[4]</div><div>3) python-brick-cinderclient-ext[5]</div><div>4) freezer[6]</div><div>5) zun[7]</div><div><br></div><div>It's interesting to see a project called compute-hyperv[8] (similar to nova's hyperv driver) using it as well. Not sure why it's created separately though.</div><div><br></div><div>[1] <a href="https://opendev.org/openstack/os-brick/src/commit/28ffcdbfa138859859beca2f80627c076269be56/os_brick/initiator/connectors/rbd.py">https://opendev.org/openstack/os-brick/src/commit/28ffcdbfa138859859beca2f80627c076269be56/os_brick/initiator/connectors/rbd.py</a></div><div>[2] <a href="https://opendev.org/openstack/os-brick/src/commit/28ffcdbfa138859859beca2f80627c076269be56/os_brick/initiator/connectors/rbd.py#L263-L267">https://opendev.org/openstack/os-brick/src/commit/28ffcdbfa138859859beca2f80627c076269be56/os_brick/initiator/connectors/rbd.py#L263-L267</a></div><div>[3] <a href="https://opendev.org/openstack/cinderlib/src/commit/9c37686f358e9228446cd85e19db26a56b2f9cbe/cinderlib/objects.py#L779">https://opendev.org/openstack/cinderlib/src/commit/9c37686f358e9228446cd85e19db26a56b2f9cbe/cinderlib/objects.py#L779</a></div><div>[4] <a href="https://opendev.org/openstack/nova/src/commit/29de62bf3b3bf5eda8986bc94babf1c94d67bd4e/nova/virt/hyperv/volumeops.py#L378">https://opendev.org/openstack/nova/src/commit/29de62bf3b3bf5eda8986bc94babf1c94d67bd4e/nova/virt/hyperv/volumeops.py#L378</a></div><div>[5] <a href="https://opendev.org/openstack/python-brick-cinderclient-ext/src/branch/master/brick_cinderclient_ext/client.py">https://opendev.org/openstack/python-brick-cinderclient-ext/src/branch/master/brick_cinderclient_ext/client.py</a></div><div>[6] <a href="https://opendev.org/openstack/freezer/src/commit/5effc1382833ad111249bcd279b11fbe95e10a6b/freezer/engine/osbrick/client.py#L78">https://opendev.org/openstack/freezer/src/commit/5effc1382833ad111249bcd279b11fbe95e10a6b/freezer/engine/osbrick/client.py#L78</a></div><div>[7] <a href="https://opendev.org/openstack/zun/src/commit/0288a4517846d07ee5724f86ebe34e364dc2bbe9/zun/volume/cinder_workflow.py#L60-L61">https://opendev.org/openstack/zun/src/commit/0288a4517846d07ee5724f86ebe34e364dc2bbe9/zun/volume/cinder_workflow.py#L60-L61</a></div><div>[8] <a href="https://opendev.org/openstack/compute-hyperv/src/commit/4393891fa8356aa31b13bd57cf96cb5109acc7c3/compute_hyperv/nova/volumeops.py#L780">https://opendev.org/openstack/compute-hyperv/src/commit/4393891fa8356aa31b13bd57cf96cb5109acc7c3/compute_hyperv/nova/volumeops.py#L780</a></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
When I last brought up this topic in a different context, the alternative to cinder and nova was to add an LVM cyborg driver<br>
so that it could partition local NVMe devices and expose them to a guest, but I never wrote that and I don't think anyone else has.<br>
If you had a slightly different use case, such as providing an entire NVMe or SATA device to a guest, then cyborg would be how you would do<br>
that. Nova PCI passthrough is not an option as it is not multi-tenant safe: it is exclusively for stateless devices, not disks, so we do not<br>
have a way to erase the data when done. Cyborg, with its driver model, can fulfil the multi-tenancy requirement.<br>
We have previously rejected adding this capability into nova, so I don't expect us to add it any time in the near to medium term.<br>
<br>
We are trying to keep nova device management to stateless devices only.<br>
That said, we added Intel PMEM/NVDIMM support to nova and did handle both optional data transfer and multi-tenancy, but that was a non-trivial amount of<br>
work.<br>
<br>
<br>
> <br>
> c) Since a volume will always be bound to a certain host, I suppose <br>
> this will cause side-effects to instance scheduling?<br>
> With the volume remaining after an instance has been destroyed (defeating <br>
> the purpose of it being "ephemeral") I suppose any other instance <br>
> attaching this volume will<br>
> be scheduled on this very machine?<br>
> <br>
No, out of the box nova would have no knowledge of the volume's locality.<br>
> Is there any way around this? Maybe <br>
> a driver setting to have such volumes "self-destroy" if they are not <br>
> attached anymore?<br>
We hate those kinds of config options. Nova would not know that the volume is bound to the host at the scheduler level, and<br>
we would not really want to add orchestration logic like that for "sometimes it's OK to delete our tenants' data".<br>
By default today, if you cold/live migrated, the VM would move but the volume would not, and you would end up accessing it remotely.<br>
<br>
You would then have to do a volume migration separately in cinder, I think.<br>
> <br>
> d) Same question as with Nova: What happens when an instance is <br>
> live-migrated?<br>
> <br>
I think I answered this above?<br>
> <br>
> <br>
> Maybe others also have this use case and you can share your solution(s)?<br>
Adding a cyborg driver for LVM storage and integrating that with nova would likely be the simplest option.<br>
<br>
You could extend nova, but as I said we have rejected that in the past.<br>
That said, the generic resources table we added for PMEM was made generic so that future resources like local block<br>
devices could be tracked there without DB changes.<br>
<br>
Supporting different image_type backends for root, swap and ephemeral would be possible.<br>
It is an invasive change but might be more natural than the resources-table approach:<br>
you could reuse more of the code and inherit much of the existing functionality, but making sure you don't break<br>
anything in the process would take a lot of testing.<br>
<br>
> Thanks and with regards<br>
> <br>
> <br>
> Christian<br>
> <br>
> <br>
<br>
<br>
</blockquote></div></div>