[EXTERNAL] Re: Issues with virsh XML changing disk type after Nova live migration

Budden, Robert M. (GSFC-606.2)[InuTeq, LLC] robert.m.budden at nasa.gov
Mon Feb 13 15:25:46 UTC 2023


Hi Sean,

Thanks for the reply!

Yes, the backing store is files on an NFS mount coming off a NetApp filer. More details are in my reply to Mohammed. FWIW, these Cinder volumes have never been snapshotted, but the bug you mention sounds similar to what we are seeing.
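For what it's worth, here is the kind of check we have been running to confirm what is actually on disk for these volumes. It's just a sketch; the mount path is a placeholder for our NetApp export, not anything Nova- or Cinder-specific:

    # Sketch: ask qemu-img what format the Cinder NFS-backed volume file really is.
    # VOLUME_PATH is a placeholder for the file on the NFS mount.
    import json
    import subprocess

    VOLUME_PATH = "/var/lib/nova/mnt/<mount-hash>/volume-<uuid>"

    def detect_format(path):
        """Return the format qemu-img detects for the backing file."""
        out = subprocess.check_output(["qemu-img", "info", "--output=json", path])
        return json.loads(out)["format"]  # e.g. "qcow2" or "raw"

    print(detect_format(VOLUME_PATH))

That at least tells us what the file itself is, independent of what the XML or the connection_info claim.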

Thanks,
-Robert

On 2/9/23, 2:09 PM, "Sean Mooney" <smooney at redhat.com> wrote:


On Thu, 2023-02-09 at 13:47 -0500, Mohammed Naser wrote:
> On Thu, Feb 9, 2023 at 11:23 AM Budden, Robert M. (GSFC-606.2)[InuTeq, LLC]
> <robert.m.budden at nasa.gov> wrote:
> 
> > Hello Community,
> > 
> > 
> > 
> > We’ve hit a rather pesky bug that I’m hoping someone else has seen before.
> > 
> > 
> > 
> > We have an issue with Nova where a set of Cinder backed VMs are having
> > their XML definitions modified after a live migration. Specifically, the
> > destination host ends up having the disk type changed from ‘qcow2’ to
> > ‘raw’. This ends up with the VM becoming unbootable upon the next hard
> > reboot (or nova stop/start is issued). The required fix ATM is for us to
> > destroy the VM and recreate from the persistent Cinder volume. Clearly this
> > isn’t a maintainable solution as we rely on live migration for patching
> > infrastructure.
> > 
> 
> Can you share the bits of the libvirt XML that are changing? I'm curious
> to know what is your storage backend as well (Ceph? LVM with Cinder?)


If there are files and it's a Cinder backend, then it's a driver that uses NFS as the protocol.


iSCSI and RBD (LVM and Ceph) won't have any files for the volume.
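As a quick way to see which disks are file-backed and what driver format libvirt has for them, something like this against the domain XML works. It's only a sketch; the domain name is a placeholder and the element paths follow the libvirt domain schema:

    # Sketch: list each disk's type, source and driver format from "virsh dumpxml".
    import subprocess
    import xml.etree.ElementTree as ET

    def disk_formats(domain):
        xml = subprocess.check_output(["virsh", "dumpxml", domain], text=True)
        for disk in ET.fromstring(xml).findall("./devices/disk"):
            driver = disk.find("driver")
            source = disk.find("source")
            yield (disk.get("type"),                                     # file / block / network
                   source.attrib if source is not None else {},         # path or target
                   driver.get("type") if driver is not None else None)  # qcow2 / raw

    for entry in disk_formats("instance-00000042"):  # placeholder domain name
        print(entry)

Running that on the source and destination hosts before and after a migration should show exactly which disk flips from qcow2 to raw.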


This sounds like an os-brick/Cinder issue, possibly related to taking snapshots of the affected VMs.
Snapshotting Cinder (NFS) volumes that are attached to VMs is not currently supported:
https://review.opendev.org/c/openstack/cinder/+/857528
https://bugs.launchpad.net/cinder/+bug/1989514


It's a guess, but I'm pretty sure that if you snapshot the volume and it is then live migrated, it would revert
from qcow2 to raw due to that bug.
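One way to sanity-check that without patching Nova is to pull the connection_info JSON for the volume out of the block_device_mapping table in the Nova DB and compare the format it records with what qemu-img says about the file. A rough sketch (the volume path is passed in by hand because the field layout inside connection_info varies by driver):

    # Sketch: compare the format recorded in a BDM's connection_info with the
    # format qemu-img actually detects for the volume file.
    # Usage: python check_format.py connection_info.json /path/to/volume-file
    import json
    import subprocess
    import sys

    with open(sys.argv[1]) as f:          # connection_info pasted from the nova DB
        connection_info = json.load(f)

    # a missing 'format' key is treated as raw here, which as far as I can tell
    # is also what the compute ends up doing
    recorded = connection_info.get("data", {}).get("format", "raw")

    detected = json.loads(subprocess.check_output(
        ["qemu-img", "info", "--output=json", sys.argv[2]]))["format"]

    print(f"connection_info says {recorded!r}, qemu-img says {detected!r}")
    if recorded != detected:
        print("mismatch -- the next stop/start or migration may use the wrong format")

If the stored connection_info already says raw for a file that is really qcow2, that would explain why the destination XML comes out wrong even though the source XML is correct.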




> 
> 
> > 
> > 
> > Any thoughts or ideas would be most welcome. Gory details are below.
> > 
> > 
> > 
> > Here’s the key details we have:
> > 
> > 
> > 
> > - Booting an instance from the existing volume with Nova works as expected.
> > - After live migration, the ‘type’ field in the virsh XML is changed
> > from ‘qcow2’ -> ‘raw’ and we get a ‘No bootable device’ error from the VM
> > (rightly so).
> > - After reverting this field automatically (scripted solution), a
> > ‘nova stop’ followed by a ‘nova start’ yet again rewrites the XML with the
> > bad type=’raw’.
> > - It should be noted that before a live migration is performed, ‘nova
> > stop/start’ functions as expected; no XML changes are written to the virsh
> > definition.
> > - By injecting additional logs into the Python on these two hosts, I’ve
> > narrowed it down to ‘bdm_info.connection_info’ on the destination end
> > choosing the ‘raw’ parameter somehow (I’ve only gotten so far through the
> > code at this time). The etree.XMLParser run over the source hypervisor’s XML
> > definition properly parses out the ‘qcow2’ type (see the sketch below).
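That last point matches how the format selection works, as far as I understand it: for a file-backed Cinder volume the driver format that ends up in the libvirt XML comes from the 'format' key inside the BDM's connection_info, and when that key is missing or wrong the fallback is effectively raw. A rough illustration only, not the actual Nova code:

    # Rough illustration, not the real nova code path: the driver format handed
    # to libvirt for a file-backed cinder volume follows connection_info['data'],
    # and an absent 'format' key effectively means raw.
    def pick_driver_format(connection_info):
        data = connection_info.get("data", {})
        return data.get("format", "raw")

    # connection_info that lost its format (what the destination seems to see):
    print(pick_driver_format({"data": {"export": "filer:/vol/cinder",
                                       "name": "volume-1234"}}))           # -> raw
    # connection_info that still carries it:
    print(pick_driver_format({"data": {"format": "qcow2",
                                       "name": "volume-1234"}}))           # -> qcow2

So the interesting question is what the destination's bdm_info.connection_info actually contains after the migration, not what the source XML parses out to.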
> > 
> > 
> I'm kinda curious how you're hosting `qcow2` inside of a storage backend;
> usually, storage backends want raw images only... Is there any chance
> you've played with the images, and are the images these Cinder volumes were
> created from raw or qcow2?
> 
> 
> > Some background info:
> > 
> > - We’re currently running the Wallaby release with the latest patches.
> > - Hybrid OS (Stream 8 / Rocky 8) on the underlying hardware, with the
> > majority of the control plane on Stream aside from our Neutron network nodes.
> > Computes are roughly split 50/50 Stream/Rocky.
> > - The Cinder volumes that experience this were copied in from a
> > previous OpenStack cloud (Pike/Queens) at the backend level, i.e. NetApp
> > SnapMirror, a new bootable Cinder volume created on the backend, and an
> > internal NetApp zero-copy operation over the backing Cinder file.
> > - Other Cinder volumes that don’t exhibit this seem to mostly be in
> > ‘raw’ format already (we haven’t vetted every single bootable Cinder volume
> > yet).
> > - We’ve noticed these Cinder volumes lack some metadata fields that
> > other Cinder volumes created from Glance images have (more details below).
> > 
> > 
> Are those old running VMs or an old cloud? I wonder if those are old
> records that became raw with the upgrade *by default* and now you're stuck
> in this weird spot. If you create new volumes, are they qcow2 or raw?
> 
> 
> 
> > 
> > 
> > Ideas we’ve tried:
> > 
> > - Adjusting the ‘use_cow_images’ and
> > ‘force_raw_images’ settings on both computes seems to have zero effect.
> > - Manually setting the Cinder metadata parameters to no avail (i.e.
> > openstack volume set --image-property disk_format=qcow2).
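On the metadata point: as far as I can tell the image properties on the volume do not feed back into an existing attachment; the compute keeps using the connection_info it already has stored for the BDM, so changing disk_format there would not be expected to help until the volume is detached and re-attached. A quick way to see what the volume itself reports, sketched with openstacksdk (the cloud name and volume ID are placeholders):

    # Sketch: show what cinder reports for the volume and its copied image metadata.
    # "mycloud" and the volume UUID are placeholders.
    import openstack

    conn = openstack.connect(cloud="mycloud")
    vol = conn.block_storage.get_volume("11111111-2222-3333-4444-555555555555")

    print(vol.name, vol.status)
    # image properties copied onto the volume at create time (disk_format, if any)
    print(getattr(vol, "volume_image_metadata", None))

Comparing that output between a volume that breaks and one that does not might show which metadata fields the imported volumes are missing.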
> > 
> > 
> > 
> > 
> > 
> > Thanks!
> > 
> > -Robert
> > 
> 
> 
