Issues with virsh XML changing disk type after Nova live migration
Sean Mooney
smooney at redhat.com
Thu Feb 9 19:08:56 UTC 2023
On Thu, 2023-02-09 at 13:47 -0500, Mohammed Naser wrote:
> On Thu, Feb 9, 2023 at 11:23 AM Budden, Robert M. (GSFC-606.2)[InuTeq, LLC]
> <robert.m.budden at nasa.gov> wrote:
>
> > Hello Community,
> >
> >
> >
> > We’ve hit a rather pesky bug that I’m hoping someone else has seen before.
> >
> >
> >
> > We have an issue with Nova where a set of Cinder backed VMs are having
> > their XML definitions modified after a live migration. Specifically, the
> > destination host ends up having the disk type changed from ‘qcow2’ to
> > ‘raw’. This ends up with the VM becoming unbootable upon the next hard
> > reboot (or when a nova stop/start is issued). The required fix ATM is for us to
> > destroy the VM and recreate from the persistent Cinder volume. Clearly this
> > isn’t a maintainable solution as we rely on live migration for patching
> > infrastructure.
> >
>
> Can you share the bits of the libvirt XML that are changing? I'm curious
> to know what is your storage backend as well (Ceph? LVM with Cinder?)
If the volumes are backed by files and it's a Cinder backend, then it's a driver that uses NFS as the protocol;
iSCSI and RBD (LVM and Ceph) won't have any files for the volume.
This sounds like an os-brick/Cinder issue, possibly related to taking snapshots of the affected VMs.
Snapshotting Cinder (NFS) volumes that are attached to VMs is not currently supported:
https://review.opendev.org/c/openstack/cinder/+/857528
https://bugs.launchpad.net/cinder/+bug/1989514
It's a guess, but I'm pretty sure that if you snapshot the volume and it is then live migrated, it would
revert from qcow2 back to raw due to that bug.
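
If you want to see exactly what changed between the two hosts, it's worth diffing the
<driver type=...> attribute of the file-backed disks in the domain XML on each side.
A minimal sketch, assuming you have `virsh dumpxml <domain>` output saved from both the
source and destination (standard libvirt domain XML layout):

# compare_disk_types.py - print the <driver type=...> of every file-backed
# disk in two libvirt domain XML dumps, e.g. taken with
# "virsh dumpxml <domain> > src.xml" on each host.
import sys
import xml.etree.ElementTree as ET

def disk_driver_types(xml_path):
    """Return {source file path: driver type} for each file-backed disk."""
    root = ET.parse(xml_path).getroot()
    types = {}
    for disk in root.findall("./devices/disk[@type='file']"):
        driver = disk.find("driver")
        source = disk.find("source")
        if driver is not None and source is not None:
            types[source.get("file")] = driver.get("type")
    return types

if __name__ == "__main__":
    src_xml, dst_xml = sys.argv[1], sys.argv[2]
    for path, dtype in sorted(disk_driver_types(src_xml).items()):
        print("source:      %s -> %s" % (path, dtype))
    for path, dtype in sorted(disk_driver_types(dst_xml).items()):
        print("destination: %s -> %s" % (path, dtype))

On an affected instance I'd expect the destination dump to show 'raw' where the source
shows 'qcow2' for the volume's file.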
>
>
> >
> >
> > Any thoughts or ideas would be most welcome. Gory details are below.
> >
> >
> >
> > Here’s the key details we have:
> >
> >
> >
> > - Nova boot an instance from the existing volume works as expected.
> > - After live migration the ‘type’ field in the virsh XML is changed
> > from ‘qcow2’ -> ‘raw’ and we get a ‘No bootable device’ from the VM
> > (rightly so)
> > - After reverting this field automatically (scripted solution), a
> > ‘nova stop’ followed by a ‘nova start’ yet again rewrites the XML with the
> > bad type=’raw’.
> > - It should be noted that before a live migration is performed ‘nova
> > stop/start’ functions as expected, no XML changes are written to the virsh
> > definition.
> >    - Injecting additional logs into the Python on these two hosts, I’ve
> >    narrowed it down to ‘bdm_info.connection_info’ on the destination end
> >    somehow choosing the ‘raw’ parameter (I’ve only got so far through the
> >    code at this time). The etree.XMLParser of the source hypervisor’s XML
> >    definition is properly parsing out the ‘qcow2’ type.
> >
> >
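On the bdm_info.connection_info point above: for file-backed (NFS-style) volumes, Nova rebuilds the
<disk> element on the destination from what is stored in the BDM's connection_info rather than from the
source XML, and if memory serves the format falls back to 'raw' when connection_info['data'] carries no
'format' key. So it's worth dumping that JSON directly. A rough sketch, assuming you've copied the
connection_info JSON out of nova's block_device_mapping table (or out of the extra debug logging you
added); the 'format'/'export'/'name' keys are what the NFS-style drivers usually populate, so treat them
as assumptions:

# show_conn_info.py - print the parts of a BDM's connection_info that decide
# the disk format.  Usage: python3 show_conn_info.py < conn_info.json
import json
import sys

conn_info = json.load(sys.stdin)
data = conn_info.get("data", {})

print("driver_volume_type:", conn_info.get("driver_volume_type"))
# For NFS-style backends the on-share file format is (or should be) recorded
# here; if the key is missing, 'raw' tends to get assumed.
print("format:            ", data.get("format", "<not set>"))
print("export/share:      ", data.get("export") or data.get("share"))
print("name/device_path:  ", data.get("name") or data.get("device_path"))
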
> I'm kind of curious how you're hosting `qcow2` inside of a storage backend;
> usually, storage backends want raw images only... Is there any chance
> you've played with the images, and are the images backing these cinder
> volumes raw or qcow2?
>
>
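To answer the raw-vs-qcow2 question definitively, it's easiest to ask the file on the share itself:
`qemu-img info` reports the real container format regardless of what the Cinder or Glance metadata
claims. A small sketch (the example path is just a placeholder for wherever the NFS shares are mounted
on the compute host):

# volume_format.py - report the actual on-disk format of a volume file using
# "qemu-img info --output=json".
# Usage: python3 volume_format.py /var/lib/nova/mnt/<hash>/volume-<uuid>
import json
import subprocess
import sys

def image_info(path):
    out = subprocess.run(
        ["qemu-img", "info", "--output=json", path],
        check=True, capture_output=True, text=True,
    )
    return json.loads(out.stdout)

if __name__ == "__main__":
    info = image_info(sys.argv[1])
    print("format:      ", info["format"])
    print("virtual size:", info["virtual-size"])
    if info.get("backing-filename"):
        print("backing file:", info["backing-filename"])
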
> > Some background info:
> >
> > - We’re currently running the Wallaby release with the latest patches.
> > - Hybrid OS (Stream8 /Rocky 8) on underlying hardware with the
> > majority of the Control Plane Stream aside from our Neutron Network Nodes.
> > Computes are roughly split 50/50 Stream/Rocky.
> > - The Cinder volumes that experience this were copied in from a
> > previous OpenStack cloud (Pike/Queens) on the backend. I.e. NetApp
> > snapmirror, new bootable Cinder volume created on the backend, and an
> > internal NetApp operation for a zero copy operation over the backing Cinder
> > file.
> >    - Other Cinder volumes that don’t exhibit this appear to mostly be in
> >    ‘raw’ format already (we haven’t vetted every single bootable Cinder
> >    volume yet).
> >    - We’ve noticed these Cinder volumes lack some metadata fields that
> >    other Cinder volumes created from Glance have (more details below).
> >
> >
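On the missing metadata fields: it might help to diff what Cinder actually records for one of the
snapmirrored volumes against a volume that was created from Glance in this cloud. A rough sketch using
openstacksdk; the cloud name and volume IDs are placeholders:

# compare_volume_image_metadata.py - diff the image metadata Cinder holds for
# two volumes, e.g. a snapmirrored one and a known-good Glance-created one.
import openstack

conn = openstack.connect(cloud="mycloud")  # placeholder clouds.yaml entry

def image_meta(volume_id):
    vol = conn.block_storage.get_volume(volume_id)
    # volume_image_metadata is only present for volumes created from images,
    # so fall back to an empty dict when Cinder never recorded it.
    return vol.to_dict().get("volume_image_metadata") or {}

good = image_meta("GOOD-VOLUME-UUID")      # placeholder
broken = image_meta("BROKEN-VOLUME-UUID")  # placeholder

for key in sorted(set(good) | set(broken)):
    print("%-20s good=%-14s broken=%s" % (key, good.get(key), broken.get(key)))
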
> Are those old running VMs or an old cloud? I wonder if those are old
> records that became raw with the upgrade *by default* and now you're stuck
> in this weird spot. If you create new volumes, are they qcow2 or raw?
>
>
>
> >
> >
> > Ideas we’ve tried:
> >
> >    - Adjusting settings on both computes for ‘use_cow_images’ and
> >    ‘force_raw_images’ seems to have zero effect.
> > - Manually setting the Cinder metadata parameters to no avail (i.e.
> > openstack volume set --image-property disk_format=qcow2).
> >
> >
> >
> >
> >
> > Thanks!
> >
> > -Robert
> >
>
>