Issues with virsh XML changing disk type after Nova live migration

Sean Mooney smooney at redhat.com
Thu Feb 9 19:08:56 UTC 2023


On Thu, 2023-02-09 at 13:47 -0500, Mohammed Naser wrote:
> On Thu, Feb 9, 2023 at 11:23 AM Budden, Robert M. (GSFC-606.2)[InuTeq, LLC]
> <robert.m.budden at nasa.gov> wrote:
> 
> > Hello Community,
> > 
> > 
> > 
> > We’ve hit a rather pesky bug that I’m hoping someone else has seen before.
> > 
> > 
> > 
> > We have an issue with Nova where a set of Cinder backed VMs are having
> > their XML definitions modified after a live migration. Specifically, the
> > destination host ends up having the disk type changed from ‘qcow2’ to
> > ‘raw’. This ends up with the VM becoming unbootable upon the next hard
> > reboot (or nova stop/start is issued). The required fix ATM is for us to
> > destroy the VM and recreate from the persistent Cinder volume. Clearly this
> > isn’t a maintainable solution as we rely on live migration for patching
> > infrastructure.
> > 
> 
> Can you share the bits of the libvirt XML that are changing?  I'm curious
> to know what is your storage backend as well (Ceph?  LVM with Cinder?)

if they have files and it's a cinder backend then it's a driver that uses nfs as the protocol.

iscsi and rbd (lvm and ceph) won't have any files for the volume.

this sounds like an os-brick/cinder issue, possibly related to taking snapshots of the affected vms.
snapshotting cinder (nfs) volumes that are attached to vms is not currently supported:
https://review.opendev.org/c/openstack/cinder/+/857528
https://bugs.launchpad.net/cinder/+bug/1989514

it's a guess, but i'm pretty sure that if you snapshot the volume and it's then live migrated, it would 
revert back from qcow2 to raw due to that bug.
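a quick way to confirm the symptom on both hosts is to compare the driver type in the libvirt domain XML before and after migration. a minimal sketch of parsing a `virsh dumpxml` output — the XML snippet below is illustrative, a real dump has many more elements:

```python
# Sketch: extract the disk driver type(s) from a libvirt domain XML dump
# (e.g. `virsh dumpxml <instance>`). The snippet is illustrative only.
import xml.etree.ElementTree as ET

domain_xml = """
<domain type='kvm'>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/var/lib/nova/mnt/abc123/volume-0001'/>
      <target dev='vda' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""

def disk_driver_types(xml_text):
    # one entry per <disk> element, taken from its <driver type='...'/>
    root = ET.fromstring(xml_text)
    return [d.find('driver').get('type') for d in root.iter('disk')]

print(disk_driver_types(domain_xml))  # ['qcow2'] on the source; the broken destination shows 'raw'
```

comparing that against `qemu-img info` on the backing file on the nfs mount would tell you whether the XML or the file itself changed.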


> 
> 
> > 
> > 
> > Any thoughts, ideas, would be most welcome. Gory details are below.
> > 
> > 
> > 
> > Here’s the key details we have:
> > 
> > 
> > 
> >    - Nova boot an instance from the existing volume works as expected.
> >    - After live migration the ‘type’ field in the virsh XML is changed
> >    from ‘qcow2’ -> ‘raw’ and we get a ‘No bootable device’ from the VM
> >    (rightly so)
> >    - After reverting this field automatically (scripted solution), a
> >    ‘nova stop’ followed by a ‘nova start’ yet again rewrites the XML with the
> >    bad type=’raw’.
> >    - It should be noted that before a live migration is performed ‘nova
> >    stop/start’ functions as expected, no XML changes are written to the virsh
> >    definition.
> >    - Injecting additional logs into the python on these two hosts, I’ve
> >    narrowed it down to ‘bdm_info.connection_info’ on the destination end
> >    choosing the ‘raw’ parameter somehow (I’ve only got so far through the code
> >    at this time). The etree.XMLParser of the source hypervisor’s XML definition
> >    is properly parsing out the ‘qcow2’ type.
> > 
> > 
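for anyone tracing the same code path: on the destination the disk format comes from the volume's connection_info stored in the block_device_mapping record, not from the source XML. a minimal sketch of pulling the format out of that blob — the JSON shape here is an assumption modelled on what nfs-based cinder drivers typically return, not verified against wallaby:

```python
import json

# Illustrative connection_info blob as Nova stores it in the BDM table.
# The exact shape is an assumption based on nfs-type cinder connections,
# where data['format'] is what ends up as the libvirt driver type.
connection_info = json.dumps({
    "driver_volume_type": "nfs",
    "data": {
        "export": "netapp:/vol_openstack",
        "name": "volume-0001",
        "format": "raw",   # a stale 'raw' here would explain the XML rewrite
    },
})

def volume_format(raw_info):
    info = json.loads(raw_info)
    # defaulting to 'raw' mirrors the common fallback when no format is recorded
    return info.get("data", {}).get("format", "raw")

print(volume_format(connection_info))  # raw
```

if the stored blob on the destination says 'raw' while the file is qcow2, that would line up with the snapshot bug above.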
> I'm kinda curious how you're hosting `qcow2` inside of a storage backend;
> usually, the storage backends want raw images only...  Is there any chance
> you've played with the images, and are the images backing these cinder
> volumes raw or qcow2?
> 
> 
> > Some background info:
> > 
> >    - We’re currently running the Wallaby release with the latest patches.
> >    - Hybrid OS (Stream8 /Rocky 8) on underlying hardware with the
> >    majority of the Control Plane Stream aside from our Neutron Network Nodes.
> >    Computes are roughly split 50/50 Stream/Rocky.
> >    - The Cinder volumes that experience this were copied in from a
> >    previous OpenStack cloud (Pike/Queens) on the backend. I.e. NetApp
> >    snapmirror, new bootable Cinder volume created on the backend, and an
> >    internal NetApp operation for a zero copy operation over the backing Cinder
> >    file.
> >       - Other Cinder volumes that don’t exhibit this seem to mostly be in
> >       ‘raw’ format already (we haven’t vetted every single bootable Cinder volume
> >       yet).
> >    - We’ve noticed these Cinder volumes lack some metadata fields that
> >    other Cinder volumes created by Glance have (more details below).
> > 
> > 
> Are those old running VMs or an old cloud?  I wonder if those are old
> records that became raw with the upgrade *by default* and now you're stuck
> in this weird spot.  If you create new volumes, are they qcow2 or raw?
> 
> 
> 
> > 
> > 
> > Ideas we’ve tried:
> > 
> >    - Adjusting settings on both computes for ‘use_cow_images’ and
> >    ‘force_raw_images’ seems to have no effect.
> >    - Manually setting the Cinder metadata parameters to no avail (i.e.
> >    openstack volume set --image-property disk_format=qcow2).
> > 
> > 
> > 
> > 
> > 
> > Thanks!
> > 
> > -Robert
> > 
> 
> 




More information about the openstack-discuss mailing list