Issues with virsh XML changing disk type after Nova live migration

Mohammed Naser mnaser at vexxhost.com
Thu Feb 9 18:47:46 UTC 2023


On Thu, Feb 9, 2023 at 11:23 AM Budden, Robert M. (GSFC-606.2)[InuTeq, LLC]
<robert.m.budden at nasa.gov> wrote:

> Hello Community,
>
>
>
> We’ve hit a rather pesky bug that I’m hoping someone else has seen before.
>
>
>
> We have an issue with Nova where a set of Cinder backed VMs are having
> their XML definitions modified after a live migration. Specifically, the
> destination host ends up having the disk type changed from ‘qcow2’ to
> ‘raw’. This ends up with the VM becoming unbootable upon the next hard
> reboot (or nova stop/start is issued). The required fix ATM is for us to
> destroy the VM and recreate from the persistent Cinder volume. Clearly this
> isn’t a maintainable solution as we rely on live migration for patching
> infrastructure.
>

Can you share the bits of the libvirt XML that are changing?  I'm curious
to know what your storage backend is as well (Ceph?  LVM with Cinder?)
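For reference, the element that keeps changing is the disk driver type in the domain XML; a representative fragment (the paths and attributes here are illustrative, not from the reporter's environment) looks like:

```xml
<!-- Representative libvirt disk element; source path is illustrative -->
<disk type='file' device='disk'>
  <!-- This is the attribute that reportedly flips from qcow2 to raw -->
  <driver name='qemu' type='qcow2' cache='none'/>
  <source file='/var/lib/nova/mnt/volume-sample'/>
  <target dev='vda' bus='virtio'/>
</disk>
```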


>
>
> Any thoughts or ideas would be most welcome. Gory details are below.
>
>
>
> Here’s the key details we have:
>
>
>
>    - Booting a Nova instance from the existing volume works as expected.
>    - After live migration the ‘type’ field in the virsh XML is changed
>    from ‘qcow2’ -> ‘raw’ and we get a ‘No bootable device’ from the VM
>    (rightly so)
>    - After reverting this field automatically (scripted solution), a
>    ‘nova stop’ followed by a ‘nova start’ yet again rewrites the XML with the
>    bad type=’raw’.
>    - It should be noted that before a live migration is performed ‘nova
>    stop/start’ functions as expected, no XML changes are written to the virsh
>    definition.
>    - By injecting additional logging into the Python on these two hosts,
>    I’ve narrowed it down to ‘bdm_info.connection_info’ on the destination
>    end choosing the ‘raw’ parameter somehow (I’ve only gotten so far
>    through the code at this time). The etree.XMLParser run over the source
>    hypervisor’s XML definition properly parses out the ‘qcow2’ type.
>
>
I'm kinda curious how you're hosting `qcow2` inside of a storage backend;
usually, storage backends want raw images only...  Is there any chance
you've played with the images, and are the images backing these Cinder
volumes raw or qcow2?
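One quick way to confirm the mismatch is to compare the driver type in the domain XML against what `qemu-img info` reports for the backing file. A minimal sketch of the XML side (the domain XML here is a made-up sample, not the reporter's actual definition):

```python
import xml.etree.ElementTree as ET

def disk_driver_types(domain_xml: str) -> dict:
    """Map each disk target dev (vda, vdb, ...) to its <driver type=...>."""
    root = ET.fromstring(domain_xml)
    types = {}
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        driver = disk.find("driver")
        if target is not None and driver is not None:
            types[target.get("dev")] = driver.get("type")
    return types

# Made-up sample domain XML for illustration
SAMPLE = """
<domain type='kvm'>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/var/lib/nova/mnt/volume-sample'/>
      <target dev='vda' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""

print(disk_driver_types(SAMPLE))  # {'vda': 'qcow2'}
```

On a real host you'd feed this the output of `virsh dumpxml <instance>` and compare it against `qemu-img info` on the source file to see which side is wrong.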


> Some background info:
>
>    - We’re currently running the Wallaby release with the latest patches.
>    - Hybrid OS (Stream 8 / Rocky 8) on the underlying hardware, with the
>    majority of the Control Plane on Stream aside from our Neutron Network
>    Nodes. Computes are roughly split 50/50 Stream/Rocky.
>    - The Cinder volumes that experience this were copied in from a
>    previous OpenStack cloud (Pike/Queens) on the backend. I.e. NetApp
>    snapmirror, new bootable Cinder volume created on the backend, and an
>    internal NetApp operation for a zero copy operation over the backing Cinder
>    file.
>       - Other Cinder volumes that don’t exhibit this appear to mostly be
>       in ‘raw’ format already (we haven’t vetted every single bootable
>       Cinder volume yet).
>    - We’ve noticed these Cinder volumes lack some metadata fields that
>    other Cinder volumes created by Glance have (more details below).
>
>
Are those old running VMs or an old cloud?  I wonder if those are old
records that became raw with the upgrade *by default* and now you're stuck
in this weird spot.  If you create new volumes, are they qcow2 or raw?



>
>
> Ideas we’ve tried:
>
>    - Adjusting the ‘use_cow_images’ and ‘force_raw_images’ settings on
>    both computes seems to have zero effect.
>    - Manually setting the Cinder metadata parameters to no avail (i.e.
>    openstack volume set --image-property disk_format=qcow2).
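Since `bdm_info.connection_info` seems to be the culprit, it may also be worth looking at what format the persisted connection_info actually carries; Nova stores it as JSON on the block device mapping record. A minimal sketch (the sample connection_info is made up, and the location of the `format` key under `data` is an assumption based on what NFS-style Cinder connectors typically report; other backends may not set it at all):

```python
import json
from typing import Optional

def connection_info_format(connection_info_json: str) -> Optional[str]:
    """Return the disk format recorded in a BDM's connection_info, if any.

    Assumption: NFS-style Cinder connectors report the volume format under
    data['format']; absence of the key tends to mean 'treat as raw'.
    """
    info = json.loads(connection_info_json)
    return info.get("data", {}).get("format")

# Made-up sample resembling what a destination host might have stored
sample = json.dumps({
    "driver_volume_type": "nfs",
    "data": {
        "export": "filer:/vol/cinder",
        "name": "volume-sample",
        "format": "raw",
    },
})

print(connection_info_format(sample))  # raw
```

If the stored connection_info says `raw` while the file on the NetApp share is really qcow2, that would explain why the destination keeps rewriting the XML the same wrong way after every migration and stop/start.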
>
>
>
>
>
> Thanks!
>
> -Robert
>


-- 
Mohammed Naser
VEXXHOST, Inc.


More information about the openstack-discuss mailing list