Issues with virsh XML changing disk type after Nova live migration

Budden, Robert M. (GSFC-606.2)[InuTeq, LLC] robert.m.budden at nasa.gov
Wed Feb 8 16:28:03 UTC 2023


Hello Community,

We’ve hit a rather pesky bug that I’m hoping someone else has seen before.

We have an issue with Nova where a set of Cinder-backed VMs are having their XML definitions modified after a live migration. Specifically, the destination host ends up with the disk type changed from 'qcow2' to 'raw', which leaves the VM unbootable upon the next hard reboot (or when a nova stop/start is issued). Our only fix at the moment is to destroy the VM and recreate it from the persistent Cinder volume. Clearly this isn't a maintainable solution, as we rely on live migration for patching infrastructure.
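To make the failure concrete, here is the shape of the change we see in the domain XML. The paths, hashes, and UUIDs below are illustrative placeholders (our backend is NFS-style NetApp, so the disk is file-backed), not verbatim dumps from our hosts:

    Before migration (boots fine):

        <disk type='file' device='disk'>
          <driver name='qemu' type='qcow2' cache='none'/>
          <source file='/var/lib/nova/mnt/<hash>/volume-<uuid>'/>
          <target dev='vda' bus='virtio'/>
        </disk>

    After migration ('No bootable device'):

        <disk type='file' device='disk'>
          <driver name='qemu' type='raw' cache='none'/>
          <source file='/var/lib/nova/mnt/<hash>/volume-<uuid>'/>
          <target dev='vda' bus='virtio'/>
        </disk>

Only the driver type attribute flips; the source file still points at the same qcow2-formatted volume, so QEMU reads the qcow2 header as raw data and finds nothing bootable.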

Any thoughts or ideas would be most welcome. Gory details are below.

Here are the key details we have:


  *   Booting an instance from the existing volume via Nova works as expected.
  *   After a live migration, the 'type' field in the virsh XML is changed from 'qcow2' to 'raw' and we get a 'No bootable device' error from the VM (rightly so).
  *   After reverting this field automatically (a scripted workaround), a 'nova stop' followed by a 'nova start' again rewrites the XML with the bad type='raw'.
  *   It should be noted that before a live migration is performed, 'nova stop/start' functions as expected; no XML changes are written to the virsh definition.
  *   By injecting additional logging into the Python on these two hosts, I’ve narrowed it down to 'bdm_info.connection_info' on the destination end somehow choosing the 'raw' parameter (I’ve only gotten so far through the code at this time). The etree.XMLParser pass over the source hypervisor’s XML definition correctly parses out the 'qcow2' type (see the sketch after this list).
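For anyone who wants to check their own hosts, here is a minimal standalone sketch of that same etree-style parse. It is not our actual logging patch; it just assumes a domain XML dump from 'virsh dumpxml' and standard libvirt disk elements (the instance name is a placeholder):

    # check_disk_format.py -- print the <driver type=...> for every disk
    # in a libvirt domain XML dump, e.g.:
    #   virsh dumpxml instance-0000abcd > dom.xml
    #   python3 check_disk_format.py dom.xml
    import sys
    from lxml import etree

    tree = etree.parse(sys.argv[1], etree.XMLParser(remove_blank_text=True))
    for disk in tree.findall('./devices/disk'):
        driver = disk.find('driver')
        source = disk.find('source')
        print('device=%s driver_type=%s source=%s' % (
            disk.get('device'),
            driver.get('type') if driver is not None else '?',
            dict(source.attrib) if source is not None else {}))

Running it against dumps from the source and destination hypervisors makes the qcow2 -> raw flip easy to diff.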

Some background info:

  *   We’re currently running the Wallaby release with the latest patches.
  *   Hybrid OS (CentOS Stream 8 / Rocky Linux 8) on the underlying hardware, with the majority of the control plane on Stream aside from our Neutron network nodes. Computes are roughly split 50/50 Stream/Rocky.
  *   The Cinder volumes that experience this were copied in from a previous OpenStack cloud (Pike/Queens) on the backend, i.e. a NetApp SnapMirror, a new bootable Cinder volume created on the backend, and an internal NetApp zero-copy operation over the backing Cinder file.
     *   Other Cinder volumes that don’t exhibit this appear mostly to be in 'raw' format already (we haven’t vetted every single bootable Cinder volume yet).
  *   We’ve noticed these Cinder volumes lack some metadata fields that other Cinder volumes created by Glance have (see the comparison below).
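The comparison we're doing is just the image metadata on a known-good (Glance-created) volume versus one of the copied-in volumes; the volume IDs below are placeholders:

    # known-good, Glance-created volume: volume_image_metadata present
    openstack volume show <good-volume-id> -f json > good.json
    # copied-in volume: image metadata fields missing
    openstack volume show <bad-volume-id> -f json > bad.json
    diff good.json bad.json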

Ideas we’ve tried:

  *   Adjusting the 'use_cow_images' and 'force_raw_images' settings on both computes seems to have zero effect (snippet below).
  *   Manually setting the Cinder image metadata parameters, to no avail (e.g. openstack volume set --image-property disk_format=qcow2 <volume>).
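For completeness, those compute-side knobs live under [DEFAULT] in nova.conf; this is a representative snippet of one combination we toggled (restarting nova-compute in between), none of which changed the driver type written into the XML:

    [DEFAULT]
    # image-handling settings toggled on both source and destination computes;
    # no effect on the driver type Nova writes for these Cinder-backed disks
    use_cow_images = True
    force_raw_images = False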


Thanks!
-Robert