Issues with virsh XML changing disk type after Nova live migration
Hello Community,

We’ve hit a rather pesky bug that I’m hoping someone else has seen before. We have an issue with Nova where a set of Cinder-backed VMs are having their XML definitions modified after a live migration. Specifically, the destination host ends up having the disk type changed from ‘qcow2’ to ‘raw’. This leaves the VM unbootable upon the next hard reboot (or when a nova stop/start is issued). The required fix at the moment is for us to destroy the VM and recreate it from the persistent Cinder volume. Clearly this isn’t a maintainable solution, as we rely on live migration for patching infrastructure. Any thoughts or ideas would be most welcome. Gory details are below.

Here are the key details we have:

* Nova booting an instance from the existing volume works as expected.
* After live migration, the ‘type’ field in the virsh XML is changed from ‘qcow2’ to ‘raw’ and we get a ‘No bootable device’ from the VM (rightly so).
* After reverting this field automatically (scripted solution), a ‘nova stop’ followed by a ‘nova start’ yet again rewrites the XML with the bad type=’raw’.
* It should be noted that before a live migration is performed, ‘nova stop/start’ functions as expected; no XML changes are written to the virsh definition.
* Injecting additional logs into the Python on these two hosts, I’ve narrowed it down to ‘bdm_info.connection_info’ on the destination end choosing the ‘raw’ parameter somehow (I’ve only got so far through the code at this time). The etree.XMLParser of the source hypervisor’s XML definition is properly parsing out the ‘qcow2’ type.

Some background info:

* We’re currently running the Wallaby release with the latest patches.
* Hybrid OS (Stream 8 / Rocky 8) on the underlying hardware, with the majority of the control plane on Stream aside from our Neutron network nodes. Computes are roughly split 50/50 Stream/Rocky.
* The Cinder volumes that experience this were copied in from a previous OpenStack cloud (Pike/Queens) on the backend, i.e. NetApp SnapMirror, a new bootable Cinder volume created on the backend, and an internal NetApp zero-copy operation over the backing Cinder file.
* Other Cinder volumes that don’t exhibit this appear to mostly be in ‘raw’ format already (we haven’t vetted every single bootable Cinder volume yet).
* We’ve noticed these Cinder volumes lack some metadata fields that other Cinder volumes created by Glance have (more details below).

Ideas we’ve tried:

* Adjusting settings on both computes for ‘use_cow_images’ and ‘force_raw_images’ seems to have zero effect.
* Manually setting the Cinder metadata parameters, to no avail (i.e. openstack volume set --image-property disk_format=qcow2).

Thanks!
-Robert
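(For reference, a minimal sketch of the kind of XML check described above -- not the actual script, just an illustration. It assumes the domain XML has been dumped to a file, e.g. with 'virsh dumpxml <instance> > domain.xml'; the file name is a placeholder.)

# Illustrative sketch: report each disk's driver type from a libvirt domain XML dump.
from lxml import etree

def disk_driver_types(xml_path):
    tree = etree.parse(xml_path)
    disks = []
    for disk in tree.findall("devices/disk"):
        driver = disk.find("driver")
        source = disk.find("source")
        disks.append((
            source.get("file") if source is not None else None,
            driver.get("type") if driver is not None else None,
        ))
    return disks

if __name__ == "__main__":
    for path, disk_type in disk_driver_types("domain.xml"):
        print(f"{path}: type={disk_type}")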
On Thu, Feb 9, 2023 at 11:23 AM Budden, Robert M. (GSFC-606.2)[InuTeq, LLC] <robert.m.budden@nasa.gov> wrote:
Hello Community,
We’ve hit a rather pesky bug that I’m hoping someone else has seen before.
We have an issue with Nova where a set of Cinder-backed VMs are having their XML definitions modified after a live migration. Specifically, the destination host ends up having the disk type changed from ‘qcow2’ to ‘raw’. This leaves the VM unbootable upon the next hard reboot (or when a nova stop/start is issued). The required fix at the moment is for us to destroy the VM and recreate it from the persistent Cinder volume. Clearly this isn’t a maintainable solution, as we rely on live migration for patching infrastructure.
Can you share the bits of the libvirt XML that are changing? I'm curious to know what is your storage backend as well (Ceph? LVM with Cinder?)
Any thoughts, ideas, would be most welcome. Gory details are below.
Here’s the key details we have:
- Nova booting an instance from the existing volume works as expected.
- After live migration, the ‘type’ field in the virsh XML is changed from ‘qcow2’ to ‘raw’ and we get a ‘No bootable device’ from the VM (rightly so).
- After reverting this field automatically (scripted solution), a ‘nova stop’ followed by a ‘nova start’ yet again rewrites the XML with the bad type=’raw’.
- It should be noted that before a live migration is performed, ‘nova stop/start’ functions as expected; no XML changes are written to the virsh definition.
- Injecting additional logs into the Python on these two hosts, I’ve narrowed it down to ‘bdm_info.connection_info’ on the destination end choosing the ‘raw’ parameter somehow (I’ve only got so far through the code at this time). The etree.XMLParser of the source hypervisor’s XML definition is properly parsing out the ‘qcow2’ type.
I'm kinda curious how you're hosting `qcow2` inside of a storage backend; usually, storage backends want raw images only... Is there any chance you've played with the images, and are the images backing these cinder volumes raw or qcow2?
Some background info:
- We’re currently running the Wallaby release with the latest patches.
- Hybrid OS (Stream 8 / Rocky 8) on the underlying hardware, with the majority of the control plane on Stream aside from our Neutron network nodes. Computes are roughly split 50/50 Stream/Rocky.
- The Cinder volumes that experience this were copied in from a previous OpenStack cloud (Pike/Queens) on the backend, i.e. NetApp SnapMirror, a new bootable Cinder volume created on the backend, and an internal NetApp zero-copy operation over the backing Cinder file.
- Other Cinder volumes that don’t exhibit this appear to mostly be in ‘raw’ format already (we haven’t vetted every single bootable Cinder volume yet).
- We’ve noticed these Cinder volumes lack some metadata fields that other Cinder volumes created by Glance have (more details below).
Are those old running VMs or an old cloud? I wonder if those are old records that became raw with the upgrade *by default* and now you're stuck in this weird spot. If you create new volumes, are they qcow2 or raw?
--
Mohammed Naser
VEXXHOST, Inc.
On Thu, 2023-02-09 at 13:47 -0500, Mohammed Naser wrote:
On Thu, Feb 9, 2023 at 11:23 AM Budden, Robert M. (GSFC-606.2)[InuTeq, LLC] <robert.m.budden@nasa.gov> wrote:
Hello Community,
We’ve hit a rather pesky bug that I’m hoping someone else has seen before.
We have an issue with Nova where a set of Cinder-backed VMs are having their XML definitions modified after a live migration. Specifically, the destination host ends up having the disk type changed from ‘qcow2’ to ‘raw’. This leaves the VM unbootable upon the next hard reboot (or when a nova stop/start is issued). The required fix at the moment is for us to destroy the VM and recreate it from the persistent Cinder volume. Clearly this isn’t a maintainable solution, as we rely on live migration for patching infrastructure.
Can you share the bits of the libvirt XML that are changing? I'm curious to know what is your storage backend as well (Ceph? LVM with Cinder?)
If they have files and it's a Cinder backend, then it's a driver that uses NFS as the protocol; iSCSI and RBD (LVM and Ceph) won't have any files for the volume. This sounds like an os-brick/ceph issue, possibly related to taking snapshots of the affected VMs. Snapshotting Cinder (NFS) volumes that are attached to VMs is not currently supported:

https://review.opendev.org/c/openstack/cinder/+/857528
https://bugs.launchpad.net/cinder/+bug/1989514

It's a guess, but I'm pretty sure that if you snapshot the volume and it's then live migrated, it would revert back from qcow2 to raw due to that bug.
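(A rough sketch of how one might inspect what format Nova has recorded in the block device mapping's connection_info for a given volume -- the table and column names below follow the Nova schema as I understand it, and the host, credentials and volume UUID are placeholders, so treat this as illustrative only.)

# Illustrative sketch: dump the connection_info Nova stores for a volume attachment.
import json

import pymysql  # assumption: MySQL/MariaDB backing the Nova database

VOLUME_ID = "00000000-0000-0000-0000-000000000000"  # placeholder volume UUID

conn = pymysql.connect(host="nova-db-host", user="nova", password="secret", database="nova")
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT connection_info FROM block_device_mapping "
            "WHERE volume_id = %s AND deleted = 0",
            (VOLUME_ID,),
        )
        for (raw_info,) in cur.fetchall():
            if not raw_info:
                continue
            info = json.loads(raw_info)
            # For file-backed (NFS) volumes the driver-reported details, including any
            # format field, live under 'data'; print it all rather than guess the key.
            print(json.dumps(info.get("data", {}), indent=2))
finally:
    conn.close()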
Hi Sean,

Thanks for the reply! Yes, the backend files are on an NFS mount coming off a NetApp filer. More details in my reply to Mohammed. FWIW, these Cinder volumes have never been snapshotted, but the bug you mention sounds similar to what we are seeing.

Thanks,
-Robert
Hi Mohammed,

Thanks for the reply!

Here’s a diff of the XML:

<devices>
  <emulator>/usr/libexec/qemu-kvm</emulator>
  <disk type='file' device='disk'>
-   <driver name='qemu' type='qcow2' cache='none' io='native'/>
+   <driver name='qemu' type='raw' cache='none' io='native'/>
    <source file='/var/lib/nova/mnt/5ca8cec36979a9d9110f2038915fe227/volume-e87a5bce-1837-470f-b530-5675eba0f965' index='1'/>
    <backingStore/>
    <target dev='vda' bus='virtio'/>

Our storage backends are NetApp FAS 27xx series. We have a redundant pair of filers in each AZ for collocated storage. Cinder is configured to use the NetApp driver, so it’s basically an NFS backend plus some NetApp copy-offload enhancements.

These specific images were manually copied over from a previous OpenStack Queens cloud that used the same NetApp filer. To avoid needing to perform data copies, the scripted import process went like this:

* Create a new Cinder volume in our Wallaby cloud and grab the UUID.
* Issue an ONTAP command to have the NetApp perform a metadata-only operation to “copy” the data from the Queens Cinder volume over the top of the backend file of the new Wallaby Cinder volume created above.

I suspect your final comment is what’s happening. The default backend in Wallaby appears to be raw, where our old Queens cloud was qcow2. The way we did our import likely left us in this weird state. I would have assumed we could work around this by setting use_cow_images and force_raw_images appropriately, or manually setting the metadata on the Cinder volume, but neither seems to have worked. Additionally, it’s a bit strange that the initial Nova boot succeeds and the problem appears to only be introduced after a live migration.

Thanks,
-Robert
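(As a side note for anyone debugging the same thing: a hedged sketch of how to confirm what format the backing file actually is on disk, independent of what the XML claims. The path is the one from the diff above and is specific to our environment.)

# Illustrative sketch: check the on-disk format of the Cinder-backed file with qemu-img.
import json
import subprocess

backing_file = (
    "/var/lib/nova/mnt/5ca8cec36979a9d9110f2038915fe227/"
    "volume-e87a5bce-1837-470f-b530-5675eba0f965"
)

# 'qemu-img info --output=json' reports the actual format of the file.
out = subprocess.run(
    ["qemu-img", "info", "--output=json", backing_file],
    check=True, capture_output=True, text=True,
).stdout
print("on-disk format:", json.loads(out)["format"])  # expected 'qcow2' for these volumes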
On Mon, 2023-02-13 at 15:22 +0000, Budden, Robert M. (GSFC-606.2)[InuTeq, LLC] wrote:
Hi Mohammed,
Thanks for the reply!
Here’s a diff of the XML:
<devices>
<emulator>/usr/libexec/qemu-kvm</emulator>
<disk type='file' device='disk'>
- <driver name='qemu' type='qcow2' cache='none' io='native'/>
+ <driver name='qemu' type='raw' cache='none' io='native'/>
<source file='/var/lib/nova/mnt/5ca8cec36979a9d9110f2038915fe227/volume-e87a5bce-1837-470f-b530-5675eba0f965' index='1'/>
<backingStore/>
<target dev='vda' bus='virtio'/>
Our storage backends are NetApp FAS 27xx series. We have a redundant pair of filers in each AZ for collocated storage. Cinder is configured to use the NetApp driver, so it’s basically NFS backend plus some NetApp copy offload enhancements.
These specific images were manually copied over from a previous OpenStack Queens cloud that used the same NetApp filer. To avoid needing to perform data copies the scripted import process went like this:
* Create a new Cinder volume in our Wallaby cloud and grab the UUID.
* Issue an ONTAP command to have the NetApp perform a metadata-only operation to “copy” the data from the Queens Cinder volume over the top of the backend file of the new Wallaby Cinder volume created above.
I suspect your final comment is what’s happening. The default backend in Wallaby appears to be raw where our old Queens cloud was qcow2.
The way we did our import likely left us in this weird state. I would have assumed we could work around this by setting use_cow_images and force_raw_images appropriately,
use_cow_images and force_raw_images in nova.conf apply to Nova-managed storage only; they do not have any effect on Cinder storage.
or manually setting the metadata on the Cinder volume but neither have seemed to work.
Additionally, it’s a bit strange that the initial Nova boot succeeds and the problem appears to only be introduced after a live migration.
Thanks, -Robert
participants (3):
* Budden, Robert M. (GSFC-606.2)[InuTeq, LLC]
* Mohammed Naser
* Sean Mooney