[nova][libvirtd] internal error: unable to execute QEMU command 'query-named-block-nodes'

Satish Patel <satish.txt@gmail.com> wrote:

Folks,

I have a Ceph cluster and recently configured rbd-mirror to replicate all data to a remote Ceph cluster for disaster recovery. Yesterday, as a POC, we did a hard cutover on Ceph and pointed OpenStack at the new cluster. All other VMs came back up fine, but 2 VMs are stuck in this error state in the libvirt logs:

2024-01-31 22:44:37.591+0000: 474597: error : qemuMonitorJSONCheckErrorFull:399 : internal error: unable to execute QEMU command 'query-named-block-nodes': cannot read image start for probe: Permission denied

If it were a wider issue it would impact all the VMs, so why are only these two VMs stuck and failing to start with this error in the libvirt logs?

The nova-compute logs show the same error:

2024-01-31 22:44:40.925 7 INFO nova.compute.manager [None req-b33485b5-8740-48ae-8b5b-a440de3f11a4 c48fcfb9347f413f92fcece065644b00 ca5c652478c7429e964257990800e9cb - - default default] [instance: 2de0f880-77c7-4d2c-9e01-898c57ad3693] Successfully reverted task state from powering-on on failure for instance.
2024-01-31 22:44:40.944 7 ERROR oslo_messaging.rpc.server [None req-b33485b5-8740-48ae-8b5b-a440de3f11a4 c48fcfb9347f413f92fcece065644b00 ca5c652478c7429e964257990800e9cb - - default default] Exception during message handling: libvirt.libvirtError: internal error: unable to execute QEMU command 'query-named-block-nodes': cannot read image start for probe: Permission denied
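
A first thing worth checking for a "Permission denied" during the QEMU image probe on an RBD-backed disk is whether the cephx key that nova/libvirt uses against the new cluster has caps on the pools involved, and whether the libvirt secret on the compute node still matches that key. A rough sketch; the client name client.cinder is only an assumption, substitute whatever your deployment uses:

# ceph auth get client.cinder
# virsh secret-list
# virsh secret-get-value <libvirt-secret-uuid>

The caps line should cover the volume/ephemeral pools (e.g. "profile rbd pool=volumes"), and the secret value should match the key that exists on the new cluster.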

On Thu, Feb 1, 2024 at 3:44 AM Eugen Block <eblock@nde.ag> wrote:

Hi,

have you compared the affected RBD images with working images? Maybe the mirroring failed for those images? Were they promoted correctly? Which mirror mode are you using, journal or snapshot? I would check the 'rbd info pool/image' output and compare to see if there's a difference.
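
A comparison along those lines could look like this; the image names are placeholders and 'volumes' is the pool used elsewhere in this thread:

# rbd info volumes/<affected-volume>
# rbd info volumes/<working-volume>
# rbd mirror image status volumes/<affected-volume>

Things to compare would be the feature list, the mirroring mode, and whether "mirroring primary" is true on the cluster you just cut over to.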

Satish Patel <satish.txt@gmail.com> wrote:

The older Ceph cluster is down; since everything came up, we shut the entire old cluster down, and then realized one VM is stuck in this error state. On the current cluster this is what it shows:

# rbd info -p volumes volume-77b123ff-915f-4e0b-8d74-d34fde12528b
rbd image 'volume-77b123ff-915f-4e0b-8d74-d34fde12528b':
        size 120 GiB in 30720 objects
        order 22 (4 MiB objects)
        snapshot_count: 0
        id: 87bfb47beb93
        block_name_prefix: rbd_data.87bfb47beb93
        format: 2
        features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, journaling
        op_features:
        flags:
        create_timestamp: Sun Jan 28 05:28:30 2024
        access_timestamp: Thu Feb 1 15:28:57 2024
        modify_timestamp: Thu Feb 1 06:17:30 2024
        journal: 87bfb47beb93
        mirroring state: enabled
        mirroring mode: journal
        mirroring global id: 0d488c59-cd44-47a8-86b7-c24509f7771b
        mirroring primary: true

On Thu, Feb 1, 2024 at 11:24 AM Eugen Block <eblock@nde.ag> wrote:

I'm not sure if I understand all of it, but there currently is only one cluster active? And that's where this output is from? What does 'rbd status' tell you?
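
On the currently active cluster, that status check (and the mirroring view of the same image) might look like this:

# rbd status volumes/volume-77b123ff-915f-4e0b-8d74-d34fde12528b
# rbd mirror image status volumes/volume-77b123ff-915f-4e0b-8d74-d34fde12528b
# rbd lock list volumes/volume-77b123ff-915f-4e0b-8d74-d34fde12528b

'rbd status' lists the current watchers, and the lock listing can reveal whether a stale exclusive lock is still held from before the cutover; both are just diagnostic suggestions.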

Satish Patel <satish.txt@gmail.com> wrote:

Hi Eugen,

A shelve and unshelve brought the VM back to life. This is very odd and I haven't seen this behavior before.
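
For reference, that recovery presumably amounts to something like the following (the UUID is the instance from the logs above; wait for the status to reach SHELVED_OFFLOADED before unshelving). Unshelving respawns the guest with a freshly generated libvirt definition, which may be why it cleared whatever stale disk/auth state the domain was carrying:

# openstack server shelve 2de0f880-77c7-4d2c-9e01-898c57ad3693
# openstack server show 2de0f880-77c7-4d2c-9e01-898c57ad3693 -c status
# openstack server unshelve 2de0f880-77c7-4d2c-9e01-898c57ad3693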

On Fri, Feb 2, 2024 at 2:44 AM Eugen Block <eblock@nde.ag> wrote:

Okay, glad you brought it back. I would be curious as well to understand what happened.

Satish Patel <satish.txt@gmail.com> wrote:

Hi Eugen,

I have noticed one more behavior. When I launch an instance from an image it works fast, but when I boot from a volume it takes a very long time; it feels like it's converting or decompressing the image or something, even though it's a RAW image.

If I upload a fresh new image, it works quickly whether I boot from image or from volume. It seems rbd-mirror replication changed some image properties that OpenStack doesn't understand, causing an image download and re-upload just like with qcow2.

I will look into it and update you on what is going on. If you have any clue please let me know.
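
One possible explanation for that behaviour (an assumption, not something confirmed in this thread): when Cinder cannot RBD-clone the Glance image directly, it falls back to downloading the image and writing it into the new volume, which is far slower. Whether the clone path was taken can be checked roughly like this, assuming Glance stores images in an 'images' pool with its usual protected snapshot named 'snap':

# openstack image show <image-id> -c disk_format -c container_format
# rbd info volumes/volume-<new-volume-id> | grep parent
# rbd children images/<image-id>@snap

A cloned volume shows a "parent:" line pointing at the Glance image; the clone path also requires show_image_direct_url = True in glance-api.conf.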

Eugen Block <eblock@nde.ag> wrote:

Hi,

sorry for the delay. I can't imagine the rbd-mirror changing image properties, but who knows. You would see it in the nova-compute logs if it downloaded and re-uploaded an image to RBD, and you would also see the download in /var/lib/nova/instances/_base on the compute node. Maybe enable debug logs to see more info about why it takes so long to launch an instance from the specific image. Maybe someone updated the image properties within Glance or something?
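
A minimal sketch of those checks; the config path and service name below are assumptions and vary by distribution and deployment. In /etc/nova/nova.conf on the compute node:

[DEFAULT]
debug = True

then restart the service and look for a downloaded base image:

# systemctl restart nova-compute
# ls -lh /var/lib/nova/instances/_base

With debug enabled, the nova-compute log (and, for volume-backed boots, the cinder-volume log) should show whether an image fetch/convert is happening instead of an RBD clone.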

hamid.lotfi@gmail.com wrote:

Hello friends,

I faced the same issue. I compared the primary and the secondary but did not find any difference.
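
If both clusters are reachable from one admin host, the two sides can be compared directly with the --cluster option (which reads /etc/ceph/<name>.conf); 'site-a' and 'site-b' are placeholder cluster names:

# rbd --cluster site-a mirror pool status volumes --verbose
# rbd --cluster site-b mirror pool status volumes --verbose

After a cutover, exactly one side should report each image as primary.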

Participants (3)

- Eugen Block
- hamid.lotfi@gmail.com
- Satish Patel