Okay, glad you brought it back. I would be curious as well to understand what happened.

Quoting Satish Patel <satish.txt@gmail.com>:
Hi Eugen,
A shelve and unshelve brought the VM back to life. This is very odd and I haven't seen this behavior before.
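For reference, the rough sequence that cleared it (standard OpenStack CLI; the instance ID below is just the one from the nova-compute log, substitute your own):

  openstack server shelve 2de0f880-77c7-4d2c-9e01-898c57ad3693
  openstack server unshelve 2de0f880-77c7-4d2c-9e01-898c57ad3693

My guess is that unshelve rebuilds the libvirt domain from scratch, so it picks up the current Ceph connection info instead of whatever was there from before the cutover.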
On Thu, Feb 1, 2024 at 11:24 AM Eugen Block <eblock@nde.ag> wrote:
I’m not sure if I understand all of it, but is there currently only one active cluster? And is that where this output is from? What does ‘rbd status’ tell you?
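For example (just a sketch, using the image from the output you pasted below):

  rbd status volumes/volume-77b123ff-915f-4e0b-8d74-d34fde12528b

The watchers listed there should tell you whether some client still has the image open.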
Quoting Satish Patel <satish.txt@gmail.com>:
The older Ceph cluster is down: once everything came up on the new one we shut down the entire old cluster, and then realized one VM is stuck in this error state. In the current cluster this is what it's showing:
# rbd info -p volumes volume-77b123ff-915f-4e0b-8d74-d34fde12528b
rbd image 'volume-77b123ff-915f-4e0b-8d74-d34fde12528b':
    size 120 GiB in 30720 objects
    order 22 (4 MiB objects)
    snapshot_count: 0
    id: 87bfb47beb93
    block_name_prefix: rbd_data.87bfb47beb93
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, journaling
    op_features:
    flags:
    create_timestamp: Sun Jan 28 05:28:30 2024
    access_timestamp: Thu Feb  1 15:28:57 2024
    modify_timestamp: Thu Feb  1 06:17:30 2024
    journal: 87bfb47beb93
    mirroring state: enabled
    mirroring mode: journal
    mirroring global id: 0d488c59-cd44-47a8-86b7-c24509f7771b
    mirroring primary: true
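In case it helps, I can also pull the per-image mirror state with something like

  rbd mirror image status volumes/volume-77b123ff-915f-4e0b-8d74-d34fde12528b

and compare it with one of the volumes whose VM booted fine.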
On Thu, Feb 1, 2024 at 3:44 AM Eugen Block <eblock@nde.ag> wrote:
Hi,
have you compared the affected rbd images with working images? Maybe the mirroring failed for those images? Were they promoted correctly? Which mirror mode are you using, journal or snapshot? I would check the 'rbd info pool/image' output and compare to see if there's a difference.
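If one of the affected images turns out not to be primary on the new cluster, promoting it would be roughly (sketch only, adjust pool/image names):

  rbd mirror image promote volumes/<image>

or with --force if the old cluster can't be reached to demote the image first.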
Quoting Satish Patel <satish.txt@gmail.com>:
Folks,
I have a Ceph cluster and recently configured rbd-mirror to replicate all data to a remote Ceph cluster for disaster recovery.
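(For context, the mirroring setup is essentially the standard journal-based one; roughly, with the site names and token path being placeholders:

  rbd mirror pool enable volumes journal
  rbd mirror pool peer bootstrap create --site-name site-a volumes > bootstrap_token
  rbd mirror pool peer bootstrap import --site-name site-b volumes bootstrap_token

the import being run on the remote cluster, which also runs the rbd-mirror daemon.)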
Yesterday, as a POC, we did a hard cutover on Ceph and pointed OpenStack to the new cluster. All other VMs came back up fine, but 2 VMs are stuck in an error state with this in the libvirt logs:
2024-01-31 22:44:37.591+0000: 474597: error : qemuMonitorJSONCheckErrorFull:399 : internal error: unable to execute QEMU command 'query-named-block-nodes': cannot read image start for probe: Permission denied
If it were a wider issue it should impact all the VMs, so why are only two VMs stuck and not starting, with libvirt giving me this error in the logs?
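One thing I still want to rule out on the new cluster is the cephx caps of the client that nova/libvirt uses (client.cinder is just the typical name, substitute whatever user your deployment uses):

  ceph auth get client.cinder

in case the 'Permission denied' from the probe is simply caps that don't cover the volumes pool.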
The nova-compute logs also show the same error:
2024-01-31 22:44:40.925 7 INFO nova.compute.manager [None req-b33485b5-8740-48ae-8b5b-a440de3f11a4 c48fcfb9347f413f92fcece065644b00 ca5c652478c7429e964257990800e9cb - - default default] [instance: 2de0f880-77c7-4d2c-9e01-898c57ad3693] Successfully reverted task state from powering-on on failure for instance.
2024-01-31 22:44:40.944 7 ERROR oslo_messaging.rpc.server [None req-b33485b5-8740-48ae-8b5b-a440de3f11a4 c48fcfb9347f413f92fcece065644b00 ca5c652478c7429e964257990800e9cb - - default default] Exception during message handling: libvirt.libvirtError: internal error: unable to execute QEMU command 'query-named-block-nodes': cannot read image start for probe: Permission denied
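I'm also going to check whether the affected images still carry a stale exclusive lock from before the cutover, roughly:

  rbd lock ls volumes/volume-77b123ff-915f-4e0b-8d74-d34fde12528b

and remove any leftover lock with 'rbd lock rm' if one shows up.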