Dear Community,
We are running OpenStack 2023.1 with a 3-node Ceph cluster.
Recently, one Ceph node became unresponsive, resulting in quorum loss; as expected, the VMs experienced I/O errors:
[ 33.911093] blk_update_request: I/O error, dev vda, sector 229880 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
[ 33.914953] Buffer I/O error on dev vda1, logical block 319, lost async page write
[ 33.914953] Buffer I/O error on dev vda1, logical block 320, lost async page write
[ 33.927594] blk_update_request: I/O error, dev vda, sector 229904 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
However, our concern is recovery. In some cases, simply rebooting the VM is not sufficient, and I/O errors persist even after Ceph regains quorum.
What is the recommended best practice for safely recovering RBD-backed instances in such scenarios? How can we bring affected VMs back online while minimizing filesystem corruption?
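For context, the kind of recovery sequence we have been trying looks roughly like the following (a sketch only; `INSTANCE` and the pool/image name `vms/INSTANCE_disk` are placeholders, and the real image name should be looked up from the Nova/Cinder volume ID):

```shell
# 1. Confirm Ceph has actually regained quorum and is usable again.
ceph status
ceph health detail

# 2. Check whether the RBD image still holds a watcher from the dead node.
#    "vms/INSTANCE_disk" is a placeholder; substitute the real pool/image name.
rbd status vms/INSTANCE_disk

# 3. Hard-reboot the instance so QEMU reopens the RBD image.
openstack server reboot --hard INSTANCE

# 4. If I/O errors persist after reboot, boot into rescue mode and fsck
#    the guest filesystem (the original disk typically appears as /dev/vdb
#    inside the rescue instance).
openstack server rescue INSTANCE
# ...inside the rescue instance: fsck -y /dev/vdb1
openstack server unrescue INSTANCE
```

Is this roughly the right order, or should stale RBD locks (visible via `rbd lock ls`) be cleared before the hard reboot?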