Dear Community,
We are running OpenStack 2023.1 with a 3-node Ceph cluster.
Recently, one Ceph node became unresponsive, resulting in quorum loss; as expected, the VMs experienced I/O errors:
[ 33.911093] blk_update_request: I/O error, dev vda, sector 229880 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
[ 33.914953] Buffer I/O error on dev vda1, logical block 319, lost async page write
[ 33.914953] Buffer I/O error on dev vda1, logical block 320, lost async page write
[ 33.927594] blk_update_request: I/O error, dev vda, sector 229904 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
However, our concern is recovery. In some cases, simply rebooting the VM is not sufficient, and I/O errors persist even after Ceph regains quorum.
What is the recommended best practice for safely recovering RBD-backed instances in such scenarios? How can we bring affected VMs back online while minimizing filesystem corruption?
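For context, the kind of recovery sequence we have been trying looks roughly like the following (a sketch only; `INSTANCE` and the pool/image name `vms/INSTANCE_disk` are placeholders, and the real image name should be looked up from the Nova/Cinder volume ID):

```shell
# 1. Confirm Ceph has actually regained quorum and is usable again.
ceph status
ceph health detail

# 2. Check whether the RBD image still holds a watcher from the dead node.
#    "vms/INSTANCE_disk" is a placeholder; substitute the real pool/image name.
rbd status vms/INSTANCE_disk

# 3. Hard-reboot the instance so QEMU reopens the RBD image.
openstack server reboot --hard INSTANCE

# 4. If I/O errors persist after reboot, boot into rescue mode and fsck
#    the guest filesystem (the original disk typically appears as /dev/vdb
#    inside the rescue instance).
openstack server rescue INSTANCE
# ...inside the rescue instance: fsck -y /dev/vdb1
openstack server unrescue INSTANCE
```

Is this roughly the right order, or should stale RBD locks (visible via `rbd lock ls`) be cleared before the hard reboot?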