Dear Community,
We are running OpenStack 2023.1 with Ceph as the backend storage on a 3-node deployment.
Recently, we faced a scenario where two of our servers became unresponsive (hung state), and we had to reboot them. During this time, VMs running on the affected compute nodes started reporting I/O errors inside the guest OS, such as:
[   33.911093] blk_update_request: I/O error, dev vda, sector 229880 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
[   33.914953] Buffer I/O error on dev vda1, logical block 319, lost async page write
[   33.914953] Buffer I/O error on dev vda1, logical block 320, lost async page write
[   33.927594] blk_update_request: I/O error, dev vda, sector 229904 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
It appears that when Ceph becomes unavailable (or quorum is lost), the VMs continue attempting writes, which results in I/O errors at the guest OS level.
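One partial mitigation we are considering inside the guest itself is to have the filesystem drop to read-only as soon as it sees errors, rather than continuing to queue buffered writes. A minimal sketch, assuming an ext4 root on /dev/vda1 (the device names here are from our log above, but treat the exact mount line as an assumption for your own images):

```
# /etc/fstab inside the guest (sketch)
# errors=remount-ro makes ext4 remount the filesystem read-only when it
# detects an error, limiting further damage while the backend is down.
/dev/vda1  /  ext4  defaults,errors=remount-ro  0  1
```

This only reacts after errors have already reached the guest, so it does not replace pausing the VM at the hypervisor level, but it narrows the window for corruption.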
We are looking for guidance on how to:
- Pause or block writes from active VMs when Ceph storage is unavailable
- Avoid guest OS filesystem corruption
- Ensure safer recovery once Ceph services are restored
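For the hypervisor side, one direction we have been looking at is libvirt's per-disk error policy: with error_policy='stop' (and rerror_policy='stop' for reads), QEMU pauses the guest when an I/O request fails instead of reporting the error into the guest, and the VM can be resumed once the backend is healthy again. A sketch of the relevant disk element in the domain XML (the pool/image name and monitor host below are placeholders, and how best to set this through Nova rather than by hand-editing the domain is exactly what we are unsure about):

```xml
<!-- Sketch: RBD-backed disk with stop-on-error policy (virsh edit <domain>) -->
<disk type='network' device='disk'>
  <!-- error_policy/rerror_policy: pause the guest on write/read errors -->
  <driver name='qemu' type='raw' error_policy='stop' rerror_policy='stop'/>
  <source protocol='rbd' name='vms/instance-disk'>  <!-- placeholder image -->
    <host name='ceph-mon1' port='6789'/>            <!-- placeholder monitor -->
  </source>
  <target dev='vda' bus='virtio'/>
</disk>
```

After Ceph recovers, the paused instance could then be resumed (e.g. virsh resume) so the guest retries the in-flight writes rather than having already failed them. Is this policy something that can be applied cleanly across an OpenStack 2023.1 deployment, or is there a better-supported mechanism?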