Dear Community,

Thank you for your clarification. We understand that this behavior is by design in Ceph and that OpenStack Nova will not automatically take action when storage becomes unavailable.

However, in our case simply rebooting the affected VMs is not always sufficient: when a guest has crashed, or persistent I/O errors remain visible inside the guest OS, we would like to understand the recommended recovery procedure.
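For reference, the reboot we attempt today is roughly a hard reboot through Nova once Ceph reports healthy again; a sketch of those steps, assuming the standard `ceph` and `openstack` CLIs (`<instance-uuid>` is a placeholder):

```shell
# Wait for the cluster to regain quorum and report healthy
ceph health                                      # expect HEALTH_OK

# Hard-reboot the affected instance through the Nova API
openstack server reboot --hard <instance-uuid>

# Confirm the instance comes back
openstack server show <instance-uuid> -c status
```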

In such scenarios, how can we safely retrieve and restore the instance once Ceph regains quorum?
What is the best practice to recover RBD-backed instances after write failures to avoid permanent corruption?

Thamanna Farhath
Associate Engineer - R&D
9344093591
thamanna.f@zybisys.com

From: Sean Mooney <smooney@redhat.com>
To: <openstack-discuss@lists.openstack.org>
Date: Wed, 25 Feb 2026 18:20:07 +0530
Subject: Re: [Nova][vms]Preventing VM I/O Errors When Ceph OSD Nodes Go Down

It is also intentional that the guest sees the I/O errors: that is the
back-pressure/error-reporting mechanism that Ceph and Linux are intended
to provide to userspace applications.

Nova does not and will not monitor the Ceph status for you, and it will
not pause the VMs if Ceph goes down; that is by design. Nova will not
take actions like that on a VM unless requested to do so via its API.
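For what it's worth, such an explicit, operator-initiated request would be something like a pause/unpause through the Nova API, e.g. with the openstack CLI (`<instance-uuid>` is a placeholder):

```shell
# Explicitly ask nova to pause the VM while storage is unavailable ...
openstack server pause <instance-uuid>

# ... and resume it once storage is back
openstack server unpause <instance-uuid>
```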

On 25/02/2026 08:05, Kees Meijs | Nefos wrote:
> Hi Thamanna,
>
> Ceph is built to ensure you do not lose any data, so assuming your
> pools are configured with three-way replication, after losing two
> nodes your PGs should be unavailable. This is by design, and for your
> own safety.
>
> I understand it is not what you're asking for, but my two cents would
> be to understand why the Ceph nodes hung in the first place and act
> upon it (fix hardware, maybe upgrade firmware, maybe fix software if
> applicable). Or in addition extend your Ceph cluster, if there's
> budget for that. (Always go for dual power supplies and don't even
> consider non-ECC RAM.)
>
> Cheers,
> Kees
>
> On 25/02/2026 05:08, Thamanna Farhath wrote:
>>
>> Dear Community,
>>
>> We are running OpenStack 2023.1 with Ceph as the backend storage on a
>> 3-node deployment.
>>
>> Recently, we faced a scenario where two of our servers became
>> unresponsive (hung state), and we had to reboot them. During this
>> time, VMs running on the affected compute node started reporting I/O
>> errors inside the guest OS, such as:
>>
>> [   33.911093] blk_update_request: I/O error, dev vda, sector 229880 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
>> [   33.914953] Buffer I/O error on dev vda1, logical block 319, lost async page write
>> [   33.914953] Buffer I/O error on dev vda1, logical block 320, lost async page write
>> [   33.927594] blk_update_request: I/O error, dev vda, sector 229904 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
>>
>> It appears that when Ceph becomes unavailable (or quorum is lost),
>> the VMs continue attempting writes, which results in I/O errors at
>> the guest OS level.
>>
>> Our goal:
>> We would like to prevent guest filesystem corruption or I/O errors
>> when Ceph is down. Ideally, we want to:
>>
>> * Pause or block writes from active VMs when Ceph storage is unavailable
>> * Avoid guest OS filesystem corruption
>> * Ensure safer recovery when Ceph services are restored
>

Disclaimer: The content of this email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to which they are addressed. If you have received this email in error, please notify the sender and remove the messages from your system. If you are not the named addressee, it is strictly forbidden for you to share, circulate, distribute or copy any part of this e-mail to any third party without the written consent of the sender.

E-mail transmission cannot be guaranteed to be secure or error free as information could be intercepted, corrupted, lost, destroyed, arrive late, incomplete, or may contain viruses. Therefore, we do not accept liability for any errors or omissions in the contents of this message, which arise as a result of e-mail transmission. The recipient should check this e-mail and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email.