Hello community,

Thank you for your explanation and suggestions.

Initially, we suspected that the issue was related to Ceph. However, after further testing in our lab environment where both Ceph and OpenStack are deployed together, we observed that the problem appears specifically when a compute node goes down or reboots.

In this scenario, the instances that were running on the affected compute node experience persistent I/O errors inside the guest, even after the Ceph cluster regains quorum and becomes healthy again.
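For context, the first recovery step we attempt in this situation is a hard reboot, which re-creates the QEMU process and makes the guest reopen its RBD images (the server ID and device path below are placeholders):

```shell
# Re-create the guest process so it reopens its RBD-backed disks
openstack server reboot --hard <server-id>

# If I/O errors persist after the reboot, check the guest filesystem
# from a rescue environment (example assumes an ext4 root on /dev/vda1)
fsck -f /dev/vda1
```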

We also tested instance evacuation to another compute node, but the evacuated instance still encounters the same I/O errors.
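For reference, the evacuation was performed roughly as follows (a sketch; the host name `failed-compute-01` and the server ID are placeholders):

```shell
# Verify that the failed compute node is reported down
openstack compute service list --service nova-compute

# Force the failed host down so Nova permits evacuation
openstack compute service set --down failed-compute-01 nova-compute

# Rebuild the instance on another host, reusing its RBD root disk
openstack server evacuate <server-id>
```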

From our observations, the issue seems to occur only when the compute node failure interrupts ongoing RBD I/O operations, and the instance does not recover properly afterward.
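One hypothesis on our side is that a stale RBD exclusive lock held by the dead compute node keeps blocking writes until that client is blocklisted. A diagnostic sketch (the pool name `vms` and the image names are placeholders):

```shell
# Overall cluster health after the node failure
ceph status

# Show watchers and locks on the instance's root disk
rbd status vms/<instance-uuid>_disk
rbd lock list vms/<instance-uuid>_disk

# If a lock held by the dead client remains, it can be removed manually;
# <lock-id> and <locker> come from the "rbd lock list" output
rbd lock remove vms/<instance-uuid>_disk <lock-id> <locker>
```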

Could you please advise on the recommended recovery procedure for such situations?

Specifically:

- How can we safely retrieve and restore an affected instance once Ceph regains quorum?
- What is the best practice for recovering RBD-backed instances after write failures, so as to avoid permanent corruption?

Any guidance would be greatly appreciated.

Best regards,
Thamanna Farhath





From: Eugen Block <eblock@nde.ag>
To: <openstack-discuss@lists.openstack.org>
Date: Wed, 04 Mar 2026 02:10:12 +0530
Subject: Re: [Nova][vms]Preventing VM I/O Errors When Ceph OSD Nodes Go Down

And one more thought on this: if your Ceph cluster really was
inaccessible after a single node failure, then there's a big knowledge
gap in how to configure Ceph in a professional way. If that is really
the case, I recommend hiring a consultant to inspect your cluster setup.

Zitat von Kees Meijs | Nefos <keesm@nefos.com>:

> Hello Thamanna,
>
> The scenario you described implies your Ceph cluster is so broken
> that it is simply unable to serve any I/O anymore.
>
> Your virtual machine workload literally experiences the storage
> being taken away. There is no remedy to cope with that.
>
> So, if you're asking about the best practice to recover from such
> issues: make backups (not snapshots; those are not backups) and
> periodically test that you can restore them.
>
> Meanwhile, as mentioned before, I'd suggest understanding why the
> Ceph nodes were stuck in the first place.
>
> Cheers,
> Kees
>
> __
>
> Kees Meijs BICT
>
> Nefos Cloud & IT <https://nefos.com/contact>
>
> Nefos IT bv
> Burgemeester Mollaan 34a
> 5582 CK Waalre - NL
> kvk 66494931
>
> +31 (0)88 2088 188 <tel:+31882088188>
> nefos.com <https://nefos.com/contact>
>
>
>
> On 03/03/2026 04:57, Thamanna Farhath wrote:
>>
>> Thank you for your clarification. We understand that this behavior
>> is by design in Ceph and that OpenStack Nova will not automatically
>> take action when storage becomes unavailable.
>>
>> However, in our case, simply rebooting the affected VMs is not
>> always sufficient. If a crash occurs and persistent I/O errors are
>> seen inside the guest, we would like to understand the recommended
>> recovery procedure.
>>
>> In such scenarios, how can we safely retrieve and restore the
>> instance once Ceph regains quorum?
>> What is the best practice to recover RBD-backed instances after
>> write failures to avoid permanent corruption?






