Hello Thamanna,

The scenario you describe implies your Ceph cluster is so broken that it simply cannot serve any I/O any more. Your virtual machine workload literally experiences the storage being taken away; there is no remedy for that at the hypervisor level.

So, if you're asking about best practice for recovering from such issues: make backups (not snapshots; snapshots are not backups) and periodically test that you can actually restore them. Meanwhile, as mentioned before, I'd suggest finding out why the Ceph nodes got stuck in the first place.

Cheers,
Kees

__
Kees Meijs BICT
Nefos Cloud & IT <https://nefos.com/contact>
Nefos IT bv
Burgemeester Mollaan 34a
5582 CK Waalre - NL
kvk 66494931
+31 (0)88 2088 188 <tel:+31882088188>
nefos.com <https://nefos.com/contact>

On 03/03/2026 04:57, Thamanna Farhath wrote:
Thank you for your clarification. We understand that this behavior is by design in Ceph and that OpenStack Nova will not automatically take action when storage becomes unavailable.
However, in our case, simply rebooting the affected VMs is not always sufficient. If a crash occurs and persistent I/O errors are seen inside the guest, we would like to understand the recommended recovery procedure.
In such scenarios, how can we safely retrieve and restore the instance once Ceph regains quorum? What is the best practice to recover RBD-backed instances after write failures to avoid permanent corruption?
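P.S. As a footnote to the backup advice above, here is a minimal sketch of what an RBD-level backup/restore cycle could look like. The pool name `volumes` and the image name `instance-disk` are placeholders for illustration; substitute the names from your own environment, and note that a consistent backup really wants the guest quiesced (or at least a snapshot as the export source):

```shell
# First confirm the cluster is healthy and serving I/O again
ceph status
ceph health detail

# Take a snapshot so the export has a stable, point-in-time source
# (pool "volumes" and image "instance-disk" are example names)
rbd snap create volumes/instance-disk@backup1

# Export the snapshot to a file outside the cluster
rbd export volumes/instance-disk@backup1 /backup/instance-disk.img

# Clean up the snapshot once the export is done
rbd snap rm volumes/instance-disk@backup1

# Periodically test the restore path, e.g. into a scratch image,
# then boot a throwaway VM from it to verify the data is usable
rbd import /backup/instance-disk.img volumes/instance-disk-restored
```

This is deliberately simple; tooling such as `rbd export-diff`/`rbd import-diff` for incremental backups, or a proper backup product, is usually preferable in production, but the principle is the same: the copy must live outside the cluster, and the restore must be exercised regularly.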