Re: [Nova][VMs] Persistent I/O Errors in RBD-backed Instances After Compute Node Failure
Hi Eugen,

Thank you for the guidance. The solution worked, and the instance was able to start successfully again after removing the stale lock and blocklisting the old client.

I have one follow-up question regarding operations in larger environments. In our production setup we run 100+ VMs, and handling these situations manually (checking RBD status, blocklisting clients, and removing locks) would be difficult if multiple hosts or instances are affected. Is there any OpenStack service or recommended approach that automatically handles stale RBD locks when a compute host goes down or during evacuation? For example, does Nova/libvirt handle lock cleanup during recovery, or are there any Ceph/OpenStack best practices or automation mechanisms to avoid manual intervention in such cases?

Thanks again for your help.

Best regards,
Thamanna

From: Kevin Le coq <kevin.lecoq@protonmail.ch>
To: "Thamanna Farhath" <thamanna.f@zybisys.com>
Cc: "openstack-discuss" <openstack-discuss@lists.openstack.org>
Date: Sat, 07 Mar 2026 21:47:04 +0530
Subject: Re: [Nova][VMs] Persistent I/O Errors in RBD-backed Instances After Compute Node Failure

Hello,

From experience, this type of behavior can sometimes be related to a Ceph keyring configuration issue. In particular, if the Ceph client used by Nova does not have the appropriate capabilities, Nova may fail to properly reacquire the lock on the RBD image after the compute node goes down or reboots unexpectedly. In such cases, even after the compute node returns or the instance is evacuated to another node, the instance may continue to experience persistent I/O errors inside the guest.

Could you please verify that the Ceph keyring used by Nova has the correct RBD profile enabled for the pool used by your instances?
For example:

    [client.openstack]
        key = "<redacted>"
        caps mgr = "allow *"
        caps mon = "profile rbd"
        caps osd = "profile rbd pool=vms"

Ensuring that the client has the proper "profile rbd" permissions on the relevant pool allows Nova to correctly manage and recover RBD locks when instances are restarted or moved to another compute node. It may also be helpful to check for stale locks on the affected RBD images and to verify that all compute nodes are using the same Ceph client configuration.

Best regards,
Kevin Le coq

On Saturday, 7 March 2026 at 15:49, Thamanna Farhath <thamanna.f@zybisys.com> wrote:

Hello,

We are currently investigating an issue with OpenStack (2023.1) instances using Ceph RBD storage. After further testing in a lab environment, we observed that the problem occurs specifically when a compute node goes down or reboots unexpectedly. In this situation, the instances that were running on the affected compute node later experience persistent I/O errors inside the guest OS. Even after the compute node comes back online, the instance is usually in a SHUTOFF state, and when we try to start it again, the same I/O errors appear inside the guest. We also attempted instance evacuation to another compute node, but the evacuated instance still shows the same I/O errors.

We would like to understand the recommended approach in such cases:

• What is the best practice to recover RBD-backed instances after a compute node failure when persistent I/O errors appear inside the guest?
• Are there any recommended configurations or operational procedures in OpenStack or Ceph to prevent or mitigate this situation?

Any suggestions or guidance would be greatly appreciated.

Best regards,
Thamanna Farhath

Disclaimer: The content of this email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to which they are addressed.
If you have received this email in error, please notify the sender and remove the messages from your system. If you are not the named addressee, it is strictly forbidden for you to share, circulate, distribute or copy any part of this e-mail to any third party without the written consent of the sender. E-mail transmission cannot be guaranteed to be secure or error free as information could be intercepted, corrupted, lost, destroyed, arrive late, incomplete, or may contain viruses. Therefore, we do not accept liability for any errors or omissions in the contents of this message which arise as a result of e-mail transmission. The recipient should check this e-mail and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email.
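For reference, the manual recovery steps discussed in this thread (verifying keyring caps, checking for stale locks, blocklisting the dead client, removing the lock) can be sketched as the commands below. This is only an illustrative sketch: the pool name "vms" comes from the keyring example above, while the image name, lock ID, locker, and client address are placeholders that must be taken from the output of the "rbd status" and "rbd lock ls" commands in your own environment.

```shell
#!/bin/sh
# Illustrative sketch of the manual RBD lock recovery steps from this thread.
# Assumptions: pool "vms" (from the keyring example); <IMAGE>, <ADDR>,
# <LOCK_ID> and <LOCKER> are placeholders, not real names.

POOL=vms
IMAGE=<IMAGE>          # e.g. the instance's disk image name in the pool

# 1. Verify the caps of the Ceph client Nova uses
#    (should show "profile rbd" on the instances pool).
ceph auth get client.openstack

# 2. Check watchers and stale locks on the affected image.
rbd status "$POOL/$IMAGE"
rbd lock ls "$POOL/$IMAGE"

# 3. Blocklist the dead client so it can no longer write, then remove
#    its stale lock. Take <ADDR> (ip:port/nonce) from "rbd status", and
#    <LOCK_ID> plus <LOCKER> (e.g. client.12345) from "rbd lock ls".
ceph osd blocklist add <ADDR>
rbd lock rm "$POOL/$IMAGE" <LOCK_ID> <LOCKER>

# 4. With many VMs, a read-only scan of the pool can surface images that
#    still hold locks before deciding what to clean up:
for img in $(rbd ls "$POOL"); do
    locks=$(rbd lock ls "$POOL/$img")
    [ -n "$locks" ] && echo "$img: $locks"
done
```

Note that on older Ceph releases the blocklist subcommand is spelled "ceph osd blacklist add"; the effect is the same.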