In my setup of Cinder Zed + Ceph Quincy, under very high load there is a race condition caused by the low default timeout that OpenStack Kolla sets in cinder.conf (rados_connect_timeout = 5).
This can lead to a duplicate volume creation attempt on Ceph and to Cinder erroring out a VM build in Nova.
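In case it helps anyone hitting the same thing, this is roughly the override I am testing to raise that timeout. A minimal sketch only: it assumes the standard kolla-ansible merge path (/etc/kolla/config/cinder/cinder-volume.conf) and a backend section named [rbd-1], which is hypothetical and needs to match the name in your enabled_backends.

# /etc/kolla/config/cinder/cinder-volume.conf  (assumed kolla-ansible override path)
[rbd-1]
# section name is hypothetical; use the backend name from enabled_backends in your cinder.conf
# raise the librados connect timeout, or set -1 to disable it (the upstream Cinder default)
rados_connect_timeout = 30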
Nova:
nova.exception.BuildAbortException: Build of instance a529f7d5-5a81-44f6-a1fd-93caa39b2439 aborted: Volume b476094b-075e-4dc7-b63e-5bb58a0a9229 did not finish being created even after we waited 30 seconds or 4 attempts. And its status is error.
Cinder-Volume:
File "/var/lib/kolla/venv/lib/python3.9/site-packages/cinder/volume/drivers/rbd.py", line 230, in __init__\n self.volume = driver.rbd.Image(rados_ioctx,\n', ' File "rbd.pyx", line 2894, in rbd.Image.__init__\n', "rbd.Timeout: [errno 110] RBD operation timeout
(error opening image b'volume-b476094b-075e-4dc7-b63e-5bb58a0a9229' at snapshot None)\n"]
---
File "/var/lib/kolla/venv/lib/python3.9/site-packages/eventlet/tpool.py", line 132, in execute\n six.reraise(c, e, tb)\n', ' File "/var/lib/kolla/venv/lib/python3.9/site-packages/six.py", line 719, in reraise\n raise value\n', ' File "/var/lib/kolla/venv/lib/python3.9/site-packages/eventlet/tpool.py",
line 86, in tworker\n rv = meth(*args, **kwargs)\n', ' File "rbd.pyx", line 698, in rbd.RBD.clone\n', 'rbd.ImageExists: [errno 17] RBD image already exists (error creating clone)\n']
---
I can confirm that the volume now exists in Ceph but is no longer tracked by Cinder, leaving an orphaned image that eats up resources on top of the failed VM build.
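For reference, this is roughly how I confirmed the orphaned image and removed it by hand. A sketch only: it assumes the default "volumes" pool and the image name from the traceback, so verify the image really is untracked by Cinder before deleting anything.

# assumes the RBD pool is named 'volumes'; substitute your pool name
rbd -p volumes ls | grep b476094b-075e-4dc7-b63e-5bb58a0a9229
rbd info volumes/volume-b476094b-075e-4dc7-b63e-5bb58a0a9229
# only after double-checking it is not referenced by Cinder or by any snapshot/clone
rbd rm volumes/volume-b476094b-075e-4dc7-b63e-5bb58a0a9229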
This seems like behavior that should be accounted for in some way.