I have not tried disabling it. When things settle down I will be increasing the timeout to something like 30 to 60 seconds, depending on testing.
I was mostly surprised that Ceph still created the volume while Cinder didn't account for it and just bailed on the whole transaction.
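
For reference, the change I'm planning is just an override of the RBD
backend option in cinder.conf, roughly like this (the backend section
name and the /etc/kolla/config/cinder.conf override path are
assumptions for my kolla-ansible deployment, and the exact value will
come out of testing):

# /etc/kolla/config/cinder.conf (merged into the generated cinder.conf on reconfigure)
[rbd-1]
# raise the RADOS connection timeout from kolla's 5 second default
# to somewhere in the 30-60 second range
rados_connect_timeout = 30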


From: Eugen Block <eblock@nde.ag>
Sent: Monday, November 13, 2023 3:48 AM
To: openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org>
Subject: Re: [ops][cinder][kolla] Cinder + Ceph under high load can lead to timeouts and a failure state.
 
Hi,

looks like that change was added quite some time ago [1], but I 
didn't really check in detail if there has been more work on that. 
But apparently, having no timeout (rados_connect_timeout = -1) also 
caused problems with cinder-volume. We use the cinder.conf default 
value, which also disables the timeout, and that still seems to be 
the case in Bobcat [2].
Maybe there are specific circumstances that require a timeout; the 
bug report [3] doesn't have too many details about the setup. Have 
you tried disabling it to verify it would work for you?
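
For reference, disabling it would just mean setting the option back 
to the upstream default in the RBD backend section of cinder.conf 
(the section name below is only an example, use whatever your 
backend is called):

[rbd-1]
# a negative value disables the timeout (the upstream cinder default)
rados_connect_timeout = -1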

Regards,
Eugen

[1] 
https://opendev.org/openstack/kolla-ansible/commit/29a4b1996129c97b096637969dc3d1308399fda4
[2] 
https://docs.openstack.org/cinder/latest/configuration/block-storage/samples/cinder.conf.html
[3] https://bugs.launchpad.net/kolla-ansible/+bug/1676267

Quoting Forrest Fuqua <fffics@rit.edu>:

> In my setup of Cinder Zed + Ceph Quincy, under very high load there 
> is a race condition caused by the low default timeout that 
> OpenStack Kolla sets in cinder.conf: rados_connect_timeout = 5
>
> This can lead to a duplicate volume being created on Ceph and 
> Cinder erroring out the VM build in Nova.
>
> Nova:
> nova.exception.BuildAbortException: Build of instance 
> a529f7d5-5a81-44f6-a1fd-93caa39b2439 aborted: Volume 
> b476094b-075e-4dc7-b63e-5bb58a0a9229 did not finish being created 
> even after we waited 30 seconds or 4 attempts. And its status is 
> error.
>
> Cinder-Volume:
>   File "/var/lib/kolla/venv/lib/python3.9/site-packages/cinder/volume/drivers/rbd.py", line 230, in __init__
>     self.volume = driver.rbd.Image(rados_ioctx,
>   File "rbd.pyx", line 2894, in rbd.Image.__init__
> rbd.Timeout: [errno 110] RBD operation timeout (error opening image b'volume-b476094b-075e-4dc7-b63e-5bb58a0a9229' at snapshot None)
> ---
>   File "/var/lib/kolla/venv/lib/python3.9/site-packages/eventlet/tpool.py", line 132, in execute
>     six.reraise(c, e, tb)
>   File "/var/lib/kolla/venv/lib/python3.9/site-packages/six.py", line 719, in reraise
>     raise value
>   File "/var/lib/kolla/venv/lib/python3.9/site-packages/eventlet/tpool.py", line 86, in tworker
>     rv = meth(*args, **kwargs)
>   File "rbd.pyx", line 698, in rbd.RBD.clone
> rbd.ImageExists: [errno 17] RBD image already exists (error creating clone)
>
> ---
> I can confirm that the volume does exist in Ceph but is no longer 
> tracked in Cinder, leaving a leftover artifact that eats up 
> resources, on top of the failed VM build.
> This seems like behavior that should be accounted for in some way.
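>
> For anyone who wants to double-check on their own cluster, this is 
> roughly how the check can be done with the python rbd bindings (the 
> pool name, client name and ceph.conf path below are assumptions 
> from my setup; treat it as a sketch):
>
> import rados
> import rbd
>
> VOLUME_ID = 'b476094b-075e-4dc7-b63e-5bb58a0a9229'  # the volume from the traceback above
>
> # connect with the same ceph credentials cinder-volume uses (assumption: client.cinder)
> cluster = rados.Rados(conffile='/etc/ceph/ceph.conf', name='client.cinder')
> cluster.connect()
> try:
>     with cluster.open_ioctx('volumes') as ioctx:  # assumption: the cinder pool is 'volumes'
>         images = rbd.RBD().list(ioctx)
>         # True here, while Cinder no longer tracks the volume, is the leftover state above
>         print(('volume-%s' % VOLUME_ID) in images)
> finally:
>     cluster.shutdown()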