Cinder API timeout on single-control node
eblock at nde.ag
Fri Sep 10 07:54:37 UTC 2021
I enabled debug logs for cinder and restarted cinder-api, but then the
issue was not reproducible. Suddenly all API calls (even 'cinder list
--all') took less than a minute, so we haven't seen timeouts since
then. I turned off debug logs and am still waiting for this to
reoccur. In the meantime they also deleted 600 unused volumes, which
probably helped, too.
Anyway, I have now doubled the rpc_response_timeout in cinder.conf and
will wait until this happens again. I can't find anything like a
cinder wsgi timeout for a vhost, and there's no haproxy involved
because it's a single control node deployed with Chef and Crowbar.
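For reference, the change amounts to something like this in cinder.conf (the upstream default for rpc_response_timeout is 60 seconds; the doubled value is what I set, your mileage may vary):

```ini
# /etc/cinder/cinder.conf on the control node
[DEFAULT]
# Default is 60 seconds; doubled to give slow RPC replies more headroom
rpc_response_timeout = 120
```

The cinder services need a restart to pick this up.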
I'll report back if anything happens.
Zitat von Tony Liu <tonyliu0592 at hotmail.com>:
> The first issue is 504 timeout, update timeout in haproxy helps on that.
> The next is the timeout from cinder-api, helps on that.
> Then the next is rbd client timeout. I started "debug RBD timeout
> issue" thread for that.
> It seems that the root cause of this series of timeouts is Ceph.
> I followed comments from Konstantin to use msgr2 only.
> Hopefully that will fix the whole timeout issue.
>  https://bugzilla.redhat.com/show_bug.cgi?id=1930806
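For anyone hitting the same 504, the haproxy change Tony mentions would look roughly like this (the directive names are standard haproxy, but the values here are purely illustrative, not what he actually used):

```
# haproxy.cfg, illustrative values only
defaults
    # raise client/server inactivity timeouts so slow cinder-api
    # responses don't get cut off with a 504
    timeout client  5m
    timeout server  5m
```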
> From: Eugen Block <eblock at nde.ag>
> Sent: September 1, 2021 05:02 AM
> To: openstack-discuss at lists.openstack.org
> Subject: Cinder API timeout on single-control node
> Hi *,
> since your last responses were quite helpful regarding rabbitmq I
> would like to ask a different question for the same environment. It's
> an older openstack version (Pike) running with only one control node.
> There already were lots of volumes (way more than 1000) in that cloud,
> but after adding a bunch more (not sure how many exactly) in one
> project the whole cinder api became extremely slow. Both horizon and
> CLI run into timeouts:
> [Wed Sep 01 13:18:52.109178 2021] [wsgi:error] [pid 60440] [client
> <IP>:58474] Timeout when reading response headers from daemon process
> referer: http://<control>/project/volumes/
> [Wed Sep 01 13:18:53.664714 2021] [wsgi:error] [pid 13007] Not Found:
> Sometimes the volume creation succeeds if you just retry, but it often
> fails. The dashboard shows a "504 gateway timeout" after two minutes
> (or after four minutes, since I increased the timeout in the apache
> dashboard config).
> The timeout also occurs even when I open the volumes tab of an
> empty project.
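The "Timeout when reading response headers from daemon process" error above comes from mod_wsgi running under Apache; the relevant knobs in the Horizon vhost are roughly the following (the directive names are real Apache/mod_wsgi directives, but the values and the process/thread counts are assumptions for illustration):

```apache
# Horizon vhost, illustrative values
# Apache-level timeout for reading the response
Timeout 240
# mod_wsgi daemon mode: per-connection socket timeout
WSGIDaemonProcess horizon processes=3 threads=10 socket-timeout=240
```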
> A couple of weeks ago I already noticed some performance issues with
> the cinder API when there are lots of attached volumes; many
> "available" volumes don't seem to slow things down. But since
> then the total number of volumes has doubled. At the moment there are
> more than 960 attached volumes across all projects and more than 750
> detached volumes. I searched the cinder.conf for any helpful setting
> but I'm not sure which would actually help. And since it's a
> production cloud I would like to avoid restarting services all the
> time just to try something. Maybe some of you can point me in the
> right direction? I would appreciate any help!
> If there's more information I can provide just let me know.
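To get the attached/detached counts cited above without hammering the API, a query like this against the cinder database should work (assuming direct MySQL access; the `volumes` table with `status` and `deleted` columns is part of the standard cinder schema):

```sql
-- Count non-deleted volumes by status
-- ('in-use' = attached, 'available' = detached)
SELECT status, COUNT(*) AS n
FROM volumes
WHERE deleted = 0
GROUP BY status;
```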