Cinder API timeout on single-control node

Eugen Block eblock at nde.ag
Fri Sep 10 07:54:37 UTC 2021


Thanks, Tony.

I enabled debug logs for Cinder and restarted cinder-api, but then the  
issue was not reproducible. Suddenly all API calls (even 'cinder list  
--all') took less than a minute, so we haven't seen timeouts since  
then. I turned off debug logs and am still waiting for this to  
recur. In the meantime they also deleted 600 unused volumes, which  
probably helped, too.
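
For reference, enabling debug was just the standard oslo.log flag in  
cinder.conf; the restart command below is only what it looks like on  
this SUSE/Crowbar node and may differ on other deployments:

    # /etc/cinder/cinder.conf
    [DEFAULT]
    debug = true

    # restart the API service afterwards; the unit name depends on the
    # distro, here something like:
    #   systemctl restart openstack-cinder-api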

Anyway, I doubled the rpc_response_timeout in cinder.conf now and will  
wait until this happens again. I can't find anything like a Cinder  
WSGI timeout for a vhost, and there's no haproxy involved because  
it's only one control node, deployed with Chef and Crowbar.
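
In case it helps anyone searching the archive later, the change looks  
roughly like this; the mod_wsgi socket-timeout below is only my guess  
at the per-vhost knob I was looking for, not something I have applied:

    # /etc/cinder/cinder.conf
    # oslo.messaging default is 60s, so doubled would be:
    [DEFAULT]
    rpc_response_timeout = 120

    # Apache vhost for cinder-api (mod_wsgi) -- untested assumption:
    #   WSGIDaemonProcess cinder-api ... socket-timeout=300
    #   Timeout 300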

I'll report back if anything happens.

Thanks!
Eugen


Quoting Tony Liu <tonyliu0592 at hotmail.com>:

> The first issue is the 504 timeout; updating the timeout in haproxy helps with that.
> The next is the timeout from cinder-api; [1] helps with that.
> Then the next is the RBD client timeout. I started the "debug RBD timeout  
> issue" thread for that.
> It seems that the root cause of this series of timeouts is Ceph.
> I followed comments from Konstantin to use msgr2 only.
> Hopefully that will fix the whole timeout issue.
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1930806
>
>
> Thanks!
> Tony
> ________________________________________
> From: Eugen Block <eblock at nde.ag>
> Sent: September 1, 2021 05:02 AM
> To: openstack-discuss at lists.openstack.org
> Subject: Cinder API timeout on single-control node
>
> Hi *,
>
> since your last responses were quite helpful regarding RabbitMQ, I
> would like to ask a different question about the same environment. It's
> an older OpenStack version (Pike) running with only one control node.
> There were already lots of volumes (way more than 1000) in that cloud,
> but after adding a bunch more (not sure how many exactly) in one
> project, the whole Cinder API became extremely slow. Both Horizon and
> the CLI run into timeouts:
>
> [Wed Sep 01 13:18:52.109178 2021] [wsgi:error] [pid 60440] [client
> <IP>:58474] Timeout when reading response headers from daemon process
> 'horizon':
> /srv/www/openstack-dashboard/openstack_dashboard/wsgi/django.wsgi,
> referer: http://<control>/project/volumes/
> [Wed Sep 01 13:18:53.664714 2021] [wsgi:error] [pid 13007] Not Found:
> /favicon.ico
>
> Sometimes the volume creation succeeds if you just retry, but it often
> fails. The dashboard shows a "504 gateway timeout" after two minutes
> (and still does after four minutes, now that I have increased the
> timeout in the Apache dashboard config).
>
> The timeout also occurs even when I open the volumes tab of an
> empty project.
>
> A couple of weeks ago I had already noticed some performance issues with
> the Cinder API when there are lots of attached volumes; many
> "available" volumes don't seem to slow things down. But since
> then the total number of volumes has doubled. At the moment there are
> more than 960 attached volumes across all projects and more than 750
> detached volumes. I searched cinder.conf for any helpful setting,
> but I'm not sure which would actually help. And since it's a
> production cloud, I would like to avoid restarting services all the
> time just to try something. Maybe some of you can point me in the
> right direction? I would appreciate any help!
>
> If there's more information I can provide just let me know.
>
> Thanks!
> Eugen
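
For readers finding this thread later: the haproxy timeout and the  
msgr2-only change Tony mentions above typically look something like  
the sketch below; section names and values are illustrative and not  
taken from either of our environments.

    # haproxy.cfg -- raise client/server timeouts for the API backends
    defaults
        timeout client  300s
        timeout server  300s

    # ceph.conf -- bind only the msgr2 protocol (Nautilus and newer)
    [global]
    ms_bind_msgr1 = false
    ms_bind_msgr2 = true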
