Cinder API timeout on single-control node
Hi *,

since your last responses were quite helpful regarding rabbitmq, I would like to ask a different question about the same environment. It's an older OpenStack version (Pike) running with only one control node. There were already lots of volumes (well over 1000) in that cloud, but after adding a bunch more (I'm not sure how many exactly) in one project, the whole Cinder API became extremely slow. Both Horizon and the CLI run into timeouts:

[Wed Sep 01 13:18:52.109178 2021] [wsgi:error] [pid 60440] [client <IP>:58474] Timeout when reading response headers from daemon process 'horizon': /srv/www/openstack-dashboard/openstack_dashboard/wsgi/django.wsgi, referer: http://<control>/project/volumes/
[Wed Sep 01 13:18:53.664714 2021] [wsgi:error] [pid 13007] Not Found: /favicon.ico

Sometimes the volume creation succeeds if you just retry, but it often fails. The dashboard shows a "504 gateway timeout" after two minutes (or after four minutes since I increased the timeout in the Apache dashboard config). The timeout occurs even if I just open the volumes tab of an empty project.

A couple of weeks ago I had already noticed some performance issues with the Cinder API when there are lots of attached volumes; many "available" volumes don't seem to slow things down. But since then the total number of volumes has doubled: at the moment there are more than 960 attached volumes across all projects and more than 750 detached volumes. I searched cinder.conf for a helpful setting, but I'm not sure which would actually help, and since it's a production cloud I would like to avoid restarting services all the time just to try something. Maybe some of you can point me in the right direction? I would appreciate any help! If there's more information I can provide, just let me know.

Thanks!
Eugen
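For reference, the cinder.conf option most often looked at first for this kind of API slowness is the number of API workers, together with the Apache timeout Eugen mentions; the values and the vhost path below are illustrative assumptions, not taken from the environment described here:

    # /etc/cinder/cinder.conf
    [DEFAULT]
    # Number of cinder-api worker processes; the default is the number of
    # CPUs on the host. More workers help when many list/show requests
    # arrive in parallel.
    osapi_volume_workers = 8

    # Apache vhost for the dashboard (path is a guess for a SUSE layout):
    # /etc/apache2/vhosts.d/openstack-dashboard.conf
    # Overall request timeout in seconds for the vhost.
    Timeout 600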
The first issue is the 504 timeout; increasing the timeout in haproxy helps with that. The next is the timeout from cinder-api; [1] helps with that. After that comes the RBD client timeout; I started the "debug RBD timeout issue" thread for that one. It seems that the root cause of this whole series of timeouts is Ceph. I followed Konstantin's comments and switched to msgr2 only, and hopefully that will fix the timeouts altogether.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1930806

Thanks!
Tony
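For reference, the haproxy timeouts Tony refers to are usually raised in haproxy.cfg, and "msgr2 only" on the client side can be expressed by listing only the v2 monitor endpoints in ceph.conf; the values, section placement and addresses below are illustrative assumptions:

    # haproxy.cfg: raise the per-request timeouts for the cinder-api
    # frontend/backend (or globally in the defaults section)
    defaults
        timeout connect 10s
        timeout client  600s
        timeout server  600s

    # ceph.conf on the OpenStack client side: point clients only at the
    # msgr2 (v2, port 3300) monitor addresses (addresses are placeholders)
    [global]
    mon_host = [v2:192.168.0.1:3300],[v2:192.168.0.2:3300],[v2:192.168.0.3:3300]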
Thanks, Tony. I enabled debug logs for Cinder and restarted cinder-api, but then the issue was no longer reproducible. Suddenly all API calls (even 'cinder list --all') took less than a minute, and we haven't seen timeouts since. I turned the debug logs off again and am still waiting for the problem to reoccur. In the meantime they also deleted 600 unused volumes, which probably helped, too. Anyway, I have now doubled the rpc_response_timeout in cinder.conf and will wait until this happens again. I can't find anything like a Cinder WSGI timeout for a vhost, and there's no haproxy involved because it's only one control node, deployed with Chef and Crowbar. I'll report back if anything happens.

Thanks!
Eugen
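For reference, doubling the RPC timeout that Eugen mentions would look roughly like this in cinder.conf (the oslo.messaging default is 60 seconds, so doubling it gives 120; the value is only an illustration of that step):

    # /etc/cinder/cinder.conf
    [DEFAULT]
    # Time in seconds that cinder-api waits for an RPC reply (e.g. from
    # cinder-scheduler or cinder-volume) before raising a MessagingTimeout.
    rpc_response_timeout = 120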
participants (2)
- Eugen Block
- Tony Liu