[kolla][nova][cinder] Gateway-Timeout error on VM evacuation when the VM has a volume attached.
Hi everyone, I ran into an issue when trying to evacuate a host. The platform is stable/rocky, deployed with kolla-ansible, and all storage backends are connected to Ceph.

Before the evacuation, the source host had about 24 VMs running. When I shut down the node and executed the evacuation, a few VMs failed with error code 504. Strangely, the failed VMs are all the ones with a volume attached.

I then checked the nova-compute log; the detailed error is pasted at the link below: https://pastebin.com/uaE7YrP1

Does anyone have any experience with this? I googled but could not find enough information about it.

Thanks!
On 7/18/2019 3:53 AM, Eddie Yen wrote:
Before the evacuation, the source host had about 24 VMs running. When I shut down the node and executed the evacuation, a few VMs failed with error code 504. Strangely, the failed VMs are all the ones with a volume attached.
I then checked the nova-compute log; the detailed error is pasted at the link below: https://pastebin.com/uaE7YrP1
Does anyone have any experience with this? I googled but could not find enough information about it.
Are there errors in the cinder-api logs during the evacuate of all VMs from the host? Are you doing the evacuate operation on all VMs on the host concurrently or in serial? I wonder if you're over-loading cinder and that's causing the timeout somehow. The timeout from cinder is when deleting volume attachment records, which would be terminating connections in the storage backend under the covers. Check the cinder-volume logs for errors as well. -- Thanks, Matt
Hi Matt, thanks for your reply.

The log I pasted is from nova-compute. I also checked the cinder-api and cinder-volume logs around the same timestamp; strangely, no error messages were found during that time.

As I remember, I launched the evacuation on the host.

Perhaps something is overloaded, but it doesn't seem to be cinder. The environment is a 3-node all-in-one model: each node runs control + compute, and the 3 nodes form a controller HA cluster. When I shut down one of the nodes, all API requests become quite slow (noticeable when using the dashboard), and they return to normal once the node is back.

I'll try the evacuation again, but this time just disabling the nova host or stopping the nova services, to test whether it happens again.

On Tue, 23 Jul 2019 at 06:40, Matt Riedemann <mriedemos@gmail.com> wrote:
On 7/18/2019 3:53 AM, Eddie Yen wrote:
Before the evacuation, the source host had about 24 VMs running. When I shut down the node and executed the evacuation, a few VMs failed with error code 504. Strangely, the failed VMs are all the ones with a volume attached.
I then checked the nova-compute log; the detailed error is pasted at the link below: https://pastebin.com/uaE7YrP1
Does anyone have any experience with this? I googled but could not find enough information about it.
Are there errors in the cinder-api logs during the evacuate of all VMs from the host? Are you doing the evacuate operation on all VMs on the host concurrently or in serial? I wonder if you're over-loading cinder and that's causing the timeout somehow. The timeout from cinder is when deleting volume attachment records, which would be terminating connections in the storage backend under the covers. Check the cinder-volume logs for errors as well.
--
Thanks,
Matt
On 23/07, Eddie Yen wrote:
Hi Matt, thanks for your reply.
The log I pasted is from nova-compute. I also checked the cinder-api and cinder-volume logs around the same timestamp; strangely, no error messages were found during that time.
Hi,

It could make sense that you see no errors in Cinder: the error in your pastebin is not coming from Cinder, it is coming from your HAProxy (or whatever load balancer you have in front of the Cinder-API nodes).

Attachment delete is a synchronous operation, so all the different connection timeouts can affect it: Nova to HAProxy, HAProxy to Cinder-API, Cinder-API to Cinder-Volume via RabbitMQ, and Cinder-Volume to the storage backend.

I would recommend looking at the specific attachment_delete request that failed in the Cinder logs to see how long it took to complete, and then checking how long it took for the 504 error to happen. With that information you can get an idea of how much higher your timeout needs to be.

It could also happen that Cinder-API raises a timeout error when calling Cinder-Volume. In that case you should check the cinder-volume service to see how long the operation took, since it continues running. Internally the Cinder-API to Cinder-Volume timeout is usually around 60 seconds (rpc_response_timeout).

You need to ensure that your HAProxy and Cinder RPC timeouts are in sync and are long enough for the operation to complete in the worst-case scenario (see the configuration sketch below).

Cheers,
Gorka.
As I remember, I launched the evacuation on the host.
Perhaps something is overloaded, but it doesn't seem to be cinder. The environment is a 3-node all-in-one model: each node runs control + compute, and the 3 nodes form a controller HA cluster. When I shut down one of the nodes, all API requests become quite slow (noticeable when using the dashboard), and they return to normal once the node is back.
I'll try the evacuation again, but this time just disabling the nova host or stopping the nova services, to test whether it happens again.
On Tue, 23 Jul 2019 at 06:40, Matt Riedemann <mriedemos@gmail.com> wrote:
On 7/18/2019 3:53 AM, Eddie Yen wrote:
Before the evacuation, the source host had about 24 VMs running. When I shut down the node and executed the evacuation, a few VMs failed with error code 504. Strangely, the failed VMs are all the ones with a volume attached.
I then checked the nova-compute log; the detailed error is pasted at the link below: https://pastebin.com/uaE7YrP1
Does anyone have any experience with this? I googled but could not find enough information about it.
Are there errors in the cinder-api logs during the evacuate of all VMs from the host? Are you doing the evacuate operation on all VMs on the host concurrently or in serial? I wonder if you're over-loading cinder and that's causing the timeout somehow. The timeout from cinder is when deleting volume attachment records, which would be terminating connections in the storage backend under the covers. Check the cinder-volume logs for errors as well.
--
Thanks,
Matt
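As a minimal sketch of the alignment Gorka describes above, assuming a kolla-ansible deployment with the default node_custom_config override path; the value is illustrative only, not a recommendation:

    # /etc/kolla/config/cinder.conf -- merged into the deployed cinder.conf
    # by kolla-ansible (assumes the default /etc/kolla/config override path).
    [DEFAULT]
    # oslo.messaging default is 60s; raise it so the synchronous
    # attachment_delete RPC has time to finish on the backend, and keep the
    # HAProxy server timeout at least this high.
    rpc_response_timeout = 180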
On 7/25/2019 3:14 AM, Gorka Eguileor wrote:
Attachment delete is a synchronous operation, so all the different connection timeouts can affect it: Nova to HAProxy, HAProxy to Cinder-API, Cinder-API to Cinder-Volume via RabbitMQ, and Cinder-Volume to the storage backend.
I would recommend looking at the specific attachment_delete request that failed in the Cinder logs to see how long it took to complete, and then checking how long it took for the 504 error to happen. With that information you can get an idea of how much higher your timeout needs to be.
It could also happen that Cinder-API raises a timeout error when calling Cinder-Volume. In that case you should check the cinder-volume service to see how long the operation took, since it continues running.
Internally the Cinder-API to Cinder-Volume timeout is usually around 60 seconds (rpc_response_timeout).
Yeah, this is a known intermittent issue in our CI jobs as well, for example:

http://status.openstack.org/elastic-recheck/#1763712

As I mentioned in the bug report for that issue:

https://bugs.launchpad.net/cinder/+bug/1763712

It might be worth using the long_rpc_timeout approach for this, assuming the HTTP response doesn't time out. Nova uses long_rpc_timeout for known long RPC calls:

https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.lon...

Cinder should probably do the same for initialize-connection-style RPC calls. I've seen other gate failures where cinder-backup to cinder-volume RPC calls to initialize a connection have timed out as well, e.g.:

https://bugs.launchpad.net/cinder/+bug/1739482

--
Thanks,
Matt
Thanks for the advice.

Actually, I tested the evacuation again two days ago, and this time it was successful: all VMs, including those with volumes attached, were evacuated with no errors. The Horizon response is still slow when one node is shut down, but it was faster than before.

I think we still need a longer timeout, and I believe I should raise rpc_response_timeout rather than long_rpc_timeout in nova. Please correct me if I'm wrong.

Many thanks,
Eddie.

On Thu, 25 Jul 2019 at 20:11, Matt Riedemann <mriedemos@gmail.com> wrote:
On 7/25/2019 3:14 AM, Gorka Eguileor wrote:
Attachment delete is a synchronous operation, so all the different connection timeouts can affect it: Nova to HAProxy, HAProxy to Cinder-API, Cinder-API to Cinder-Volume via RabbitMQ, and Cinder-Volume to the storage backend.
I would recommend looking at the specific attachment_delete request that failed in the Cinder logs to see how long it took to complete, and then checking how long it took for the 504 error to happen. With that information you can get an idea of how much higher your timeout needs to be.
It could also happen that Cinder-API raises a timeout error when calling Cinder-Volume. In that case you should check the cinder-volume service to see how long the operation took, since it continues running.
Internally the Cinder-API to Cinder-Volume timeout is usually around 60 seconds (rpc_response_timeout).
Yeah this is a known intermittent issue in our CI jobs as well, for example:
http://status.openstack.org/elastic-recheck/#1763712
As I mentioned in the bug report for that issue:
https://bugs.launchpad.net/cinder/+bug/1763712
It might be worth using the long_rpc_timeout approach for this, assuming the HTTP response doesn't time out. Nova uses long_rpc_timeout for known long RPC calls:
https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.lon...
Cinder should probably do the same for initialize-connection-style RPC calls. I've seen other gate failures where cinder-backup to cinder-volume RPC calls to initialize a connection have timed out as well, e.g.:
https://bugs.launchpad.net/cinder/+bug/1739482
--
Thanks,
Matt
On 7/25/2019 11:54 PM, Eddie Yen wrote:
I believe I should raise rpc_response_timeout rather than long_rpc_timeout in nova.
Since Cinder doesn't have the long_rpc_timeout option like Nova, your only option is to bump up rpc_response_timeout in Cinder, but that will be used by all RPC calls in Cinder, not just the initialize/terminate connection calls for attachments. Maybe that's not a problem, but long_rpc_timeout in Nova allows us to pick which RPC calls to use it on rather than applying it everywhere. The changes to Cinder shouldn't be that hard if they follow the Nova patch [1].

[1] https://review.opendev.org/#/c/566696/

--
Thanks,
Matt
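As a rough sketch of the selective approach Matt describes, this is what applying a longer timeout to a single RPC call looks like with oslo.messaging. The long_rpc_timeout option name and its 1800s default come from Nova; the option registration and the attachment_delete wrapper below are hypothetical illustrations, not actual Cinder or Nova code:

    # Hypothetical sketch: apply a longer timeout only to known long-running
    # RPC calls instead of raising rpc_response_timeout for everything.
    from oslo_config import cfg

    CONF = cfg.CONF
    CONF.register_opts([
        cfg.IntOpt('long_rpc_timeout', default=1800,
                   help='Timeout for RPC calls known to run long, such as '
                        'attachment create/delete against a slow backend.'),
    ])


    def attachment_delete(client, ctxt, attachment_id):
        # 'client' is an oslo_messaging.RPCClient; only this call gets the
        # longer timeout, every other call keeps rpc_response_timeout.
        cctxt = client.prepare(timeout=CONF.long_rpc_timeout)
        return cctxt.call(ctxt, 'attachment_delete',
                          attachment_id=attachment_id)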
Roger that, thanks for the explanation.

I think there may be another reason I hit this issue. The environment had no internet access and no local NTP server until the last test. Before that test, the nova and cinder services were unstable because they kept flapping up and down, and I found that the clocks were out of sync between nodes. We gave one node outside connectivity and pointed the NTP clients on the other nodes at it, and the problem was solved; the test then succeeded.

I'm not sure, but that could be one of the reasons, right?

I still think I need to tune the timeout values, since the API response is slow when a node is shut down. I wonder why it becomes slow when a node is down. I'll try raising rpc_response_timeout in Cinder and do more testing.

On Fri, 26 Jul 2019 at 21:42, Matt Riedemann <mriedemos@gmail.com> wrote:
On 7/25/2019 11:54 PM, Eddie Yen wrote:
I believe I should raise rpc_response_timeout rather than long_rpc_timeout in nova.
Since Cinder doesn't have the long_rpc_timeout option like Nova, your only option is to bump up rpc_response_timeout in Cinder, but that will be used by all RPC calls in Cinder, not just the initialize/terminate connection calls for attachments. Maybe that's not a problem, but long_rpc_timeout in Nova allows us to pick which RPC calls to use it on rather than applying it everywhere. The changes to Cinder shouldn't be that hard if they follow the Nova patch [1].
[1] https://review.opendev.org/#/c/566696/
--
Thanks,
Matt
If time gets too far out of sync, you get auth errors because tokens appear to come from the future or from a too-distant past.

Kind regards,
Radek

On Fri, 26 Jul 2019 at 17:12, Eddie Yen <missile0407@gmail.com> wrote:
Roger that, thanks for the explanation.
I think there may be another reason I hit this issue. The environment had no internet access and no local NTP server until the last test. Before that test, the nova and cinder services were unstable because they kept flapping up and down, and I found that the clocks were out of sync between nodes. We gave one node outside connectivity and pointed the NTP clients on the other nodes at it, and the problem was solved; the test then succeeded.
I'm not sure, but that could be one of the reasons, right?
I still think I need to tune the timeout values, since the API response is slow when a node is shut down. I wonder why it becomes slow when a node is down.
I'll try raising rpc_response_timeout in Cinder and do more testing.
On Fri, 26 Jul 2019 at 21:42, Matt Riedemann <mriedemos@gmail.com> wrote:
On 7/25/2019 11:54 PM, Eddie Yen wrote:
I believe I should raise rpc_response_timeout rather than long_rpc_timeout in nova.
Since Cinder doesn't have the long_rpc_timeout option like Nova, your only option is to bump up rpc_response_timeout in Cinder, but that will be used by all RPC calls in Cinder, not just the initialize/terminate connection calls for attachments. Maybe that's not a problem, but long_rpc_timeout in Nova allows us to pick which RPC calls to use it on rather than applying it everywhere. The changes to Cinder shouldn't be that hard if they follow the Nova patch [1].
[1] https://review.opendev.org/#/c/566696/
--
Thanks,
Matt
On Thu, 18 Jul 2019 at 09:54, Eddie Yen <missile0407@gmail.com> wrote:
Hi everyone, I ran into an issue when trying to evacuate a host. The platform is stable/rocky, deployed with kolla-ansible, and all storage backends are connected to Ceph.
Before the evacuation, the source host had about 24 VMs running. When I shut down the node and executed the evacuation, a few VMs failed with error code 504. Strangely, the failed VMs are all the ones with a volume attached.
I then checked the nova-compute log; the detailed error is pasted at the link below: https://pastebin.com/uaE7YrP1
Does anyone have any experience with this? I googled but could not find enough information about it.
Thanks!
Gateway timeout suggests the server timeout in haproxy is too low, and the server (cinder-api) has not responded to the request in time. The default timeout is 60s, and is configured via haproxy_server_timeout (and possibly haproxy_client_timeout). You could try increasing this in globals.yml. We do use a larger timeout for glance-api (haproxy_glance_api_client_timeout and haproxy_glance_api_server_timeout, both 6h). Perhaps we need something similar for cinder-api.
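As a minimal sketch of that knob on the kolla-ansible side, using the variables Mark names; the value is illustrative only, and should be chosen based on how long the slowest attachment operation actually takes:

    # /etc/kolla/globals.yml -- raises the default HAProxy client/server
    # timeouts (60s) for the services fronted by HAProxy; values illustrative.
    haproxy_client_timeout: "10m"
    haproxy_server_timeout: "10m"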
On 8/6/2019 7:18 AM, Mark Goddard wrote:
We do use a larger timeout for glance-api (haproxy_glance_api_client_timeout and haproxy_glance_api_server_timeout, both 6h). Perhaps we need something similar for cinder-api.
A 6 hour timeout for cinder API calls would be nuts IMO. The thing that was failing was a volume attachment delete/create from what I recall, which is the newer version (as of Ocata?) of the old initialize_connection/terminate_connection APIs. These are synchronous RPC calls from cinder-api to cinder-volume to do things on the storage backend, and we have seen them take longer than 60 seconds in gate CI runs with the lvm driver. I think the investigation normally turned up lvchange taking over 60 seconds on some concurrent operation, locking out the RPC call, which eventually results in a MessagingTimeout from oslo.messaging.

That's unrelated to your gateway timeout from HAProxy, but the point is, yeah, you likely want to bump up those timeouts since cinder-api makes these synchronous calls to the cinder-volume service. I just don't think you need to go to 6 hours :). I think the keystoneauth1 default HTTP response timeout is 10 minutes, so maybe try that.

--
Thanks,
Matt
On Tue, 6 Aug 2019 at 16:33, Matt Riedemann <mriedemos@gmail.com> wrote:
On 8/6/2019 7:18 AM, Mark Goddard wrote:
We do use a larger timeout for glance-api (haproxy_glance_api_client_timeout and haproxy_glance_api_server_timeout, both 6h). Perhaps we need something similar for cinder-api.
A 6 hour timeout for cinder API calls would be nuts IMO. The thing that was failing was a volume attachment delete/create from what I recall, which is the newer version (as of Ocata?) of the old initialize_connection/terminate_connection APIs. These are synchronous RPC calls from cinder-api to cinder-volume to do things on the storage backend, and we have seen them take longer than 60 seconds in gate CI runs with the lvm driver. I think the investigation normally turned up lvchange taking over 60 seconds on some concurrent operation, locking out the RPC call, which eventually results in a MessagingTimeout from oslo.messaging. That's unrelated to your gateway timeout from HAProxy, but the point is, yeah, you likely want to bump up those timeouts since cinder-api makes these synchronous calls to the cinder-volume service. I just don't think you need to go to 6 hours :). I think the keystoneauth1 default HTTP response timeout is 10 minutes, so maybe try that.
Yeah, wasn't advocating for 6 hours - just showing which knobs are available :)
--
Thanks,
Matt
Hi Mark, thanks for the suggestion.

I think so too: cinder-api may be fine, but HAProxy could be very busy with one controller down. I'll try increasing the cinder-api timeout value.

On Wed, 7 Aug 2019 at 00:06, Mark Goddard <mark@stackhpc.com> wrote:
On Tue, 6 Aug 2019 at 16:33, Matt Riedemann <mriedemos@gmail.com> wrote:
On 8/6/2019 7:18 AM, Mark Goddard wrote:
We do use a larger timeout for glance-api (haproxy_glance_api_client_timeout and haproxy_glance_api_server_timeout, both 6h). Perhaps we need something similar for cinder-api.
A 6 hour timeout for cinder API calls would be nuts IMO. The thing that was failing was a volume attachment delete/create from what I recall, which is the newer version (as of Ocata?) of the old initialize_connection/terminate_connection APIs. These are synchronous RPC calls from cinder-api to cinder-volume to do things on the storage backend, and we have seen them take longer than 60 seconds in gate CI runs with the lvm driver. I think the investigation normally turned up lvchange taking over 60 seconds on some concurrent operation, locking out the RPC call, which eventually results in a MessagingTimeout from oslo.messaging. That's unrelated to your gateway timeout from HAProxy, but the point is, yeah, you likely want to bump up those timeouts since cinder-api makes these synchronous calls to the cinder-volume service. I just don't think you need to go to 6 hours :). I think the keystoneauth1 default HTTP response timeout is 10 minutes, so maybe try that.
Yeah, wasn't advocating for 6 hours - just showing which knobs are available :)
--
Thanks,
Matt
On Thu, 8 Aug 2019 at 11:39, Eddie Yen <missile0407@gmail.com> wrote:
Hi Mark, thanks for the suggestion.
I think so too: cinder-api may be fine, but HAProxy could be very busy with one controller down. I'll try increasing the cinder-api timeout value.
Will you be proposing this fix upstream?
On Wed, 7 Aug 2019 at 00:06, Mark Goddard <mark@stackhpc.com> wrote:
On Tue, 6 Aug 2019 at 16:33, Matt Riedemann <mriedemos@gmail.com> wrote:
On 8/6/2019 7:18 AM, Mark Goddard wrote:
We do use a larger timeout for glance-api (haproxy_glance_api_client_timeout and haproxy_glance_api_server_timeout, both 6h). Perhaps we need something similar for cinder-api.
A 6 hour timeout for cinder API calls would be nuts IMO. The thing that was failing was a volume attachment delete/create from what I recall, which is the newer version (as of Ocata?) of the old initialize_connection/terminate_connection APIs. These are synchronous RPC calls from cinder-api to cinder-volume to do things on the storage backend, and we have seen them take longer than 60 seconds in gate CI runs with the lvm driver. I think the investigation normally turned up lvchange taking over 60 seconds on some concurrent operation, locking out the RPC call, which eventually results in a MessagingTimeout from oslo.messaging. That's unrelated to your gateway timeout from HAProxy, but the point is, yeah, you likely want to bump up those timeouts since cinder-api makes these synchronous calls to the cinder-volume service. I just don't think you need to go to 6 hours :). I think the keystoneauth1 default HTTP response timeout is 10 minutes, so maybe try that.
Yeah, wasn't advocating for 6 hours - just showing which knobs are available :)
--
Thanks,
Matt
Perhaps, but not so fast. I still need to do more investigation.

On Thu, 8 Aug 2019 at 22:36, Mark Goddard <mark@stackhpc.com> wrote:
On Thu, 8 Aug 2019 at 11:39, Eddie Yen <missile0407@gmail.com> wrote:
Hi Mark, thanks for the suggestion.
I think so too: cinder-api may be fine, but HAProxy could be very busy with one controller down. I'll try increasing the cinder-api timeout value.
Will you be proposing this fix upstream?
On Wed, 7 Aug 2019 at 00:06, Mark Goddard <mark@stackhpc.com> wrote:
On Tue, 6 Aug 2019 at 16:33, Matt Riedemann <mriedemos@gmail.com> wrote:
On 8/6/2019 7:18 AM, Mark Goddard wrote:
We do use a larger timeout for glance-api (haproxy_glance_api_client_timeout and haproxy_glance_api_server_timeout, both 6h). Perhaps we need something similar for cinder-api.
A 6 hour timeout for cinder API calls would be nuts IMO. The thing that was failing was a volume attachment delete/create from what I recall, which is the newer version (as of Ocata?) of the old initialize_connection/terminate_connection APIs. These are synchronous RPC calls from cinder-api to cinder-volume to do things on the storage backend, and we have seen them take longer than 60 seconds in gate CI runs with the lvm driver. I think the investigation normally turned up lvchange taking over 60 seconds on some concurrent operation, locking out the RPC call, which eventually results in a MessagingTimeout from oslo.messaging. That's unrelated to your gateway timeout from HAProxy, but the point is, yeah, you likely want to bump up those timeouts since cinder-api makes these synchronous calls to the cinder-volume service. I just don't think you need to go to 6 hours :). I think the keystoneauth1 default HTTP response timeout is 10 minutes, so maybe try that.
Yeah, wasn't advocating for 6 hours - just showing which knobs are available :)
--
Thanks,
Matt
participants (5)
- Eddie Yen
- Gorka Eguileor
- Mark Goddard
- Matt Riedemann
- Radosław Piliszek