On 23/07, Eddie Yen wrote:
Hi Matt, thanks for your reply first.
The log I pasted is from nova-compute. I also checked the cinder-api and cinder-volume logs around that timestamp. Strangely, no error messages were found during that time.
Hi,

It could make sense that you see no errors in Cinder. The error from your pastebin is not coming from Cinder, it is coming from your HAProxy (or whatever load balancer you have in front of the Cinder-API nodes).

Attachment delete is a synchronous operation, so all the different connection timeouts may affect it: Nova to HAProxy, HAProxy to Cinder-API, Cinder-API to Cinder-Volume via RabbitMQ, and Cinder-Volume to the storage backend.

I would recommend looking at the specific attachment_delete request that failed in the Cinder logs to see how long it took to complete, and then checking how long it took for the 504 error to happen. With that info you can get an idea of how much higher your timeout must be.

It could also happen that the Cinder-API raises a timeout error when calling Cinder-Volume. In that case you should check the cinder-volume service to see how long it took to complete, as the operation continues there. Internally the Cinder-API to Cinder-Volume timeout is usually around 60 seconds (rpc_response_timeout).

You need to ensure that your HAProxy and Cinder RPC timeouts are in sync and are long enough for the operation to complete in the worst-case scenario.

Cheers,
Gorka.
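PS: As a rough sketch of what I mean by keeping the timeouts in sync (the values and section names here are only illustrative, your HAProxy backend name and config paths will differ):

    # haproxy.cfg -- backend in front of cinder-api
    # 'timeout server' is what produces the 504 when cinder-api takes
    # longer than this to answer the attachment delete
    backend cinder_api
        timeout server 180s

    # cinder.conf on the API/volume nodes
    [DEFAULT]
    # how long cinder-api waits for cinder-volume over RabbitMQ
    # (defaults to 60 seconds)
    rpc_response_timeout = 180

The point is simply that the HAProxy server timeout should be at least as large as rpc_response_timeout, and both should be larger than the slowest attachment_delete you see in the logs.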
I remember I launched the evacuation on the host.
Perhaps it's overloading, but it's not happening on Cinder, because the environment is a 3-node all-in-one installation model. That means control+compute on each node, and the 3 nodes form a controller HA cluster. When I shut down one of the nodes, I found that all API requests became pretty slow (you can feel it when using the dashboard), and everything went back to normal once the node was back.
I'll try the evacuation again, but with just disabling the nova host or stopping the nova services, to test whether it happens again or not.
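Roughly what I have in mind is something like this (the service name depends on the distro, e.g. nova-compute on Ubuntu, so take it as a sketch):

    # on the source node: stop only nova-compute, keep cinder-api,
    # cinder-volume and haproxy on that node running
    systemctl stop openstack-nova-compute

    # from a controller: wait until nova reports the compute as down,
    # then evacuate the affected servers
    openstack compute service list --host <node>
    nova evacuate <server-id>

That should tell me whether the 504 comes from the whole node being offline (and the API slowdown that causes) or from the evacuation itself.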
Matt Riedemann <mriedemos@gmail.com> wrote on Tue, Jul 23, 2019 at 6:40 AM:
On 7/18/2019 3:53 AM, Eddie Yen wrote:
Before I tried to evacuate the host, the source host had about 24 VMs running. When I shut down the node and executed the evacuation, a few VMs failed. The error code is 504. Strangely, the failed VMs all have their own volume attached.
Then I checked the nova-compute log; the detailed error is pasted at the link below: https://pastebin.com/uaE7YrP1
Does anyone have any experience with this? I googled but couldn't find enough information about it.
Are there errors in the cinder-api logs during the evacuate of all VMs from the host? Are you doing the evacuate operation on all VMs on the host concurrently or in serial? I wonder if you're over-loading cinder and that's causing the timeout somehow. The timeout from cinder is when deleting volume attachment records, which would be terminating connections in the storage backend under the covers. Check the cinder-volume logs for errors as well.
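For example, something along these lines should show whether cinder-api ever received and finished those requests (log paths and formats depend on your deployment, so adjust accordingly):

    # did cinder-api get the attachment delete calls, and how long
    # after the request did they complete?
    grep "DELETE /v3/" /var/log/cinder/cinder-api.log | grep attachments

    # anything going wrong on the volume service around the same time?
    grep -iE "error|timeout" /var/log/cinder/cinder-volume.log

If the DELETE shows up in cinder-api but takes longer than whatever is in front of it (haproxy) allows, that would explain the 504 that nova gets.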
--
Thanks,
Matt