Thanks for the advice.

Actually, I tested evacuation again 2 days ago, and this time it was successful. All VMs, including those with volumes attached, were evacuated with no errors.
Horizon still responds slowly when one node is shut down, but it is faster than before.
Still, I think we need a longer timeout.

I also think I should increase rpc_response_timeout rather than long_rpc_timeout in nova.
Please correct me if I'm wrong.
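
To make the question concrete, here is a minimal sketch (plain Python, not taken from nova or cinder) of how the timeouts along the request chain could be compared. The hop names follow the chain described in the quoted message below. Every number except the roughly 60-second rpc_response_timeout mentioned there is a hypothetical placeholder, and both function names are made up for illustration.

```python
# Timeouts in seconds for each hop, outermost first.
# Only the ~60 s rpc_response_timeout figure comes from this thread;
# the other two values are hypothetical placeholders.
HOPS = [
    ("Nova -> HAProxy", 120),
    ("HAProxy -> Cinder-API", 90),
    ("Cinder-API -> Cinder-Volume (rpc_response_timeout)", 60),
]


def first_hop_to_time_out(hops, operation_seconds):
    """Return the name of the hop whose timeout expires first, or None.

    A hop times out only if the operation runs longer than its timeout;
    among those, the hop with the smallest timeout fires first.
    """
    expired = [(timeout, name) for name, timeout in hops
               if operation_seconds > timeout]
    if not expired:
        return None
    return min(expired)[1]


def misordered_hops(hops):
    """Return (outer, inner) name pairs where the outer hop's timeout is
    not strictly longer than the timeout of the hop it wraps."""
    return [
        (outer[0], inner[0])
        for outer, inner in zip(hops, hops[1:])
        if outer[1] <= inner[1]
    ]
```

With these placeholder numbers, first_hop_to_time_out(HOPS, 75) points at the Cinder-API to Cinder-Volume RPC call, which is why I am looking at rpc_response_timeout first. The real values would have to come from the Cinder logs and the actual configuration.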

Many thanks,
Eddie.

On Thu, Jul 25, 2019 at 8:11 PM, Matt Riedemann <mriedemos@gmail.com> wrote:
On 7/25/2019 3:14 AM, Gorka Eguileor wrote:
> Attachment delete is a synchronous operation, so all the different
> connection timeouts may affect the operation: Nova to HAProxy, HAProxy
> to Cinder-API, Cinder-API to Cinder-Volume via RabbitMQ, Cinder-Volume
> to Storage backend.
>
> I would recommend looking at the specific attachment_delete request
> that failed in the Cinder logs to see how long it took to complete, and
> then checking how long it took for the 504 error to happen.  With that info
> you can get an idea of how much higher your timeout must be.
>
> It could also happen that the Cinder-API raises a timeout error when
> calling the Cinder-Volume.  In this case you should check the
> cinder-volume service to see how long the operation took to complete,
> as it continues running there after the timeout.
>
> Internally the Cinder-API to Cinder-Volume timeout is usually around 60
> seconds (rpc_response_timeout).

Yeah this is a known intermittent issue in our CI jobs as well, for example:

http://status.openstack.org/elastic-recheck/#1763712

As I mentioned in the bug report for that issue:

https://bugs.launchpad.net/cinder/+bug/1763712

It might be worth using the long_rpc_timeout approach for this, assuming
the HTTP response doesn't time out. Nova uses long_rpc_timeout for known
long RPC calls:

https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.long_rpc_timeout

Cinder should probably do the same for initialize-connection-style RPC
calls. I've seen other gate failures where cinder-backup to
cinder-volume RPC calls to initialize a connection have timed out as
well, e.g.:

https://bugs.launchpad.net/cinder/+bug/1739482

--

Thanks,

Matt