<div dir="ltr">Thanks for the advice.<div><br></div><div>Actually, I tested evacuation again 2 days ago, and this time it succeeded. All VMs, including those with attached volumes, were evacuated with no errors.</div><div>Horizon still responds slowly when one node is shut down, but it is faster than before.</div><div>I think we still need to set a longer timeout, though.</div><div><br></div><div>I also think I should increase rpc_response_timeout rather than long_rpc_timeout in Nova.</div><div>Please correct me if I am wrong.</div><div><br></div><div>Many thanks,</div><div>Eddie.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Matt Riedemann <<a href="mailto:mriedemos@gmail.com">mriedemos@gmail.com</a>> wrote on Thu, Jul 25, 2019 at 8:11 PM:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 7/25/2019 3:14 AM, Gorka Eguileor wrote:<br>

> Attachment delete is a synchronous operation, so all the different<br>

> connection timeouts may affect the operation: Nova to HAProxy, HAProxy<br>

> to Cinder-API, Cinder-API to Cinder-Volume via RabbitMQ, Cinder-Volume<br>

> to Storage backend.<br>

> <br>

> I would recommend looking at the specific attachment_delete request<br>

> that failed in Cinder logs and see how long it took to complete, and<br>

> then check how long it took for the 504 error to happen.  With that info<br>

> you can get an idea of how much higher your timeout must be.<br>

> <br>

> It could also happen that the Cinder-API raises a timeout error when<br>

> calling the Cinder-Volume.  In this case you should check the<br>

> cinder-volume service to see how long it took it to complete, as the<br>

> operation continues.<br>

> <br>

> Internally the Cinder-API to Cinder-Volume timeout is usually around 60<br>

> seconds (rpc_response_timeout).<br>

<br>

Yeah this is a known intermittent issue in our CI jobs as well, for example:<br>

<br>

<a href="http://status.openstack.org/elastic-recheck/#1763712" rel="noreferrer" target="_blank">http://status.openstack.org/elastic-recheck/#1763712</a><br>

<br>

As I mentioned in the bug report for that issue:<br>

<br>

<a href="https://bugs.launchpad.net/cinder/+bug/1763712" rel="noreferrer" target="_blank">https://bugs.launchpad.net/cinder/+bug/1763712</a><br>

<br>

It might be worth using the long_rpc_timeout approach for this assuming <br>

the HTTP response doesn't time out. Nova uses long_rpc_timeout for known <br>

long RPC calls:<br>

<br>

<a href="https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.long_rpc_timeout" rel="noreferrer" target="_blank">https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.long_rpc_timeout</a><br>

<br>

Cinder should probably do the same for initialize connection style RPC <br>

calls. I've seen other gate failures where cinder-backup to <br>

cinder-volume rpc calls to initialize a connection have timed out as <br>

well, e.g.:<br>

<br>

<a href="https://bugs.launchpad.net/cinder/+bug/1739482" rel="noreferrer" target="_blank">https://bugs.launchpad.net/cinder/+bug/1739482</a><br>

<br>

-- <br>

<br>

Thanks,<br>

<br>

Matt<br>

<br>

</blockquote></div>
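<div>The comparison Gorka describes above (how long the attachment_delete request took to complete versus how long it took for the 504 to appear) can be sketched in a few lines of Python. This is only a rough illustration: the timestamps are made-up placeholders rather than values from this thread, the helper names <code>elapsed_seconds</code> and <code>suggest_timeout</code> are hypothetical, the timestamp format is assumed to be the usual "YYYY-MM-DD HH:MM:SS.mmm" log prefix, and the 1.5x headroom factor is an arbitrary assumption.</div>

```python
import math
from datetime import datetime

# Assumed log timestamp layout, e.g. "2019-07-25 08:11:00.123".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    started = datetime.strptime(start, LOG_TIME_FORMAT)
    ended = datetime.strptime(end, LOG_TIME_FORMAT)
    return (ended - started).total_seconds()


def suggest_timeout(operation_seconds: float, headroom: float = 1.5) -> int:
    """Scale the observed duration by a headroom factor and round up.

    The headroom factor is an arbitrary assumption, not a recommendation
    from the thread.
    """
    return math.ceil(operation_seconds * headroom)


# Placeholder timestamps (hypothetical, copied by hand from the logs):
request_received = "2019-07-25 10:00:00.000"   # request logged by cinder-api
gateway_timeout = "2019-07-25 10:01:00.000"    # 504 returned to the caller
operation_done = "2019-07-25 10:01:35.500"     # cinder-volume finished

time_to_504 = elapsed_seconds(request_received, gateway_timeout)
time_to_complete = elapsed_seconds(request_received, operation_done)

print(f"504 returned after {time_to_504:.1f}s")
print(f"operation completed after {time_to_complete:.1f}s")
print(f"candidate timeout: {suggest_timeout(time_to_complete)}s")
```

<div>If the operation takes longer to complete than it takes for the 504 to appear, the timeout that fired is too short for this operation. Which hop's timeout actually fired (Nova to HAProxy, HAProxy to Cinder-API, or Cinder-API to Cinder-Volume) still has to be determined from the logs.</div>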