<div dir="ltr"><div>Here's how I reproduce the original problem. I verify the cluster is working, then kill rabbit, then try again. </div><div><br></div><div> 1. start cluster, create vms, migrate: ok</div><div> 2. kill and restart rabbit</div>
<div> 3. migrate vm: timeout</div><div><br></div><div>Here's a trace from <a href="https://gist.github.com/noelbk/619426fa88c3bdd0534c">https://gist.github.com/noelbk/619426fa88c3bdd0534c</a>, taken after rabbit was restarted. This pattern repeats a few times during a migration:</div>
<div><br></div><div> 19:24:56 10.35.0.13 sends _msg_id = 29c6579c5de24c00b0b0e55579b8e047</div><div> 19:24:56 10.35.0.14 receives _msg_id = 29c6579c5de24c00b0b0e55579b8e047</div>
<div> 19:24:56 10.35.0.14 acknowledges _msg_id = 29c6579c5de24c00b0b0e55579b8e047</div><div> 19:25:56 10.35.0.13 Timed out waiting for a reply to message ID 29c6579c5de24c00b0b0e55579b8e047</div><div><br></div><div>
I'm instrumenting the RPC calls now to see whether they all actually retry and complete after the timeout errors. I'm also collecting a trace of every RPC call to check whether some are being acknowledged but not replied to in time.</div>
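<div><br></div><div>A minimal sketch of how a trace like this could be checked for the "acknowledged but never replied" pattern. It assumes the hand-summarized line format quoted above, not the raw oslo.messaging log format, and the names EVENT_RE, group_events and acked_but_timed_out are hypothetical helpers made up for illustration:</div>

```python
import re
from collections import defaultdict

# Matches the summarized trace lines quoted in this message, e.g.
#   "19:24:56 10.35.0.14 acknowledges _msg_id = 29c6..."
#   "19:25:56 10.35.0.13 Timed out waiting for a reply to message ID 29c6..."
# Assumption: this is the summary format, not the real log format.
EVENT_RE = re.compile(
    r"^\s*(\d\d:\d\d:\d\d)\s+(\S+)\s+"
    r"(?:(sends|receives|acknowledges)\s+_msg_id\s*=\s*"
    r"|(Timed out) waiting for a reply to message ID\s+)"
    r"([0-9a-f]{32})\s*$"
)


def group_events(lines):
    """Return {msg_id: [(time, host, event), ...]} in input order."""
    events = defaultdict(list)
    for line in lines:
        m = EVENT_RE.match(line)
        if m is None:
            continue  # skip anything that is not a recognized trace line
        time, host, verb, _timed_out, msg_id = m.groups()
        events[msg_id].append((time, host, verb or "timeout"))
    return dict(events)


def acked_but_timed_out(lines):
    """Message ids that were acknowledged but whose caller still timed out."""
    suspicious = []
    for msg_id, evs in group_events(lines).items():
        kinds = [kind for _time, _host, kind in evs]
        if "acknowledges" in kinds and "timeout" in kinds:
            suspicious.append(msg_id)
    return suspicious
```

<div>Fed the four lines quoted above, this would flag the single id 29c6579c5de24c00b0b0e55579b8e047. It only says the request was acknowledged and the caller gave up; it does not say where the reply went.</div>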
<div><br></div><div>While digging through oslo.messaging, I noticed that in amqpdriver.py the incoming queues in ReplyWaiter and AMQPListener are plain Python lists, not thread-safe Queue objects. ReplyWaiter does acquire a lock, but I'm not sure that access to the plain lists is always thread-safe, and I don't know yet whether this is what's causing my issue.</div>
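<div><br></div><div>To make the concern concrete, here is a rough sketch of the two approaches. This is not the oslo.messaging code; LockedList and collect are hypothetical names. In CPython a single list append or pop is atomic, but a compound "check if non-empty, then pop" is not, so a plain list needs one lock around the whole compound step, while queue.Queue does that locking internally:</div>

```python
import queue
import threading


class LockedList:
    """Plain list whose compound check-then-pop runs under one lock."""

    def __init__(self):
        self._items = []
        self._lock = threading.Lock()

    def put(self, item):
        with self._lock:
            self._items.append(item)

    def pop_or_none(self):
        # Without the lock, another thread could empty the list between
        # the emptiness check and the pop.
        with self._lock:
            return self._items.pop(0) if self._items else None


def collect(num_producers=4, per_producer=1000):
    """Several producer threads feed one consumer through queue.Queue."""
    incoming = queue.Queue()  # handles its own locking

    def produce(worker):
        for i in range(per_producer):
            incoming.put((worker, i))

    threads = [threading.Thread(target=produce, args=(n,))
               for n in range(num_producers)]
    for t in threads:
        t.start()

    received = []
    for _ in range(num_producers * per_producer):
        # Blocks until an item arrives; raises queue.Empty after 5 seconds.
        received.append(incoming.get(timeout=5))
    for t in threads:
        t.join()
    return received
```

<div>Either form is safe as long as every access goes through it. The risk with a bare list is any code path that touches it without holding the lock.</div>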
<div><br></div><div>Still digging. I'll keep you updated here and keep updating <a href="https://bugs.launchpad.net/oslo/+bug/1338732">https://bugs.launchpad.net/oslo/+bug/1338732</a></div><div><br></div><div>--</div><div>Noel</div>
<div><br></div><div><br></div><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Jul 8, 2014 at 4:30 AM, Gordon Sim <span dir="ltr">&lt;<a href="mailto:gsim@redhat.com" target="_blank">gsim@redhat.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="">On 07/08/2014 02:00 AM, Noel Burton-Krahn wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
The thing is, that produces errors exactly like what I'm seeing in nova<br>
if rabbit dies and we reconnect to a new rabbit instance.<br>
</blockquote>
<br></div>
A call timing out while waiting for a response is a fairly general problem for which there could be different causes.<div class=""><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I'm tracing<br>
through the nova calls in the rabbit reconnect case to confirm that<br>
acknowledge is always being called when it should be.<br>
</blockquote>
<br></div>
Even if it is, the acknowledgement could be lost if the connection to rabbitmq fails. However, I don't think that is likely to be the cause of the timeout. Unlike in the example, in a real oslo.messaging based service the fact that the request is redelivered shouldn't be a problem. The reply issued to it may be ignored or dropped, but the subsequent requests will be processed.<br>
<br>
I'm not completely clear on what the timing is in your original problem. You say the timeout happens after a restart. Is it immediately after (i.e. could some connections still be detecting the failure)? Or long enough after that you are confident everything has failed over correctly?<br>
<br>
(Obviously a failure or restart *during* a call may well result in a timeout; that is the expected semantics at present).<div class="HOEnZb"><div class="h5"><br>
<br>
</div></div></blockquote></div><br></div>