[Openstack] oslo.messaging times out after reconnecting to rabbit

Noel Burton-Krahn noel at pistoncloud.com
Wed Jul 9 17:58:37 UTC 2014


Here's how I reproduce the original problem.  I verify the cluster is
working, then kill rabbit, then try again.

    1. start cluster, create vms, migrate: ok
    2. kill and restart rabbit
    3. migrate vm: timeout

Here's a trace from https://gist.github.com/noelbk/619426fa88c3bdd0534c
taken after rabbit was restarted.  This pattern repeats a few times
during a migration:

    19:24:56 10.35.0.13 sends _msg_id = 29c6579c5de24c00b0b0e55579b8e047
    19:24:56 10.35.0.14 receives _msg_id = 29c6579c5de24c00b0b0e55579b8e047
    19:24:56 10.35.0.14 acknowledges _msg_id = 29c6579c5de24c00b0b0e55579b8e047
    19:25:56 10.35.0.13 Timed out waiting for a reply to message ID 29c6579c5de24c00b0b0e55579b8e047

I'm instrumenting the rpc calls now to see if they all actually do retry
and complete after the timeout errors. I'm trying to get a trace of all the
rpc calls to see if they're being acknowledged but not replied to in time.
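
A minimal sketch of the kind of correlation I mean, assuming trace lines
in the simplified shape shown above (not the raw log format in the gist).
The function names and the regex are my own illustration, not anything
from nova or oslo.messaging:

```python
import re
from collections import defaultdict

# Matches the summarized trace lines in this message, not raw nova logs.
EVENT_RE = re.compile(
    r"(?P<time>\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+"
    r"(?P<event>sends|receives|acknowledges|Timed out)\b.*?"
    r"(?P<msg_id>[0-9a-f]{32})"
)


def group_by_msg_id(lines):
    """Return {msg_id: [(time, host, event), ...]} in input order."""
    events = defaultdict(list)
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            events[m.group("msg_id")].append(
                (m.group("time"), m.group("host"), m.group("event")))
    return dict(events)


def acked_but_timed_out(events):
    """Message IDs that were acknowledged yet still timed out on the caller."""
    return [
        msg_id for msg_id, evs in events.items()
        if any(e == "acknowledges" for _, _, e in evs)
        and any(e == "Timed out" for _, _, e in evs)
    ]


TRACE = [
    "19:24:56 10.35.0.13 sends _msg_id = 29c6579c5de24c00b0b0e55579b8e047",
    "19:24:56 10.35.0.14 receives _msg_id = 29c6579c5de24c00b0b0e55579b8e047",
    "19:24:56 10.35.0.14 acknowledges _msg_id = 29c6579c5de24c00b0b0e55579b8e047",
    "19:25:56 10.35.0.13 Timed out waiting for a reply to message ID "
    "29c6579c5de24c00b0b0e55579b8e047",
]

events = group_by_msg_id(TRACE)
suspects = acked_but_timed_out(events)
print(suspects)
```

Any ID this reports was received and acknowledged by the server side but
never answered in time from the caller's point of view, which is the
pattern in the trace above.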

While digging through oslo.messaging, I noticed that in amqpdriver.py, the
incoming queues in ReplyWaiter and AMQPListener are plain lists, not
thread-safe Queues.  ReplyWaiter does acquire a lock, but I'm not sure
access to the plain lists is always thread-safe.  I don't know yet whether
this is causing my issue.
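
To make the concern concrete, here is a small sketch of the two
approaches.  It is not the oslo.messaging code: ListInbox and
wait_for_reply are hypothetical names, and it uses the Python 3 `queue`
module.  In CPython a single list.append() or list.pop() is atomic, but
a check-then-pop sequence is not, so a plain list is only safe if every
access goes through the same lock.  queue.Queue does that locking
internally and also provides a blocking get with a timeout:

```python
import queue
import threading


class ListInbox:
    """Hypothetical inbox: a plain list guarded by an explicit lock."""

    def __init__(self):
        self._items = []
        self._lock = threading.Lock()

    def put(self, item):
        with self._lock:
            self._items.append(item)

    def pop_or_none(self):
        # The emptiness check and the pop must share one lock acquisition;
        # otherwise two consumers can both see one item and one pop fails.
        with self._lock:
            return self._items.pop(0) if self._items else None


def wait_for_reply(inbox, timeout):
    """Block on a queue.Queue until a reply arrives; None on timeout."""
    try:
        return inbox.get(timeout=timeout)
    except queue.Empty:
        return None


# Usage: a producer thread delivers one reply, the caller waits for it.
replies = queue.Queue()
producer = threading.Thread(
    target=replies.put, args=({"_msg_id": "abc", "result": 42},))
producer.start()
reply = wait_for_reply(replies, timeout=1.0)
producer.join()
print(reply)
```

With the list version, correctness depends on every code path
remembering to take the lock; with the Queue version, it does not.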

Still digging.  I'll keep you updated here and in
https://bugs.launchpad.net/oslo/+bug/1338732

--
Noel

On Tue, Jul 8, 2014 at 4:30 AM, Gordon Sim <gsim at redhat.com> wrote:

> On 07/08/2014 02:00 AM, Noel Burton-Krahn wrote:
>
>> The thing is, that produces errors exactly like what I'm seeing in nova
>> if rabbit dies and we reconnect to a new rabbit instance.
>>
>
> A call timing out while waiting for a response is a fairly general problem
> for which there could be different causes.
>
>
>> I'm tracing through the nova calls in the rabbit reconnect case to
>> confirm that acknowledge is always being called when it should be.
>>
>
> Even if it is, the acknowledgement could be lost if the connection to
> rabbitmq fails. However, I don't think that is likely to be the cause of
> the timeout. Unlike in the example, in a real oslo.messaging based service
> the fact that the request is redelivered shouldn't be a problem. The reply
> issued to it may be ignored or dropped, but the subsequent requests will be
> processed.
>
> I'm not completely clear on what the timing is in your original problem.
> You say the timeout happens after a restart. Is it immediately after (i.e.
> could some connections still be detecting the failure)? Or long enough
> after that you are confident everything has failed over correctly?
>
> (Obviously a failure or restart *during* a call may well result in a
> timeout; that is the expected semantics at present).