[Openstack] oslo-messaging times out after reconnecting to rabbit

Noel Burton-Krahn noel at pistoncloud.com
Tue Jul 8 01:00:45 UTC 2014


Hi Gordon, thanks for your reply.

That was exactly the problem with my example: without an acknowledge, the old
messages stayed stuck in the queue, so no RPC reply ever arrived.  I based my
test program on oslo.messaging/tests/test_rabbit.py, which didn't call
acknowledge() anywhere.  The thing is, that produces errors exactly like the
ones I'm seeing in nova when rabbit dies and we reconnect to a new rabbit
instance.  I'm tracing through the nova calls in the rabbit reconnect case to
confirm that acknowledge() is always called when it should be.
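To make the failure mode concrete, here is a toy in-memory model of it. It is a sketch only: ToyBroker, Session and run_once are hypothetical names invented for illustration, not oslo.messaging or kombu APIs. It assumes, following Gordon's theory, that the topic queue outlives each run, that unacknowledged deliveries are put back at the front of the queue when a connection closes, and that a reply sent to a reply queue that no longer exists is silently dropped. A run that gets no reply stands in for MessagingTimeout.

```python
from collections import deque


class ToyBroker(object):
    """Hypothetical in-memory stand-in for RabbitMQ (NOT the oslo.messaging API)."""

    def __init__(self):
        self.topic_queues = {}   # topic -> deque of requests; outlives each run
        self.reply_queues = {}   # reply address -> list of replies; exists only while its run is alive

    def connect(self, reply_addr):
        self.reply_queues[reply_addr] = []
        return Session(self, reply_addr)


class Session(object):
    """One run of the test program: one connection with its own reply queue."""

    def __init__(self, broker, reply_addr):
        self.broker = broker
        self.reply_addr = reply_addr
        self.unacked = []  # (topic, msg) pairs delivered but not yet acknowledged

    def send(self, topic, body):
        msg = {"body": body, "reply_to": self.reply_addr}
        self.broker.topic_queues.setdefault(topic, deque()).append(msg)

    def poll(self, topic):
        # Deliver whatever sits at the head of the topic queue, stale or not.
        msg = self.broker.topic_queues[topic].popleft()
        self.unacked.append((topic, msg))
        return msg

    def acknowledge(self, topic, msg):
        self.unacked.remove((topic, msg))

    def reply(self, msg, body):
        # Assumption: a reply addressed to a vanished reply queue is dropped.
        replies = self.broker.reply_queues.get(msg["reply_to"])
        if replies is not None:
            replies.append(body)

    def got_reply(self):
        return len(self.broker.reply_queues[self.reply_addr]) > 0

    def close(self):
        # Assumption: unacknowledged deliveries go back to the front of their queue.
        for topic, msg in reversed(self.unacked):
            self.broker.topic_queues[topic].appendleft(msg)
        self.unacked = []
        del self.broker.reply_queues[self.reply_addr]


def run_once(broker, run_id, topic, ack):
    """Mimic one execution of the example program; False stands in for MessagingTimeout."""
    session = broker.connect("reply_%d" % run_id)
    session.send(topic, {"send": run_id})
    msg = session.poll(topic)          # the listener takes the head of the queue
    session.reply(msg, {"reply": 1})
    if ack:
        session.acknowledge(topic, msg)
    ok = session.got_reply()
    session.close()
    return ok
```

Calling run_once twice on the same ToyBroker with ack=False returns True and then False: the second run polls the first run's stale request, replies to a reply queue that is gone, and never pulls its own request. With ack=True both calls return True. This only models the bug in the example program; as Gordon points out, it does not by itself explain the behaviour seen in nova.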

Cheers,
--
Noel



On Mon, Jul 7, 2014 at 3:43 PM, Gordon Sim <gsim at redhat.com> wrote:

> On 07/06/2014 01:02 AM, Noel Burton-Krahn wrote:
>
>> Icehouse
>> oslo-messaging 1.3.0
>> rabbitmq-server 3.1.3
>>
>> We've noticed that nova RPC calls often fail after rabbit restarts.
>> I've tracked it down to oslo/rabbit/kombu timing out if it's forced to
>> reconnect to rabbit.  The code below times out waiting for a reply if
>> the topic has been used in a previous run.  The reply always arrives the
>> first time a topic is used, or if the topic is None.  But the second
>> run with the same topic will hang with this error:
>>
>>     MessagingTimeout: Timed out waiting for a reply to message ID ...
>>
>>
>> This problem seems too basic to have gone uncaught in oslo, but the
>> program below really does reproduce the same symptoms we see in nova
>> when run against a live rabbit server.  What's wrong with this picture?
>>
>
> Just a theory, but could the issue with the simple example be the
> following:
>
> * the same queue is used for the first and second run
> * the first request is not acknowledged, so when the first test exits it
> is left on the queue
> * on the second attempt, you retrieve the same first request, whose
> reply-to address is no longer valid, so the reply is never delivered
> * you then try to join the sender thread without pulling off another
> message, so you don't get to the second request
>
> Just a theory, as I say. It also doesn't explain the actual issue you
> observed with nova; it's just a property of this example.
>
> --Gordon.
>
>> Cheers
>> --
>> Noel
>>
>>
>> #! /usr/bin/python
>>
>> from oslo.config import cfg
>> import threading
>> from oslo import messaging
>> import logging
>> import time
>> log = logging.getLogger(__name__)
>>
>> class OsloTest():
>>      def test(self):
>>          # The code below times out waiting for a reply if the topic
>>          # has been used in a previous run.  The reply always arrives
>>          # the first time a topic is used, or if the topic is none.
>>          # But, the second run with the same topic will hang with this
>>          # error:
>>          #
>>          # MessagingTimeout: Timed out waiting for a reply to message ID ...
>>          #
>>          topic  = 'will_hang_on_second_usage'
>>          #topic  = None # never hangs
>>
>>          url = "%(proto)s://%(user)s:%(password)s@%(host)s/" % dict(
>>              proto = 'rabbit',
>>              host = '1.2.3.4',
>>              password = 'xxxxxxxx',
>>              user = 'rabbit-mq-user',
>>              )
>>          transport = messaging.get_transport(cfg.CONF, url)
>>          driver = transport._driver
>>
>>          target = messaging.Target(topic=topic)
>>          listener = driver.listen(target)
>>          ctxt={"context": True}
>>          timeout = 10
>>
>>          def send_main():
>>              log.debug("sending msg")
>>              reply = driver.send(target,
>>                                  ctxt,
>>                                  {'send': 1},
>>                                  wait_for_reply=True,
>>                                  timeout=timeout)
>>
>>              # times out if topic was not None and used before
>>              log.debug("received reply=%r" % (reply,))
>>
>>          send_thread = threading.Thread(target=send_main)
>>          send_thread.daemon = True
>>          send_thread.start()
>>
>>          msg = listener.poll()
>>          log.debug("received msg=%r" % (msg,))
>>
>>          msg.reply({'reply': 1})
>>
>>          log.debug("sent reply")
>>
>>          send_thread.join()
>>
>> if __name__ == '__main__':
>>      FORMAT = '%(asctime)-15s %(process)5d %(thread)5d %(filename)s %(funcName)s %(message)s'
>>      logging.basicConfig(level=logging.DEBUG, format=FORMAT)
>>      OsloTest().test()
>>
>>
>> _______________________________________________
>> Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
>> Post to     : openstack at lists.openstack.org
>> Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
>>
>>
>

