<div dir="ltr">Hi Gordon, thanks for your reply.<div><br></div><div>That was exactly the problem with my example: without an acknowledge, the old messages stayed stuck in the queue, so no RPC reply was ever delivered. I created my test program from oslo.messaging/tests/test_rabbit.py, which didn't have any calls to acknowledge(). The thing is, that produces errors exactly like what I'm seeing in nova if rabbit dies and we reconnect to a new rabbit instance. I'm tracing through the nova calls in the rabbit reconnect case to confirm that acknowledge is always being called when it should be.</div>
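<div><br></div><div>A minimal sketch of the change to the listener side of the test program, assuming the message returned by listener.poll() exposes acknowledge() and reply() as discussed above. serve_one, FakeMessage and FakeListener are hypothetical names used only so the snippet runs without a rabbit server; they are not part of oslo.messaging.</div>
<pre>
```python
def serve_one(listener, reply_body):
    """Poll one request, acknowledge it, then send the reply."""
    msg = listener.poll()
    # Without this call the request stays on the queue and is seen
    # again by the next run that listens on the same topic.
    msg.acknowledge()
    msg.reply(reply_body)
    return msg


class FakeMessage(object):
    """Hypothetical stand-in that records calls instead of talking to rabbit."""

    def __init__(self):
        self.calls = []

    def acknowledge(self):
        self.calls.append("acknowledge")

    def reply(self, body):
        self.calls.append(("reply", body))


class FakeListener(object):
    """Hypothetical listener whose poll() always returns one fixed message."""

    def __init__(self, message):
        self._message = message

    def poll(self):
        return self._message


handled = serve_one(FakeListener(FakeMessage()), {"reply": 1})
print(handled.calls)  # ['acknowledge', ('reply', {'reply': 1})]
```
</pre>
<div>In the real test, the listener would be the one returned by driver.listen(target), and the only change is the added acknowledge() call between poll() and reply().</div>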
<div><br></div><div>Cheers,</div><div>--</div><div>Noel</div><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Jul 7, 2014 at 3:43 PM, Gordon Sim <span dir="ltr">&lt;<a href="mailto:gsim@redhat.com" target="_blank">gsim@redhat.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="">On 07/06/2014 01:02 AM, Noel Burton-Krahn wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Icehouse<br>
oslo-messaging 1.3.0<br>
rabbitmq-server 3.1.3<br>
<br>
We've noticed that nova rpc calls fail often after rabbit restarts.<br>
I've tracked it down to oslo/rabbit/kombu timing out if it's forced to<br>
reconnect to rabbit. The code below times out waiting for a reply if<br>
the topic has been used in a previous run. The reply always arrives the<br>
first time a topic is used, or if the topic is none. But, the second<br>
run with the same topic will hang with this error:<br>
<br>
MessagingTimeout: Timed out waiting for a reply to message ID ...<br>
<br>
<br>
This problem seems too basic to not be caught earlier in oslo, but the<br>
program below does really reproduce the same symptoms we see in nova<br>
when run against a live rabbit server. What's wrong with this picture?<br>
</blockquote>
<br></div>
Just a theory, but could the issue with the simple example be the following:<br>
<br>
* the same queue is used for the first and second run<br>
* the first request is not acknowledged, so when the first test exits it's left on the queue<br>
* on the second attempt, you retrieve the same first request, whose reply-to address is no longer valid so the reply is never delivered<br>
* you then try to join the sender thread without pulling off another message, so you don't get to the second request<br>
<br>
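The sequence above can be sketched with a toy in-memory queue. This is only a model under stated assumptions: ToyQueue and run_once are hypothetical names, nothing here talks to RabbitMQ or oslo.messaging, and it assumes unacknowledged messages are requeued ahead of newer ones when the consumer goes away.<br>
<br>
<pre>
```python
from collections import deque


class ToyQueue:
    """Hypothetical in-memory model of one broker queue with ack semantics."""

    def __init__(self):
        self.ready = deque()   # messages waiting for delivery
        self.unacked = []      # delivered but not yet acknowledged

    def publish(self, message):
        self.ready.append(message)

    def consume(self):
        message = self.ready.popleft()
        self.unacked.append(message)
        return message

    def ack(self, message):
        self.unacked.remove(message)

    def close_consumer(self):
        # Assumption: when the consumer goes away, unacknowledged messages
        # go back to the front of the queue in their original order.
        self.ready.extendleft(reversed(self.unacked))
        self.unacked = []


def run_once(queue, run_id, acknowledge):
    """One run of the test: send a request, consume one message, reply."""
    reply_to = "reply-queue-%d" % run_id   # fresh reply address per run
    queue.publish({"body": {"send": 1}, "reply_to": reply_to})

    incoming = queue.consume()
    if acknowledge:
        queue.ack(incoming)

    # The sender only sees a reply addressed to this run's reply queue.
    got_reply = incoming["reply_to"] == reply_to
    queue.close_consumer()
    return got_reply


without_ack = ToyQueue()
print([run_once(without_ack, n, acknowledge=False) for n in (1, 2)])
# [True, False]

with_ack = ToyQueue()
print([run_once(with_ack, n, acknowledge=True) for n in (1, 2)])
# [True, True]
```
</pre>
In the unacknowledged case, the second run consumes the stale first request, whose reply address belongs to the first run, so the second sender never gets its reply.<br>
<br>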
Just a theory, as I say. It also doesn't explain the actual issue you observed with nova; it's just a property of this example.<span class="HOEnZb"><font color="#888888"><br>
<br>
--Gordon.<br>
<br>
</font></span><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5">
Cheers<br>
--<br>
Noel<br>
<br>
<br>
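#! /usr/bin/python<br>
<br>
from oslo.config import cfg<br>
import threading<br>
from oslo import messaging<br>
import logging<br>
import time<br>
log = logging.getLogger(__name__)<br>
<br>
class OsloTest():<br>
&nbsp;&nbsp;&nbsp;&nbsp;def test(self):<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# The code below times out waiting for a reply if the topic<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# has been used in a previous run. The reply always arrives<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# the first time a topic is used, or if the topic is none.<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# But, the second run with the same topic will hang with this<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# error:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;#<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# MessagingTimeout: Timed out waiting for a reply to message ID ...<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;#<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;topic = 'will_hang_on_second_usage'<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;#topic = None # never hangs<br>
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;url = "%(proto)s://%(user)s:%(password)s@%(host)s/" % dict(<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;proto = 'rabbit',<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;host = '1.2.3.4',<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;password = 'xxxxxxxx',<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;user = 'rabbit-mq-user',<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;transport = messaging.get_transport(cfg.CONF, url)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;driver = transport._driver<br>
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;target = messaging.Target(topic=topic)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;listener = driver.listen(target)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ctxt={"context": True}<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;timeout = 10<br>
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;def send_main():<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;log.debug("sending msg")<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;reply = driver.send(target,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ctxt,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{'send': 1},<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;wait_for_reply=True,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;timeout=timeout)<br>
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# times out if topic was not None and used before<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;log.debug("received reply=%r" % (reply,))<br>
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;send_thread = threading.Thread(target=send_main)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;send_thread.daemon = True<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;send_thread.start()<br>
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;msg = listener.poll()<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;log.debug("received msg=%r" % (msg,))<br>
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;msg.reply({'reply': 1})<br>
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;log.debug("sent reply")<br>
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;send_thread.join()<br>
<br>
if __name__ == '__main__':<br>
&nbsp;&nbsp;&nbsp;&nbsp;FORMAT = '%(asctime)-15s %(process)5d %(thread)5d %(filename)s %(funcName)s %(message)s'<br>
&nbsp;&nbsp;&nbsp;&nbsp;logging.basicConfig(level=logging.DEBUG, format=FORMAT)<br>
&nbsp;&nbsp;&nbsp;&nbsp;OsloTest().test()<br>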
<br>
<br></div></div><div class="">
_______________________________________________<br>
Mailing list: <a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack</a><br>
Post to : <a href="mailto:openstack@lists.openstack.org" target="_blank">openstack@lists.openstack.org</a><br>
Unsubscribe : <a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack</a><br>
<br>
</div></blockquote><div class="HOEnZb"><div class="h5">
<br>
<br>
</div></div></blockquote></div><br></div>