[openstack-dev] [oslo] debugging the failures in oslo.messaging gate

Doug Hellmann doug at doughellmann.com
Fri Aug 14 22:58:20 UTC 2015


All patches to oslo.messaging are currently failing the gate-tempest-dsvm-neutron-src-oslo.messaging job because the neutron service dies. amuller, kevinbenton, and I spent a bunch of time looking at it today, and I think we have an issue introduced by some asymmetric gating between the two projects.

Neutron has two different modes for starting the RPC service, depending on the number of workers requested. The problem comes up with rpc_workers=0, which is the new default: in that mode, rather than using the ProcessLauncher, the RPC server is started directly in the current process. That results in wait() being called in a way that violates the new constraints enforced within oslo.messaging after [1] landed. That patch is unreleased, so the only project seeing the problem is oslo.messaging. I’ve proposed a revert in [2], which passes the gate tests.
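To make the constraint concrete, here is a minimal sketch of the kind of same-thread rule [1] introduces: start(), stop(), and wait() must all run in the thread that first called start(). The class and names here are illustrative only, not oslo.messaging’s actual implementation:

```python
import threading


class ServerThreadGuard:
    """Illustrative sketch of a same-thread constraint.

    The first thread to call start() becomes the owner; any later
    start/stop/wait call from a different thread raises, which is the
    kind of failure neutron hits when the RPC server is started
    directly in-process with rpc_workers=0.
    """

    def __init__(self):
        self._owner = None

    def _check_owner(self, op):
        current = threading.current_thread()
        if self._owner is None:
            self._owner = current
        elif self._owner is not current:
            raise RuntimeError(
                "start/stop/wait must be called in the same thread "
                "(%s called from %s, owner is %s)"
                % (op, current.name, self._owner.name))

    def start(self):
        self._check_owner("start")

    def stop(self):
        self._check_owner("stop")

    def wait(self):
        self._check_owner("wait")
```

With this sketch, calling wait() from a thread other than the one that called start() raises RuntimeError with the same message we see in the logs.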

I have also added [3] to neutron to see if we can get the gate job to show the same error messages I was seeing locally (part of the trouble we’ve had with debugging this is that the process exits quickly enough that some of the log messages are never written). I’m using [4], an oslo.messaging patch that was failing before, to trigger the job and capture the necessary log. That patch should *not* be landed; I don’t think the change it reverts is related to the problem, it was just handy for debugging.

The error message I see locally, “start/stop/wait must be called in the same thread”, is visible in this log snippet [5].

It’s not clear what the best path forward is. Obviously neutron is doing something with the RPC server that oslo.messaging doesn’t expect/want/like, but it’s equally obvious that we can’t release oslo.messaging in its current state and break neutron. Someone with a better understanding of both neutron and oslo.messaging may be able to fix neutron’s use of the RPC code to avoid this case. There may be other users of oslo.messaging with the same ‘broken’ pattern, but IIRC neutron is unique in the way it runs both RPC and API services in the same process. To be safe, though, it may be better to log error messages instead of doing whatever we’re doing now that causes the process to exit. We can then set up a logstash search for the error message, find other applications that would be broken, fix them, and then switch oslo.messaging back to throwing an exception.
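The softer behaviour suggested above might look something like the sketch below: on a violation, log the error (so it becomes searchable via logstash) and carry on, with an opt-in strict mode that restores the exception once known users are fixed. The function name and the strict flag are hypothetical, not oslo.messaging API:

```python
import logging
import threading

LOG = logging.getLogger(__name__)


def check_same_thread(owner, op, strict=False):
    """Hypothetical soft check of the same-thread rule.

    owner: the thread that first called start() (may be None).
    op: the operation being checked, e.g. "wait".
    strict: if True, raise as the current code does; otherwise log an
    error message that a logstash query can later find, and return
    False so the caller can decide what to do.
    """
    current = threading.current_thread()
    if owner is not None and owner is not current:
        msg = ("start/stop/wait must be called in the same thread: "
               "%s called from %s, expected %s"
               % (op, current.name, owner.name))
        if strict:
            raise RuntimeError(msg)
        LOG.error(msg)
        return False
    return True
```

The idea is that shipping the non-strict variant first lets us find every application with the ‘broken’ pattern from the logs before the hard failure is turned back on.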

I’m going to be at the Ops summit next week, so I need to hand off debugging and fixing the issue to someone else on the Oslo team. We created an etherpad to track progress and make notes today, and all of these links are referenced there, too [6].

Thanks again to amuller and kevinbenton for the time they spent helping with debugging today!

Doug

[1] https://review.openstack.org/#/c/209043/
[2] https://review.openstack.org/#/c/213299/
[3] https://review.openstack.org/#/c/213360/
[4] https://review.openstack.org/#/c/213297/
[5] http://paste.openstack.org/show/415030/
[6] https://etherpad.openstack.org/p/wm2D6UGZbf
