[openstack-dev] [oslo] debugging the failures in oslo.messaging gate
Davanum Srinivas
davanum at gmail.com
Sun Aug 16 21:40:16 UTC 2015
Doug,
I've filed https://review.openstack.org/213542 to log error messages. I will
work with the oslo.messaging folks over the next few days.
Thanks,
Dims
On Fri, Aug 14, 2015 at 6:58 PM, Doug Hellmann <doug at doughellmann.com>
wrote:
> All patches to oslo.messaging are currently failing the
> gate-tempest-dsvm-neutron-src-oslo.messaging job because the neutron
> service dies. amuller, kevinbenton, and I spent a bunch of time looking at
> it today, and I think we have an issue introduced by some asymmetric gating
> between the two projects.
>
> Neutron has 2 different modes for starting the RPC service, depending on
> the number of workers requested. The problem comes up with rpc_workers=0,
> which is the new default. In that mode, rather than using the
> ProcessLauncher, the RPC server is started directly in the current process.
> That results in wait() being called in a way that violates the new
> constraints being enforced within oslo.messaging after [1] landed. That
> patch is unreleased, so the only project seeing the problem is
> oslo.messaging. I’ve proposed a revert in [2], which passes the gate tests.
>
> I have also added [3] to neutron to see if we can get the gate job to show
> the same error messages I was seeing locally (part of the trouble we’ve had
> with debugging this is that the process exits quickly enough that some of
> the log messages are never written). I’m using [4], an oslo.messaging patch
> that was already failing, to trigger the job and get the necessary log.
> That patch should *not* be landed: I don’t think the change it reverts is
> related to the problem; it was just handy for debugging.
>
> The error message I see locally, “start/stop/wait must be called in the
> same thread”, is visible in this log snippet [5].
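
For anyone following along, here is a rough sketch of the kind of check that
produces that message. It is only an illustration: SameThreadServer and
call_wait_from_other_thread are hypothetical names, plain threading stands in
for whatever threading model the real server uses, and none of this is the
actual oslo.messaging code from [1].

```python
import threading


class SameThreadServer:
    """Hypothetical stand-in for an RPC server; not the oslo.messaging class."""

    def __init__(self):
        self._owner = None  # ident of the thread that called start()

    def _check_thread(self):
        # Refuse calls that come from a thread other than the one that started us.
        if self._owner is not None and threading.get_ident() != self._owner:
            raise RuntimeError(
                "start/stop/wait must be called in the same thread")

    def start(self):
        self._owner = threading.get_ident()

    def stop(self):
        self._check_thread()

    def wait(self):
        self._check_thread()


def call_wait_from_other_thread(server):
    """Call server.wait() in a new thread and return the exception, if any."""
    errors = []

    def target():
        try:
            server.wait()
        except RuntimeError as exc:
            errors.append(exc)

    worker = threading.Thread(target=target)
    worker.start()
    worker.join()
    return errors[0] if errors else None
```

Calling start() and wait() from the same thread passes the check; starting in
one thread and waiting in another trips it, which is roughly the situation
described for the rpc_workers=0 path.
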
>
> It’s not clear what the best path forward is. Obviously neutron is doing
> something with the RPC server that oslo.messaging doesn’t expect/want/like,
> but also obviously we can’t release oslo.messaging in its current state and
> break neutron. Someone with a better understanding of both neutron and
> oslo.messaging may be able to fix neutron’s use of the RPC code to avoid
> this case. There may be other users of oslo.messaging with the same
> ‘broken’ pattern, but IIRC neutron is unique in the way it runs both RPC
> and API services in the same process. To be safe, though, it may be better
> to log error messages instead of doing whatever we’re doing now to cause
> the process to exit. We can then set up a logstash search for the error
> message and find other applications that would be broken, fix them, and
> then switch oslo.messaging back to throwing an exception.
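
A minimal sketch of that log-first approach, under the assumption that the
check can be switched between logging and raising. LenientServer and its
strict flag are made-up names for illustration, not an existing
oslo.messaging option.

```python
import logging
import threading

LOG = logging.getLogger(__name__)

MESSAGE = "start/stop/wait must be called in the same thread"


class LenientServer:
    """Hypothetical server stand-in; not the oslo.messaging implementation."""

    def __init__(self, strict=False):
        self._strict = strict  # True: raise; False: only log the violation
        self._owner = None     # ident of the thread that called start()

    def start(self):
        self._owner = threading.get_ident()

    def wait(self):
        if self._owner is not None and threading.get_ident() != self._owner:
            if self._strict:
                raise RuntimeError(MESSAGE)
            # Log instead of raising so the service keeps running and the
            # message can be found later with a log search.
            LOG.error(MESSAGE)
```

With strict=False the process keeps running and the message ends up in the
logs where it can be searched for; flipping strict to True later restores the
exception once the affected callers have been fixed.
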
>
> I’m going to be at the Ops summit next week, so I need to hand off
> debugging and fixing the issue to someone else on the Oslo team. We created
> an etherpad to track progress and make notes today, and all of these links
> are referenced there, too [6].
>
> Thanks again to amuller and kevinbenton for the time they spent helping
> with debugging today!
>
> Doug
>
> [1] https://review.openstack.org/#/c/209043/
> [2] https://review.openstack.org/#/c/213299/
> [3] https://review.openstack.org/#/c/213360/
> [4] https://review.openstack.org/#/c/213297/
> [5] http://paste.openstack.org/show/415030/
> [6] https://etherpad.openstack.org/p/wm2D6UGZbf
>
>
--
Davanum Srinivas :: https://twitter.com/dims