[openstack-dev] [oslo] debugging the failures in oslo.messaging gate

Doug Hellmann doug at doughellmann.com
Mon Aug 17 11:49:06 UTC 2015


Excerpts from Davanum Srinivas (dims)'s message of 2015-08-16 17:40:16 -0400:
> Doug,
> 
> I've filed https://review.openstack.org/213542 to log error messages. Will
> work with oslo.messaging folks the next few days.

Thanks, Dims!

> 
> Thanks,
> Dims
> 
> On Fri, Aug 14, 2015 at 6:58 PM, Doug Hellmann <doug at doughellmann.com>
> wrote:
> 
> > All patches to oslo.messaging are currently failing the
> > gate-tempest-dsvm-neutron-src-oslo.messaging job because the neutron
> > service dies. amuller, kevinbenton, and I spent a bunch of time looking at
> > it today, and I think we have an issue introduced by some asymmetric gating
> > between the two projects.
> >
> > Neutron has 2 different modes for starting the RPC service, depending on
> > the number of workers requested. The problem comes up with rpc_workers=0,
> > which is the new default. In that mode, rather than using the
> > ProcessLauncher, the RPC server is started directly in the current process.
> > That results in wait() being called in a way that violates the new
> > constraints being enforced within oslo.messaging after [1] landed. That
> > patch is unreleased, so the only project seeing the problem is
> > oslo.messaging. I’ve proposed a revert in [2], which passes the gate tests.
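[Inline note: a minimal sketch of the two start modes described above. All names here are hypothetical stand-ins, not the real neutron or oslo code; the point is only that with rpc_workers=0 the server is started directly in the calling process, which is the path that trips the new constraint.]

```python
# Hypothetical sketch of the two RPC start modes described above.
# With rpc_workers > 0, a process launcher forks workers and each
# child runs start()/wait() on its own service instance; with
# rpc_workers == 0 (the new default) the server starts in-process.

class FakeRpcServer:
    """Stand-in for an RPC server with a start() entry point."""
    def __init__(self):
        self.started = False

    def start(self):
        self.started = True


class FakeProcessLauncher:
    """Stand-in for oslo.service's ProcessLauncher (no real forking)."""
    def __init__(self):
        self.launched = []

    def launch_service(self, service, workers):
        # The real launcher forks `workers` children, each of which
        # calls start()/wait() on the service in its own process.
        self.launched.append((service, workers))


def start_rpc(rpc_workers, launcher=None):
    if rpc_workers > 0:
        launcher = launcher or FakeProcessLauncher()
        launcher.launch_service(FakeRpcServer(), workers=rpc_workers)
        return launcher
    # rpc_workers == 0: start the server directly in this process,
    # so later wait() calls may come from a different thread.
    server = FakeRpcServer()
    server.start()
    return server
```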
> >
> > I have also added [3] to neutron to see if we can get the gate job to show
> > the same error messages I was seeing locally (part of the trouble we’ve had
> > with debugging this is the process exits quickly enough that some of the
> > log messages are never being written). I’m using [4] as a patch in
> > oslo.messaging that was failing before to trigger the job to get the
> > necessary log. That patch should *not* be landed, since I don’t think the
> > change it reverts is related to the problem, it was just handy for
> > debugging.
> >
> > The error message I see locally, “start/stop/wait must be called in the
> > same thread”, is visible in this log snippet [5].
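[Inline note: for anyone reproducing this, here is a rough sketch of the kind of same-thread guard that produces that error. This is illustrative only, not the actual oslo.messaging implementation.]

```python
# Illustrative sketch (not the real oslo.messaging code) of a guard
# that records the thread of the first lifecycle call and rejects
# calls from any other thread with the error quoted above.
import threading


class GuardedServer:
    def __init__(self):
        self._thread_id = None

    def _check_thread(self):
        current = threading.get_ident()
        if self._thread_id is None:
            # Remember whichever thread makes the first call.
            self._thread_id = current
        elif self._thread_id != current:
            raise RuntimeError(
                'start/stop/wait must be called in the same thread')

    def start(self):
        self._check_thread()

    def stop(self):
        self._check_thread()

    def wait(self):
        self._check_thread()
```

Calling start() in one thread and wait() from another is exactly the pattern the rpc_workers=0 path can produce.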
> >
> > It’s not clear what the best path forward is. Obviously neutron is doing
> > something with the RPC server that oslo.messaging doesn’t expect/want/like,
> > but also obviously we can’t release oslo.messaging in its current state and
> > break neutron. Someone with a better understanding of both neutron and
> > oslo.messaging may be able to fix neutron’s use of the RPC code to avoid
> > this case. There may be other users of oslo.messaging with the same
> > ‘broken’ pattern, but IIRC neutron is unique in the way it runs both RPC
> > and API services in the same process. To be safe, though, it may be better
> > to log error messages instead of doing whatever we’re doing now to cause
> > the process to exit. We can then set up a logstash search for the error
> > message and find other applications that would be broken, fix them, and
> > then switch oslo.messaging back to throwing an exception.
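[Inline note: the mitigation proposed above could look roughly like the following. Hypothetical code, not the actual patch: the hard failure is downgraded to a logged error so a logstash search can find affected consumers before the check is made fatal again.]

```python
# Sketch of downgrading the same-thread violation from an exception
# to a logged error, gated by a flag we can flip once callers are
# fixed.  (Hypothetical helper, not the real oslo.messaging change.)
import logging
import threading

LOG = logging.getLogger(__name__)


def check_same_thread(expected_id, fatal=False):
    """Return True if the current thread matches expected_id.

    On mismatch: raise if fatal, otherwise log a searchable error
    message and return False.
    """
    current = threading.get_ident()
    if expected_id is not None and expected_id != current:
        msg = 'start/stop/wait must be called in the same thread'
        if fatal:
            raise RuntimeError(msg)
        # Searchable in logstash; fix callers, then set fatal=True.
        LOG.error(msg)
        return False
    return True
```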
> >
> > I’m going to be at the Ops summit next week, so I need to hand off
> > debugging and fixing the issue to someone else on the Oslo team. We created
> > an etherpad to track progress and make notes today, and all of these links
> > are referenced there, too [6].
> >
> > Thanks again to amuller and kevinbenton for the time they spent helping
> > with debugging today!
> >
> > Doug
> >
> > [1] https://review.openstack.org/#/c/209043/
> > [2] https://review.openstack.org/#/c/213299/
> > [3] https://review.openstack.org/#/c/213360/
> > [4] https://review.openstack.org/#/c/213297/
> > [5] http://paste.openstack.org/show/415030/
> > [6] https://etherpad.openstack.org/p/wm2D6UGZbf
> >
> >
> 


