[openstack-dev] [oslo][barbican][sahara] start RPC service before launcher wait?

Ken Giusti kgiusti at gmail.com
Mon Sep 18 18:46:23 UTC 2017


On Thu, Sep 14, 2017 at 7:33 PM, Adam Spiers <aspiers at suse.com> wrote:
>
> Hi Ken,
>
> Thanks a lot for the analysis, and sorry for the slow reply!
> Comments inline...
>
> Ken Giusti <kgiusti at gmail.com> wrote:
> > Hi Adam,
> >
> > I think there's a couple of problems here.
> >
> > Regardless of worker count, the service.wait() is called before
> > service.start().  And from looking at the oslo.service code, the 'wait()'
> > method is called after start(), then again after stop().  This doesn't match
> > up with the intended use of oslo.messaging.server.wait(), which should only
> > be called after .stop().
>
> Hmm, so are you saying that there might be a bug in oslo.service's
> usage of oslo.messaging, and that this Sahara bugfix was the wrong
> approach too?
>
> https://review.openstack.org/#/c/280741/1/sahara/cli/sahara_engine.py
>

Well, I don't think the explicit call to start() is going to help,
especially if the number of workers is > 1, since the workers are
forked and each needs to call start() from its own process space.
In fact, if the number of workers is > 1 then you not only get an RPC
server in each worker process, you also end up with an extra RPC
server in the calling thread.
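
To make the counting concrete, here is a minimal sketch that uses plain
os.fork() and a stub class.  It does not use oslo.service or
oslo.messaging at all: StubRPCServer and count_running_servers are
hypothetical names invented for this illustration, and the only
assumption is that each forked worker calls start() on its own copy of
the server.

```python
import os


class StubRPCServer:
    """Hypothetical stand-in for an RPC server (not oslo.messaging)."""

    def __init__(self):
        self.started_in = None  # pid of the process that called start()

    def start(self):
        self.started_in = os.getpid()


def count_running_servers(workers, start_in_parent):
    """Fork `workers` children; count processes holding a started server."""
    server = StubRPCServer()
    running = set()
    if start_in_parent:
        server.start()  # the explicit start() before launching workers
        running.add(server.started_in)

    read_fd, write_fd = os.pipe()
    children = []
    for _ in range(workers):
        pid = os.fork()
        if pid == 0:  # worker process
            os.close(read_fd)
            server.start()  # each worker starts its own copy
            os.write(write_fd, f"{server.started_in}\n".encode())
            os._exit(0)
        children.append(pid)

    os.close(write_fd)
    for pid in children:
        os.waitpid(pid, 0)
    # All writers have exited, so reading runs until EOF.
    with os.fdopen(read_fd) as pipe:
        running.update(int(line) for line in pipe)
    return len(running)
```

With two workers, count_running_servers(2, True) counts three processes
with a started server, while count_running_servers(2, False) counts two.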

Take a look at a test service I've created for oslo.messaging:

https://pastebin.com/rSA6AD82

If you change the main code to use the new sequence (an explicit
start() before the launcher's wait()), you'll end up with 3 RPC
servers (2 in the workers, one in the main process).

In that code I've made the wait() call a no-op if the server hasn't
been started first.  The stop() method calls stop() and then wait() on
the RPC server, which is the expected sequence as far as
oslo.messaging is concerned.
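
Roughly, the shape of that wrapper is something like the sketch below.
This is not the pastebin code: GuardedService and StubRPCServer are
hypothetical names, and the stub only records calls instead of talking
to a real oslo.messaging server.  What wait() should do once the server
is running is left as a no-op here, which is an assumption of the
sketch.

```python
class StubRPCServer:
    """Hypothetical stand-in that records the calls made to it."""

    def __init__(self):
        self.calls = []

    def start(self):
        self.calls.append("start")

    def stop(self):
        self.calls.append("stop")

    def wait(self):
        self.calls.append("wait")


class GuardedService:
    """Service wrapper that never forwards wait() ahead of stop()."""

    def __init__(self, server):
        self._server = server
        self._started = False

    def start(self):
        self._server.start()
        self._started = True

    def wait(self):
        if not self._started:
            return  # launcher called wait() before start(): nothing to do
        # Assumption: nothing to do while running either, because
        # stop() below performs the server's wait().

    def stop(self):
        if not self._started:
            return
        # Expected oslo.messaging order: stop() first, then wait().
        self._server.stop()
        self._server.wait()
        self._started = False
```

Calling wait(), start(), stop(), wait() on the wrapper in that order
leaves the stub with exactly start, stop, wait recorded, so the early
and late wait() calls from a launcher never reach the server.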

It seems to me that the bug is in oslo.service: calling wait() before
start() doesn't make sense.

> > Perhaps a bigger issue is that in the multi threaded case all threads
> > appear to be calling start, wait, and stop on the same instance of the
> > service (oslo.messaging rpc server).  At least that's what I'm seeing in my
> > muchly reduced test code:

I was wrong about this - I failed to notice that each service had
forked and was dealing with its own copy of the server.

> >
> > https://paste.fedoraproject.org/paste/-73zskccaQvpSVwRJD11cA
> >
> > The log trace shows multiple calls to start, wait, stop via different
> > threads to the same TaskServer instance:
> >
> > https://paste.fedoraproject.org/paste/dyPq~lr26sQZtMzHn5w~Vg
> >
> > Is that expected?
>
> Unfortunately in the interim, your pastes seem to have vanished - any
> chance you could repaste them?
>

Ugh - didn't keep a copy.  If you pull down that test code you can use
it to generate those traces.


> Thanks,
> Adam
>
> > On Mon, Jul 31, 2017 at 9:32 PM, Adam Spiers <aspiers at suse.com> wrote:
> > > Ken Giusti <kgiusti at gmail.com> wrote:
> > >> On Mon, Jul 31, 2017 at 10:01 AM, Adam Spiers <aspiers at suse.com> wrote:
> > >>> I recently discovered a bug where barbican-worker would hang on
> > >>> shutdown if queue.asynchronous_workers was changed from 1 to 2:
> > >>>
> > >>>    https://bugs.launchpad.net/barbican/+bug/1705543
> > >>>
> > >>> resulting in a warning like this:
> > >>>
> > >>>    WARNING oslo_messaging.server [-] Possible hang: stop is waiting for start to complete
> > >>>
> > >>> I found a similar bug in Sahara:
> > >>>
> > >>>    https://bugs.launchpad.net/sahara/+bug/1546119
> > >>>
> > >>> where the fix was to call start() on the RPC service before making the
> > >>> launcher wait() on it, so I ported the fix to Barbican, and it seems
> > >>> to work fine:
> > >>>
> > >>>    https://review.openstack.org/#/c/485755
> > >>>
> > >>> I noticed that both projects use ProcessLauncher; barbican uses
> > >>> oslo_service.service.launch() which has:
> > >>>
> > >>>    if workers is None or workers == 1:
> > >>>        launcher = ServiceLauncher(conf, restart_method=restart_method)
> > >>>    else:
> > >>>        launcher = ProcessLauncher(conf, restart_method=restart_method)
> > >>>
> > >>> However, I'm not an expert in oslo.service or oslo.messaging, and one
> > >>> of Barbican's core reviewers (thanks Kaitlin!) noted that not many
> > >>> other projects start the task before calling wait() on the launcher,
> > >>> so I thought I'd check here whether that is the correct fix, or
> > >>> whether there's something else odd going on.
> > >>>
> > >>> Any oslo gurus able to shed light on this?
> > >>>
> > >>
> > >> As far as an oslo.messaging server is concerned, the order of operations
> > >> is:
> > >>
> > >> server.start()
> > >> # do stuff until ready to stop the server...
> > >> server.stop()
> > >> server.wait()
> > >>
> > >> The final wait blocks until all requests that are in progress when stop()
> > >> is called finish and cleanup.
> > >
> > > Thanks - that makes sense.  So the question is, why would
> > > barbican-worker only hang on shutdown when there are multiple workers?
> > > Maybe the real bug is somewhere in oslo_service.service.ProcessLauncher
> > > and it's not calling start() correctly?




-- 
Ken Giusti  (kgiusti at gmail.com)


