[openstack-dev] [oslo][barbican][sahara] start RPC service before launcher wait?

Adam Spiers aspiers at suse.com
Thu Sep 14 23:33:03 UTC 2017


Hi Ken,

Thanks a lot for the analysis, and sorry for the slow reply!
Comments inline...

Ken Giusti <kgiusti at gmail.com> wrote:
> Hi Adam,
> 
> I think there are a couple of problems here.
> 
> Regardless of worker count, service.wait() is called before
> service.start().  And from looking at the oslo.service code, the 'wait()'
> method is called after start(), then again after stop().  This doesn't match
> up with the intended use of oslo.messaging.server.wait(), which should only
> be called after .stop().

Hmm, so are you saying that there might be a bug in oslo.service's
usage of oslo.messaging, and that this Sahara bugfix was the wrong
approach too?

https://review.openstack.org/#/c/280741/1/sahara/cli/sahara_engine.py
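
Just to check that I've understood the ordering you describe, here is a
minimal sketch.  StubServer and run_lifecycle are hypothetical names I
made up for illustration; this is not the oslo.messaging server class.
It raises on out-of-order calls purely to make the ordering visible,
whereas the real server (judging by the "Possible hang" warning quoted
further down) appears to log and block instead.

```python
class StubServer:
    """Hypothetical stand-in for an RPC server; NOT oslo.messaging code."""

    def __init__(self):
        self.state = "init"
        self.calls = []

    def start(self):
        self.calls.append("start")
        self.state = "started"

    def stop(self):
        # stop() only makes sense once the server has been started.
        if self.state != "started":
            raise RuntimeError("stop() called before start()")
        self.calls.append("stop")
        self.state = "stopped"

    def wait(self):
        # wait() is only valid after stop(); this is the assumed contract.
        if self.state != "stopped":
            raise RuntimeError("wait() called before stop()")
        self.calls.append("wait")
        self.state = "finished"


def run_lifecycle(server):
    """Drive the server through the start -> stop -> wait order."""
    server.start()
    # ... serve requests until it is time to shut down ...
    server.stop()
    server.wait()
    return server.calls
```

If that is right, then calling wait() straight after start(), as you say
oslo.service does, would be outside the intended use.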

> Perhaps a bigger issue is that in the multi-threaded case all threads
> appear to be calling start, wait, and stop on the same instance of the
> service (oslo.messaging rpc server).  At least that's what I'm seeing in my
> much-reduced test code:
> 
> https://paste.fedoraproject.org/paste/-73zskccaQvpSVwRJD11cA
> 
> The log trace shows multiple calls to start, wait, stop via different
> threads to the same TaskServer instance:
> 
> https://paste.fedoraproject.org/paste/dyPq~lr26sQZtMzHn5w~Vg
> 
> Is that expected?

Unfortunately in the interim, your pastes seem to have vanished - any
chance you could repaste them?
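
In the meantime, here is my rough guess at the pattern you describe, so
that others can follow along.  This is not a reconstruction of your test
code: SharedService, worker and run_workers are hypothetical names, and
nothing here touches oslo.service or oslo.messaging.  It only shows
several threads calling start, wait and stop on one shared instance and
records which thread made each call.

```python
import threading


class SharedService:
    """Hypothetical service object shared by all worker threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self.calls = []  # list of (method name, thread name)

    def _record(self, method):
        # Guard the list so concurrent appends stay consistent.
        with self._lock:
            self.calls.append((method, threading.current_thread().name))

    def start(self):
        self._record("start")

    def wait(self):
        self._record("wait")

    def stop(self):
        self._record("stop")


def worker(service):
    # Each thread drives the SAME instance through start, wait, stop.
    service.start()
    service.wait()
    service.stop()


def run_workers(service, count=2):
    """Run `count` threads against one shared service; return the call log."""
    threads = [
        threading.Thread(target=worker, args=(service,), name=f"worker-{i}")
        for i in range(count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return service.calls
```

With two workers the log contains each of start, wait and stop twice,
once per thread, all against the same object.  If that matches what your
trace showed, then it does look suspicious to me.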

Thanks,
Adam

> On Mon, Jul 31, 2017 at 9:32 PM, Adam Spiers <aspiers at suse.com> wrote:
> > Ken Giusti <kgiusti at gmail.com> wrote:
> >> On Mon, Jul 31, 2017 at 10:01 AM, Adam Spiers <aspiers at suse.com> wrote:
> >>> I recently discovered a bug where barbican-worker would hang on
> >>> shutdown if queue.asynchronous_workers was changed from 1 to 2:
> >>>
> >>>    https://bugs.launchpad.net/barbican/+bug/1705543
> >>>
> >>> resulting in a warning like this:
> >>>
> >>>    WARNING oslo_messaging.server [-] Possible hang: stop is waiting for start to complete
> >>>
> >>> I found a similar bug in Sahara:
> >>>
> >>>    https://bugs.launchpad.net/sahara/+bug/1546119
> >>>
> >>> where the fix was to call start() on the RPC service before making the
> >>> launcher wait() on it, so I ported the fix to Barbican, and it seems
> >>> to work fine:
> >>>
> >>>    https://review.openstack.org/#/c/485755
> >>>
> >>> I noticed that both projects use ProcessLauncher; Barbican uses
> >>> oslo_service.service.launch(), which has:
> >>>
> >>>    if workers is None or workers == 1:
> >>>        launcher = ServiceLauncher(conf, restart_method=restart_method)
> >>>    else:
> >>>        launcher = ProcessLauncher(conf, restart_method=restart_method)
> >>>
> >>> However, I'm not an expert in oslo.service or oslo.messaging, and one
> >>> of Barbican's core reviewers (thanks Kaitlin!) noted that not many
> >>> other projects start the task before calling wait() on the launcher,
> >>> so I thought I'd check here whether that is the correct fix, or
> >>> whether there's something else odd going on.
> >>>
> >>> Any oslo gurus able to shed light on this?
> >>>
> >>
> >> As far as an oslo.messaging server is concerned, the order of operations
> >> is:
> >>
> >> server.start()
> >> # do stuff until ready to stop the server...
> >> server.stop()
> >> server.wait()
> >>
> >> The final wait blocks until all requests that are in progress when stop()
> >> is called finish and cleanup.
> >
> > Thanks - that makes sense.  So the question is, why would
> > barbican-worker only hang on shutdown when there are multiple workers?
> > Maybe the real bug is somewhere in oslo_service.service.ProcessLauncher
> > and it's not calling start() correctly?
