[openstack-dev] [oslo][barbican][sahara] start RPC service before launcher wait?
aspiers at suse.com
Thu Sep 14 23:33:03 UTC 2017
Thanks a lot for the analysis, and sorry for the slow reply!
Ken Giusti <kgiusti at gmail.com> wrote:
> Hi Adam,
> I think there are a couple of problems here.
> Regardless of worker count, the service.wait() is called before
> service.start(). And from looking at the oslo.service code, the 'wait()'
> method is called after start(), then again after stop(). This doesn't match
> up with the intended use of oslo.messaging.server.wait(), which should only
> be called after .stop().
Hmm, so are you saying that there might be a bug in oslo.service's
usage of oslo.messaging, and that this Sahara bugfix was the wrong
approach?
> Perhaps a bigger issue is that in the multi threaded case all threads
> appear to be calling start, wait, and stop on the same instance of the
> service (oslo.messaging rpc server). At least that's what I'm seeing in my
> muchly reduced test code:
> The log trace shows multiple calls to start, wait, stop via different
> threads to the same TaskServer instance:
> Is that expected?
Unfortunately in the interim, your pastes seem to have vanished - any
chance you could repaste them?
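In the meantime, here is a minimal sketch of how one might check whether several threads are driving the same service object. It is not Ken's test code (that paste is gone) and it does not use oslo.service or oslo.messaging at all: RecordingService, drive() and shared_instance_callers() are hypothetical names invented for illustration, and plain threads stand in for whatever the launcher actually does.

```python
import threading
from collections import defaultdict


class RecordingService:
    """Hypothetical service that records which thread calls each method."""

    def __init__(self):
        self._lock = threading.Lock()
        self.callers = defaultdict(set)

    def _record(self, method):
        # Remember the name of the thread that invoked this method.
        with self._lock:
            self.callers[method].add(threading.current_thread().name)

    def start(self):
        self._record("start")

    def wait(self):
        self._record("wait")

    def stop(self):
        self._record("stop")


def drive(service):
    # Each thread runs the full start/wait/stop sequence on the object
    # it was handed (the call order described in the quoted text).
    service.start()
    service.wait()
    service.stop()


def shared_instance_callers(n_threads=2):
    # Assumption: every thread receives the SAME service instance.
    service = RecordingService()
    threads = [
        threading.Thread(target=drive, args=(service,), name=f"worker-{i}")
        for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return {method: sorted(names) for method, names in service.callers.items()}
```

If more than one thread name shows up against a single method, the instance is being shared, which is the situation the quoted log trace appeared to show.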
> On Mon, Jul 31, 2017 at 9:32 PM, Adam Spiers <aspiers at suse.com> wrote:
> > Ken Giusti <kgiusti at gmail.com> wrote:
> >> On Mon, Jul 31, 2017 at 10:01 AM, Adam Spiers <aspiers at suse.com> wrote:
> >>> I recently discovered a bug where barbican-worker would hang on
> >>> shutdown if queue.asynchronous_workers was changed from 1 to 2:
> >>> https://bugs.launchpad.net/barbican/+bug/1705543
> >>> resulting in a warning like this:
> >>> WARNING oslo_messaging.server [-] Possible hang: stop is waiting for
> >>> start to complete
> >>> I found a similar bug in Sahara:
> >>> https://bugs.launchpad.net/sahara/+bug/1546119
> >>> where the fix was to call start() on the RPC service before making the
> >>> launcher wait() on it, so I ported the fix to Barbican, and it seems
> >>> to work fine:
> >>> https://review.openstack.org/#/c/485755
> >>> I noticed that both projects use ProcessLauncher; barbican uses
> >>> oslo_service.service.launch() which has:
> >>>     if workers is None or workers == 1:
> >>>         launcher = ServiceLauncher(conf, restart_method=restart_method)
> >>>     else:
> >>>         launcher = ProcessLauncher(conf, restart_method=restart_method)
> >>> However, I'm not an expert in oslo.service or oslo.messaging, and one
> >>> of Barbican's core reviewers (thanks Kaitlin!) noted that not many
> >>> other projects start the task before calling wait() on the launcher,
> >>> so I thought I'd check here whether that is the correct fix, or
> >>> whether there's something else odd going on.
> >>> Any oslo gurus able to shed light on this?
> >> As far as an oslo.messaging server is concerned, the order of operations
> >> is:
> >>     server.start()
> >>     # do stuff until ready to stop the server...
> >>     server.stop()
> >>     server.wait()
> >> The final wait blocks until all requests that are in progress when stop()
> >> is called finish and cleanup.
> > Thanks - that makes sense. So the question is, why would
> > barbican-worker only hang on shutdown when there are multiple workers?
> > Maybe the real bug is somewhere in oslo_service.service.ProcessLauncher
> > and it's not calling start() correctly?
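The ordering Ken describes in the quoted text (start, then stop, then a final wait that blocks until in-flight requests finish) can be sketched with a toy class. This is only an illustration under stated assumptions: ToyServer, begin_request(), end_request() and run_lifecycle() are hypothetical names, not the oslo.messaging API, and raising on a misordered call is a choice made here to make the mistake visible rather than a description of how oslo.messaging behaves.

```python
import threading


class ToyServer:
    """Hypothetical stand-in for an RPC server; NOT the oslo.messaging API."""

    def __init__(self):
        self._started = False
        self._stopped = threading.Event()
        self._in_flight = 0
        self._idle = threading.Condition()
        self.calls = []

    def start(self):
        self.calls.append("start")
        self._started = True

    def begin_request(self):
        with self._idle:
            self._in_flight += 1

    def end_request(self):
        with self._idle:
            self._in_flight -= 1
            self._idle.notify_all()

    def stop(self):
        # Illustration only: reject stop() on a server that never started.
        if not self._started:
            raise RuntimeError("stop() called before start()")
        self.calls.append("stop")
        self._stopped.set()

    def wait(self, timeout=None):
        # Illustration only: reject wait() unless stop() has been called.
        if not self._stopped.is_set():
            raise RuntimeError("wait() called before stop()")
        self.calls.append("wait")
        # Block until every in-flight request has finished (or timeout).
        with self._idle:
            return self._idle.wait_for(lambda: self._in_flight == 0, timeout)


def run_lifecycle():
    server = ToyServer()
    server.start()
    server.begin_request()
    # Simulate a request that is still running when stop() is called.
    finisher = threading.Timer(0.05, server.end_request)
    finisher.start()
    server.stop()
    drained = server.wait(timeout=2.0)
    finisher.join()
    return server.calls, drained
```

Calling run_lifecycle() exercises the start/stop/wait order, while calling wait() on a ToyServer that has only been started raises RuntimeError in this toy.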