[openstack-dev] Nova scheduler startup when database is not available

Clint Byrum clint at fewbar.com
Tue Dec 29 04:19:26 UTC 2015

Excerpts from Jay Pipes's message of 2015-12-28 09:45:39 -0800:
> On 12/24/2015 02:30 PM, Clint Byrum wrote:
> > This is entirely philosophical, but we should think about when it is
> > appropriate to adopt which mode of operation.
> >
> > There are basically two ways being discussed:
> >
> > 1) Fail fast.
> > 2) Retry forever.
> >
> > Fail fast pros- Immediate feedback for problems, no zombies to worry
> > about staying dormant and resurrecting because their configs accidentally
> > become right again. Much more determinism. Debugging is much simpler. To
> > summarize, it's up and working, or down and not.
> >
> > Fail fast cons- Ripple effects. If you have a database or network blip
> > while services are starting, you must be aware of all of the downstream
> > dependencies and trigger them to start again, or have automation which
> > retries forever, giving up some of the benefits of fail-fast. Circular
> > dependencies require special workflow to unroll (service1 aspect A relies
> > on aspect X of service2, service2 aspect X relies on aspect B of service1,
> > which would start fine without service2). To summarize: this moves the
> > retry-forever problem to orchestration, and complicates some corner cases.
> >
> > Retry forever pros- Circular dependencies are cake. Blips auto-recover.
> > Bring-up orchestration is simpler (start everything, wait..). To
> > summarize: this makes orchestration simpler.
> >
> > Retry forever cons- Non-determinism. It's impossible to just look at the
> > thing from outside and know if it is ready to do useful work. May
> > actually be hiding intermittent problems, requiring more logging and
> > indicators in general to allow analysis.
> >
> > I honestly think any distributed system needs both.
> So do I. I was proposing only that we deal with unrecoverable
> configuration errors on startup in a fail-fast way. I was not proposing
> that we remove the existing functionality that retries requests on the
> occasions when an already-up-and-running scheduler service experiences
> (typically transient) I/O disruptions to a dependent service like the DB
> or MQ.

Even during startup, failing fast on remote dependencies complicates
things. There's no dependency resolver for the entire cloud, as Kevin
Fox suggested.
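
As a rough illustration of the split being discussed (fail fast on
unrecoverable configuration errors, keep retrying on transient failures
of a remote dependency), here is a minimal Python sketch. The names
ConfigError, TransientError and start_service are hypothetical and are
not taken from Nova or any OpenStack library; how a real service would
tell the two kinds of error apart is an assumption left out here.

```python
import time


class ConfigError(Exception):
    """Unrecoverable: retrying cannot fix a bad setting (hypothetical)."""


class TransientError(Exception):
    """Possibly recoverable, e.g. the database is not reachable yet (hypothetical)."""


def start_service(connect, max_delay=30.0, sleep=time.sleep, log=print):
    """Call connect() until it succeeds and return the number of attempts.

    ConfigError propagates immediately (fail fast).
    TransientError is retried forever with capped exponential backoff.
    """
    delay = 1.0
    attempts = 0
    while True:
        attempts += 1
        try:
            connect()
        except TransientError as exc:
            # Log every retry so the waiting state is visible from outside.
            log(f"attempt {attempts} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
            delay = min(delay * 2, max_delay)
        else:
            log(f"ready after {attempts} attempt(s)")
            return attempts
```

Logging each retry and the final "ready" line is one way to soften the
non-determinism of retry-forever: an operator can at least see that the
service is waiting rather than working.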

> <snip>
> > That said, the scheduler is, IMO, an _extremely_ complex piece of
> > OpenStack, with up and down stream dependencies on several levels (which
> > is why redesigning it gets debated so often on openstack-dev).
> It's actually not all that complex. Or at least, it doesn't need to be :)

On this we definitely agree.
