[openstack-dev] Nova scheduler startup when database is not available
Fox, Kevin M
Kevin.Fox at pnnl.gov
Mon Dec 28 17:58:52 UTC 2015
Another data point: I've had to work around daemons failing fast, as discussed below, when working with docker-compose. It doesn't have nice dependency handling yet, and during the initial bootstrap of all the containers in a pod, some can fail because they don't stick around long enough for their dependencies to initialize. It's kind of painful. Fail fast has some nice features, but retry forever is often very useful in the field.
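To illustrate the kind of workaround this involves, here is a minimal sketch of a wait-for-dependency helper that a container entrypoint could run before starting a fail-fast daemon. The names `wait_until` and `tcp_port_open`, and the host and port in the usage note, are hypothetical and chosen only for illustration; this is not the actual wrapper used in any deployment.

```python
import socket
import time


def tcp_port_open(host, port, connect_timeout=1.0):
    """Return True if a TCP connection succeeds; raises OSError otherwise."""
    with socket.create_connection((host, port), timeout=connect_timeout):
        return True


def wait_until(check, timeout=60.0, interval=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or `timeout` seconds elapse.

    Returns the number of attempts made. Raises TimeoutError if the
    deadline passes first, so the caller still fails eventually instead
    of waiting forever.
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        try:
            if check():
                return attempts
        except OSError:
            # Dependency not reachable yet (e.g. connection refused); retry.
            pass
        if clock() >= deadline:
            raise TimeoutError(
                "dependency not ready after %d attempts" % attempts)
        sleep(interval)
```

An entrypoint might call something like `wait_until(lambda: tcp_port_open("db", 3306))` and only then exec the real daemon. The bounded timeout is a design choice: it keeps some of the determinism of fail fast while tolerating slow bootstrap.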
From: Jay Pipes [jaypipes at gmail.com]
Sent: Monday, December 28, 2015 9:45 AM
To: openstack-dev at lists.openstack.org
Subject: Re: [openstack-dev] Nova scheduler startup when database is not available
On 12/24/2015 02:30 PM, Clint Byrum wrote:
> This is entirely philosophical, but we should think about when it is
> appropriate to adopt which mode of operation.
> There are basically two ways being discussed:
> 1) Fail fast.
> 2) Retry forever.
> Fail fast pros- Immediate feedback for problems, no zombies to worry
> about staying dormant and resurrecting because their configs accidentally
> become right again. Much more determinism. Debugging is much simpler. To
> summarize, it's up and working, or down and not.
> Fail fast cons- Ripple effects. If you have a database or network blip
> while services are starting, you must be aware of all of the downstream
> dependencies and trigger them to start again, or have automation which
> retries forever, giving up some of the benefits of fail-fast. Circular
> dependencies require special workflow to unroll (Service1 aspect A relies
> on aspect X of service2, service2 aspect X relies on aspect B of service1
> which would start fine without service2). To summarize: this moves the
> retry-forever problem to orchestration, and complicates some corner cases.
> Retry forever pros- Circular dependencies are cake. Blips auto-recover.
> Bring-up orchestration is simpler (start everything, wait..). To
> summarize: this makes orchestration simpler.
> Retry forever cons- Non-determinism. It's impossible to just look at the
> thing from outside and know if it is ready to do useful work. May
> actually be hiding intermittent problems, requiring more logging and
> indicators in general to allow analysis.
> I honestly think any distributed system needs both.
So do I. I was proposing only that we deal with unrecoverable
configuration errors on startup in a fail-fast way. I was not proposing
that we remove the existing functionality that retries requests
when an already-up-and-running scheduler service experiences
(typically transient) I/O disruptions to a dependent service like the DB.
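As a rough sketch of the distinction drawn here, the following separates errors that can never succeed (fail fast) from errors that may clear on their own (retry with backoff). The exception classes `ConfigError` and `TransientError` and the function `connect_with_policy` are hypothetical names for illustration; they are not Nova or oslo APIs, and how a real service classifies its errors is an assumption left to the caller.

```python
import time


class ConfigError(Exception):
    """Hypothetical: the configuration can never work as written."""


class TransientError(Exception):
    """Hypothetical: the dependency is temporarily unreachable."""


def connect_with_policy(connect, sleep=time.sleep,
                        initial_delay=1.0, max_delay=30.0):
    """Call connect() until it succeeds, applying a mixed policy.

    ConfigError propagates immediately (fail fast).
    TransientError is retried forever with capped exponential backoff.
    """
    delay = initial_delay
    while True:
        try:
            return connect()
        except ConfigError:
            # Retrying cannot help; surface the problem to the operator.
            raise
        except TransientError:
            # Likely a blip; wait and try again, backing off up to max_delay.
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

The hard part in practice is the classification itself: a wrong database host in the config file and a database that is briefly down can look identical from the client side, which is much of what this thread is debating.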
> That said, the scheduler is, IMO, an _extremely_ complex piece of
> OpenStack, with up and down stream dependencies on several levels (which
> is why redesigning it gets debated so often on openstack-dev).
It's actually not all that complex. Or at least, it doesn't need to be :)
> Making it fail fast would complicate the process of bringing and keeping an
> OpenStack cloud up. There are probably some benefits I haven't thought
> of, but the main benefit you stated would be that one would know when
> their configuration tooling was wrong and giving their scheduler the
> wrong database information, which is not, IMO, a hard problem (one can
> read the config file after all). But I'm sure we could think of more if
> we tried hard.
> I hope I'm not too vague here.. I *want* fail-fast on everything.
> However, I also don't think it can just be a blanket policy without
> requiring everybody to deploy complex orchestration on top.
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe