Open Stack

Mon Dec 28 17:45:39 UTC 2015

On 12/24/2015 02:30 PM, Clint Byrum wrote:
> This is entirely philosophical, but we should think about when it is
> appropriate to adopt which mode of operation.
>
> There are basically two ways being discussed:
>
> 1) Fail fast.
> 2) Retry forever.
>
> Fail fast pros- Immediate feedback for problems, no zombies to worry
> about staying dormant and resurrecting because their configs accidentally
> become right again. Much more determinism. Debugging is much simpler. To
> summarize, it's up and working, or down and not.
>
> Fail fast cons- Ripple effects. If you have a database or network blip
> while services are starting, you must be aware of all of the downstream
> dependencies and trigger them to start again, or have automation which
> retries forever, giving up some of the benefits of fail-fast. Circular
> dependencies require special workflow to unroll (Service1 aspect A relies
> on aspect X of service2, service2 aspect X relies on aspect B of service1
> which would start fine without service2).  To summarize: this moves the
> retry-forever problem to orchestration, and complicates some corner cases.
>
> Retry forever pros- Circular dependencies are cake. Blips auto-recover.
> Bring-up orchestration is simpler (start everything, wait..). To
> summarize: this makes orchestration simpler.
>
> Retry forever cons- Non-determinism. It's impossible to just look at the
> thing from outside and know if it is ready to do useful work. May
> actually be hiding intermittent problems, requiring more logging and
> indicators in general to allow analysis.
>
> I honestly think any distributed system needs both.

So do I. I was proposing only that we deal with unrecoverable
configuration errors on startup in a fail-fast way. I was not proposing
that we remove the existing functionality that retries requests in the
case where an already-up-and-running scheduler service experiences
(typically transient) I/O disruptions to a dependent service like the DB
or MQ.
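
To make that split concrete, here is a minimal sketch, not actual Nova
code. Every name in it (ConfigError, TransientIOError, validate_config,
call_with_retry) is hypothetical and made up for illustration; the only
assumption is that a bad connection string can be detected at startup
without talking to the database.

```python
import time


class ConfigError(Exception):
    """Unrecoverable: retrying cannot fix a bad configuration value."""


class TransientIOError(Exception):
    """Recoverable: the DB or MQ may come back on its own."""


def validate_config(conf):
    # Fail fast at startup: a missing or malformed connection string
    # will never become valid by waiting, so raise immediately.
    conn = conf.get("connection")
    if not conn or "://" not in conn:
        raise ConfigError("invalid connection string: %r" % (conn,))
    return conn


def call_with_retry(func, max_attempts=None, delay=1.0, sleep=time.sleep):
    # Retry only transient I/O failures. max_attempts=None retries
    # forever; any other exception type propagates on the first failure.
    attempt = 0
    while True:
        attempt += 1
        try:
            return func()
        except TransientIOError:
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(delay)
```

In this sketch the service would call validate_config() once at startup
and exit on ConfigError, while wrapping DB or MQ requests made by the
running service in call_with_retry().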

<snip>
> That said, the scheduler is, IMO, an _extremely_ complex piece of
> OpenStack, with up and down stream dependencies on several levels (which
> is why redesigning it gets debated so often on openstack-dev).

It's actually not all that complex. Or at least, it doesn't need to be :)

Best,
-jay

> Making
> it fail fast would complicate the process of bringing and keeping an
> OpenStack cloud up. There are probably some benefits I haven't thought
> of, but the main benefit you stated would be that one would know when
> their configuration tooling was wrong and giving their scheduler the
> wrong database information, which is not, IMO, a hard problem (one can
> read the config file after all). But I'm sure we could think of more if
> we tried hard.
>
> I hope I'm not too vague here.. I *want* fail-fast on everything.
> However, I also don't think it can just be a blanket policy without
> requiring everybody to deploy complex orchestration on top.
>


[openstack-dev] Nova scheduler startup when database is not available
