[openstack-dev] Nova scheduler startup when database is not available

Sylvain Bauza sbauza at redhat.com
Thu Dec 24 09:46:03 UTC 2015



On 24/12/2015 02:35, Morgan Fainberg wrote:
>
>
> On Wed, Dec 23, 2015 at 10:32 AM, Jay Pipes <jaypipes at gmail.com 
> <mailto:jaypipes at gmail.com>> wrote:
>
>     On 12/23/2015 12:27 PM, Lars Kellogg-Stedman wrote:
>
>         I've been looking into the startup constraints involved when
>         launching
>         Nova services with systemd using Type=notify (which causes
>         systemd to
>         wait for an explicit notification from the service before
>         considering
>         it to be "started".  Some services (e.g., nova-conductor) will
>         happily
>         "start" even if the backing database is currently unavailable (and
>         will enter a retry loop waiting for the database).
>
>         Other services -- specifically, nova-scheduler -- will block
>         waiting
>         for the database *before* providing systemd with the necessary
>         notification.
>
>         nova-scheduler blocks because it wants to initialize a list of
>         available aggregates (in
>         scheduler.host_manager.HostManager.__init__),
>         which it gets by calling objects.AggregateList.get_all.
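(For anyone following along: the systemd notification that Type=notify waits for is just a datagram sent to the socket named in the NOTIFY_SOCKET environment variable. A minimal sketch of the service side, with names of my own choosing, not Nova's actual code:)

```python
import os
import socket

def notify_ready():
    """Send the systemd Type=notify readiness message, if running under systemd.

    Returns True when the READY=1 datagram was sent, False when NOTIFY_SOCKET
    is unset (i.e. the process was not started with Type=notify).
    """
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    # A leading '@' denotes an abstract-namespace socket; it maps to a NUL byte.
    if addr.startswith("@"):
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(b"READY=1", addr)
    return True
```

Until that datagram is sent, systemd keeps the unit in "activating", which is exactly why anything blocking before the notification (like the aggregate fetch) delays dependent units.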
>
>         Does it make sense to block service startup at this stage?  The
>         database disappearing during runtime isn't a hard error -- we will
>         retry and reconnect when it comes back -- so should the same
>         situation
>         at startup be a hard error?  As an operator, I am more
>         interested in
>         "did my configuration files parse correctly?" at startup, and
>         would
>         generally prefer the service to start (and permit any dependent
>         services to start) even when the database isn't up (because that's
>         probably a situation of which I am already aware).
>
>
>     If your configuration file parsed correctly but has the wrong
>     database connection URI, what good is the service in an active
>     state? It won't be able to do anything at all.
>
>     This is why I think it's better to have hard checks like for
>     connections on startup and not have services active if they won't
>     be able to do anything useful.
>
>
> Are you advocating that scheduler bails out and ceases to run or that 
> it doesn't mark itself as active? I am in favour of the second 
> scenario but not the first. There are cases where it would be nice to 
> start the scheduler and have it at least report "hey I can't contact 
> the DB" but not mark itself active, but continue to run and on 
> <interval> report/try to reconnect.
>
> It isn't clear which level of "hard check" you're advocating in your 
> response and I want to clarify for the sake of conversation.

So, to be clear: the scheduler loads the list of aggregates and 
instances from the DB at startup so that filters can look them up 
in memory instead of calling the DB every time they want to check them.
While that cache is only needed by those particular filters, it still 
means that if the DB is down, the scheduler can't do useful work: even 
if the service is running, any request made to the scheduler would 
return an exception.
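Lars's lazy-load suggestion would look roughly like this (a sketch with invented names, not the real HostManager code): defer the objects.AggregateList.get_all() call until a filter first needs the list, so startup doesn't block, and a DB outage surfaces per-request instead.

```python
class HostManager:
    """Sketch of lazy-loading the aggregate list instead of fetching it
    in __init__ (names are illustrative, not the actual Nova code)."""

    def __init__(self, db_get_all_aggregates):
        # Store the loader; do NOT call the DB here, so service startup
        # (and the systemd READY notification) doesn't block on the DB.
        self._get_all = db_get_all_aggregates
        self._aggregates = None

    @property
    def aggregates(self):
        # First reference triggers the DB call; if the DB is down this
        # raises and the *request* fails, not the service startup.
        # A failed attempt leaves the cache empty, so the next request
        # retries naturally once the DB is back.
        if self._aggregates is None:
            self._aggregates = self._get_all()
        return self._aggregates
```

Note the trade-off Jay points out below: with a misconfigured connection URI, this sketch happily reports "started" and only fails when the first scheduling request arrives.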

So, which is better, do you think? A scheduler whose error log says 
"heh cool, the DB is bad, but okay, you can call me", or one that says 
"meh, you have a config issue, please review it"?

To be honest, we could probably do a better job of documenting why the 
scheduler refuses to start when it can't reach the DB, but I'm not sure 
it's a good idea to make the scheduler resilient to the DB being down.

-Sylvain

>         It would be relatively easy to have the scheduler lazy-load
>         the list
>         of aggregates on first references, rather than at __init__.
>
>
>     Sure, but if the root cause of the issue is a problem due to
>     misconfigured connection string, then that lazy-load will just
>     bomb out and the scheduler will be useless anyway. I'd rather have
>     a fail-early/fast occur here than a fail-late.
>
>     Best,
>     -jay
>
>         I'm not
>         familiar enough with the nova code to know if there would be any
>         undesirable implications of this behavior.  We're already punting
>         initializing the list of instances to an asynchronous task in
>         order to
>         avoid blocking service startup.
>
>         Does it make sense to permit nova-scheduler to complete service
>         startup in the absence of the database (and then retry the
>         connection
>         in the background)?
>
>
>
>         __________________________________________________________________________
>         OpenStack Development Mailing List (not for usage questions)
>         Unsubscribe:
>         OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>         <http://OpenStack-dev-request@lists.openstack.org?subject:unsubscribe>
>         http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
>
>
>
>
