[openstack-dev] Nova scheduler startup when database is not available
mbayer at redhat.com
Thu Dec 24 00:30:47 UTC 2015
On 12/23/2015 01:32 PM, Jay Pipes wrote:
> On 12/23/2015 12:27 PM, Lars Kellogg-Stedman wrote:
>> I've been looking into the startup constraints involved when launching
>> Nova services with systemd using Type=notify (which causes systemd to
>> wait for an explicit notification from the service before considering
>> it to be "started"). Some services (e.g., nova-conductor) will happily
>> "start" even if the backing database is currently unavailable (and
>> will enter a retry loop waiting for the database).
>> Other services -- specifically, nova-scheduler -- will block waiting
>> for the database *before* providing systemd with the necessary
>> notification.
>> nova-scheduler blocks because it wants to initialize a list of
>> available aggregates (in scheduler.host_manager.HostManager.__init__),
>> which it gets by calling objects.AggregateList.get_all.
>> Does it make sense to block service startup at this stage? The
>> database disappearing during runtime isn't a hard error -- we will
>> retry and reconnect when it comes back -- so should the same situation
>> at startup be a hard error? As an operator, I am more interested in
>> "did my configuration files parse correctly?" at startup, and would
>> generally prefer the service to start (and permit any dependent
>> services to start) even when the database isn't up (because that's
>> probably a situation of which I am already aware).
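For readers unfamiliar with Type=notify: the "explicit notification" is a small datagram sent to the socket that systemd names in the NOTIFY_SOCKET environment variable. Below is a minimal sketch of that handshake using only the Python standard library. The function name sd_notify is chosen here for illustration, and this is not how Nova itself sends the message.

```python
import os
import socket


def sd_notify(state: str = "READY=1") -> bool:
    """Send a state string to systemd's notification socket, if one is set.

    Returns False when NOTIFY_SOCKET is absent (the process was not
    started under Type=notify), True once the datagram has been sent.
    """
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    if addr.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket,
        # which is addressed with a leading NUL byte instead.
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode("utf-8"), addr)
    return True
```

Anything that blocks before this call (such as a database query in an __init__) keeps systemd waiting, which is the behavior described above for nova-scheduler.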
> If your configuration file parsed correctly but has the wrong database
> connection URI, what good is the service in an active state? It won't be
> able to do anything at all.
This is true, but to be fair, Nova doesn't work like this at all, at
least not in nova/db/sqlalchemy/api.py. It is very intentionally
designed to *not* connect to the database until an API call is first
accessed, to the extent that it does an end-run around oslo.db's
create_engine() feature which itself does a "test" connection when it is
called (FTR, SQLAlchemy's create_engine() that is called by oslo.db is
in fact a lazy-initializing function). I find it quite awkward
overall that oslo.db reverses SQLAlchemy's "laziness", but then nova and
others re-reverse *back* to "laziness", and do so at the expense of no
longer letting oslo.db's create_engine() receive its configuration up front.
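To make the lazy-versus-eager distinction concrete, here is a minimal standard-library sketch. LazyEngine and its eager flag are hypothetical names used only for illustration; this is not the oslo.db or SQLAlchemy API, just a stand-in for the two behaviors, with sqlite3 in place of a real database server.

```python
import sqlite3


class LazyEngine:
    """Hypothetical stand-in for an engine object (not a real library class)."""

    def __init__(self, path, eager=False):
        self.path = path
        self._conn = None
        if eager:
            # Eager mode: make a "test" connection at construction time,
            # so an unreachable database fails right here.
            self.connect()

    def connect(self):
        # Lazy mode: the first caller triggers the connection, and is
        # also the first to see any connection error.
        if self._conn is None:
            self._conn = sqlite3.connect(self.path)
        return self._conn
```

With a bad path, LazyEngine(path) constructs without error and only connect() raises, whereas LazyEngine(path, eager=True) raises immediately.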
In the reworked enginefacade API I went through a lot of effort to
replicate this behavior. It would be nice if all OpenStack apps could
just pick one paradigm and stick with it so that we can just make
oslo.db do *one* pattern and that's all (probably too late though).
> This is why I think it's better to have hard checks, such as for connections,
> on startup and not have services active if they won't be able to do
> anything useful.
>> It would be relatively easy to have the scheduler lazy-load the list
>> of aggregates on first reference, rather than at __init__.
> Sure, but if the root cause of the issue is a misconfigured
> connection string, then that lazy-load will just bomb out
> and the scheduler will be useless anyway. I'd rather have a
> fail-early/fast occur here than a fail-late.
>> I'm not
>> familiar enough with the nova code to know if there would be any
>> undesirable implications of this behavior. We're already punting
>> initializing the list of instances to an asynchronous task in order to
>> avoid blocking service startup.
>> Does it make sense to permit nova-scheduler to complete service
>> startup in the absence of the database (and then retry the connection
>> in the background)?
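The lazy-load idea discussed above could be sketched roughly as follows. LazyAggregates is a hypothetical helper, not Nova code; the loader stands in for a call such as objects.AggregateList.get_all, and ConnectionError stands in for whatever exception the database layer would actually raise.

```python
import time


class LazyAggregates:
    """Hypothetical helper: defer a loader call until first access, with retries."""

    def __init__(self, loader, retry_interval=1.0, max_attempts=None,
                 sleep=time.sleep):
        self._loader = loader
        self._retry_interval = retry_interval
        self._max_attempts = max_attempts  # None means retry indefinitely
        self._sleep = sleep
        self._loaded = False
        self._value = None

    def get(self):
        attempts = 0
        while not self._loaded:
            attempts += 1
            try:
                self._value = self._loader()
                self._loaded = True
            except ConnectionError:
                # Give up once the attempt budget is spent; otherwise wait
                # and try again.
                if (self._max_attempts is not None
                        and attempts >= self._max_attempts):
                    raise
                self._sleep(self._retry_interval)
        return self._value
```

Constructing the helper never touches the database, so service startup would not block on it. Setting max_attempts turns a permanently wrong connection string into a late but explicit failure rather than an endless retry loop, which is the trade-off debated in this thread.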