On Wed, 2015-10-07 at 23:17 -0600, Chris Friesen wrote:
> Why is it inevitable?

Well, I would say that this is probably a consequence of the CAP[1]
theorem.

> Theoretically if the DB knew about what resources were originally available and 
> what resources have been consumed, then it should be able to allocate resources 
> race-free (possibly with some retries involved if racing against other 
> schedulers updating the DB, but that would be internal to the scheduler itself).

The problem is, it can't.  The scheduler may be making the decision at
the same time that an update from a compute node is in flight, meaning
that the scheduler is missing (at least) one piece of information.
Putting a database in the middle only makes it more likely that an
in-flight update is missed, because the latency of the database update
is added on top.  Multiple schedulers make it worse still, since each
one's decisions are themselves in-flight information critical to the
others' scheduling decisions.  If you employ some sort of locking to
try to mitigate all this, you've just effectively thrown away the
scalability that deploying multiple schedulers was supposed to buy you.
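
To make the race concrete, here is a minimal Python sketch.  The names
(Host, Scheduler, claim) are made up for illustration and are not the
actual scheduler code; the only assumption is that each scheduler
decides from a snapshot of host state, while the authoritative check
happens later on the compute node.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    free_mb: int


class Scheduler:
    """Hypothetical scheduler that decides from a cached snapshot."""

    def __init__(self, hosts):
        # Snapshot of host state; it starts going stale immediately.
        self.view = {h.name: h.free_mb for h in hosts}

    def pick(self, need_mb):
        # Choose the first host that fits, according to the local view only.
        for name, free in self.view.items():
            if free >= need_mb:
                self.view[name] = free - need_mb
                return name
        return None


def claim(host, need_mb):
    """Authoritative check, standing in for the compute node itself."""
    if host.free_mb < need_mb:
        return False
    host.free_mb -= need_mb
    return True


hosts = {"node1": Host("node1", 4096)}

# Two schedulers take their snapshots before either placement lands.
a = Scheduler(hosts.values())
b = Scheduler(hosts.values())

# Each decision is correct for the view it was made from...
choice_a = a.pick(3072)
choice_b = b.pick(3072)

# ...but only one of them can actually be honoured by the host.
result_a = claim(hosts[choice_a], 3072)
result_b = claim(hosts[choice_b], 3072)

print(choice_a, choice_b, result_a, result_b)
```

Both schedulers pick node1 and the second claim is refused, so that
request has to be retried somewhere else.  Serializing pick() and
claim() behind a single lock would avoid the refusal, but only by
making the schedulers take turns.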

[1] https://en.wikipedia.org/wiki/CAP_theorem
