[openstack-dev] Scheduler proposal

Clint Byrum clint at fewbar.com
Tue Oct 13 04:18:12 UTC 2015

Excerpts from Ian Wells's message of 2015-10-12 19:43:48 -0700:
> On 11 October 2015 at 00:23, Clint Byrum <clint at fewbar.com> wrote:
> > I'm in, except I think this gets simpler with an intermediary service
> > like ZK/Consul to keep track of this 1GB of data and replace the need
> > for 6, and changes the implementation of 5 to "updates its record and
> > signals its presence".
> >
> OK, so we're not keeping a copy of the information in the schedulers,
> saving us 5GB of information, but we are notifying the schedulers of the
> updated information so that they can update their copies?

We _would_ keep a local cache of the information in the schedulers. The
centralized copy of it is to free the schedulers from the complexity of
having to keep track of it as state, rather than as a cache. We also don't
have to provide a way for on-demand stat fetching to seed scheduler 0.
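
A rough sketch of that "cache, not state" idea, in case it helps. This is
an in-memory stand-in, not the ZK or Consul API; CentralStore,
SchedulerCache and the free_ram_mb field are names made up for
illustration. A scheduler that starts late seeds itself from the central
copy and then just follows watch notifications:

```python
from typing import Callable, Dict, List

Record = Dict[str, int]


class CentralStore:
    """Hypothetical in-memory stand-in for a ZK/Consul-like service."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}
        self._watchers: List[Callable[[str, Record], None]] = []

    def snapshot(self) -> Dict[str, Record]:
        # Return copies so callers cannot mutate the authoritative records.
        return {host: dict(rec) for host, rec in self._records.items()}

    def watch(self, callback: Callable[[str, Record], None]) -> None:
        self._watchers.append(callback)

    def put(self, host: str, record: Record) -> None:
        # One write from the compute node, then one notification per watcher.
        self._records[host] = dict(record)
        for notify in self._watchers:
            notify(host, dict(record))


class SchedulerCache:
    """Scheduler-side view: disposable, rebuilt from the store on start."""

    def __init__(self, store: CentralStore) -> None:
        self.hosts = store.snapshot()   # seed from the central copy
        store.watch(self._on_change)    # then follow changes

    def _on_change(self, host: str, record: Record) -> None:
        self.hosts[host] = record


store = CentralStore()
store.put("compute-1", {"free_ram_mb": 2048})
late_scheduler = SchedulerCache(store)  # started after the first update
store.put("compute-2", {"free_ram_mb": 4096})
```

The late scheduler ends up knowing about both hosts without asking any
compute node for its stats. A real client would also have to handle the
window between taking the snapshot and registering the watch, which this
single-threaded sketch ignores.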

> Also, the notification path here is that the compute host notifies ZK and
> ZK notifies many schedulers, assuming they're all capable of handling all
> queries.  That is in fact N * (M+1) messages, which is slightly more than
> if there's no central node, as it happens.  There are fewer *channels*, but
> more messages.  (I feel like I'm overlooking something here, but I can't
> pick out the flaw...)  Yes, RMQ will suck at this - but then let's talk
> about better messaging rather than another DB type.

You're calling transactions messages, and that's not really fair to
messaging or transactions. :)

If N==Number of Schedulers, then recording a change in a compute node's
available resources costs 1 transaction, plus N "watch" notifications
to the schedulers. However, it's important to note that in
this situation, compute nodes do not have to send anything anywhere if
nothing has changed, which is very likely the case for "full" compute
nodes, and that will certainly save many redundant messages. Forgive me
if nova already makes this optimization somehow; it didn't seem to when
I was tinkering a year ago.
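
To put toy numbers on that: the sketch below counts traffic when a node
writes only on change. count_traffic, the host names and the report values
are all hypothetical, and it models the counting only, not any real nova
or ZK behaviour.

```python
from typing import Dict, List, Tuple


def count_traffic(reports: List[Tuple[str, int]],
                  num_schedulers: int) -> Tuple[int, int]:
    """Count store transactions and watch notifications.

    reports is a sequence of (host, free_ram_mb) values, one per periodic
    check on a compute node. A node writes only when its value differs
    from the last one it wrote.
    """
    last_written: Dict[str, int] = {}
    transactions = 0
    watch_events = 0
    for host, free_ram_mb in reports:
        if last_written.get(host) == free_ram_mb:
            continue  # nothing changed, so the node sends nothing
        last_written[host] = free_ram_mb
        transactions += 1               # one write to the central store
        watch_events += num_schedulers  # one watch event per scheduler
    return transactions, watch_events


reports = [
    ("full-node", 0),
    ("busy-node", 4096),
    ("full-node", 0),      # unchanged, skipped
    ("busy-node", 2048),
    ("full-node", 0),      # unchanged, skipped
]
transactions, watch_events = count_traffic(reports, num_schedulers=3)
```

With 3 schedulers, the five periodic checks turn into 3 transactions and
9 watch notifications; the "full" node costs one write and is then silent.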

> Again, the saving here seems to be that a freshly started scheduler can get
> an infodump rather than waiting 60s to be useful.  I wonder if that's
> necessary.

There is also the complexity of designing a scheduler which is fault
tolerant and scales economically. What we have now will overtax the
message bus and the database as the number of compute nodes increases.
We want to get O(1) complexity out of that, but we're getting O(N)
right now.
