[openstack-dev] Scheduler proposal

Ian Wells ijw.ubuntu at cack.org.uk
Tue Oct 13 16:24:42 UTC 2015

On 12 October 2015 at 21:18, Clint Byrum <clint at fewbar.com> wrote:

> We _would_ keep a local cache of the information in the schedulers. The
> centralized copy of it is to free the schedulers from the complexity of
> having to keep track of it as state, rather than as a cache. We also don't
> have to provide a way for on-demand stat fetching to seed scheduler 0.

I'm not sure that actually changes.  On restart, a scheduler wouldn't have
enough knowledge to schedule, but the other schedulers are not restarting
and can service requests while it waits for data.  Using ZK, that takes
fewer seconds because it can get a braindump, but during that window in
either case the system works at (n-1)/n capacity assuming queries are only done in

Also, you seemed to be suggesting that the ZK option would take less
memory, but it seems to me it would take more.  You can't schedule without
a relatively complete set of information or some relatively intricate
query language, which I didn't think ZK was up to (but I'm open to
correction there, certainly).  That implies that when you notify a
scheduler of a change to the data model, it's going to grab the fresh data
and keep it locally.

> > Also, the notification path here is that the compute host notifies ZK and
> > ZK notifies many schedulers, assuming they're all capable of handling all
> > queries.  That is in fact N * (M+1) messages, which is slightly more than
> > if there's no central node, as it happens.  There are fewer *channels*, but
> > more messages.  (I feel like I'm overlooking something here, but I can't
> > pick out the flaw...)  Yes, RMQ will suck at this - but then let's talk
> > about better messaging rather than another DB type.
> >
> You're calling transactions messages, and that's not really fair to
> messaging or transactions. :)

I was actually talking about the number of messages crossing the network.
Your point is that the transaction with ZK is heavier weight than the
update processing at the schedulers, I think.  But then removing ZK as a
nexus removes that transaction, so both the number of messages and the
number of transactions go down.
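
To make the counting concrete, here is a minimal sketch.  It assumes N
compute hosts each send one update per round and that every one of the M
schedulers needs to see every update; the function names are made up for
illustration and nothing here reflects actual nova or ZK behaviour.

```python
def messages_via_central_store(n_hosts: int, m_schedulers: int) -> int:
    """Messages per round when hosts write to a central store (e.g. ZK).

    Assumption: each host sends 1 update to the store, and the store then
    notifies each of the M schedulers, giving N * (M + 1) in total.
    """
    return n_hosts * (m_schedulers + 1)


def messages_direct(n_hosts: int, m_schedulers: int) -> int:
    """Messages per round when each host notifies every scheduler itself.

    Assumption: no central node, so each host sends M messages: N * M.
    """
    return n_hosts * m_schedulers


if __name__ == "__main__":
    n, m = 1000, 3
    print("via central store:", messages_via_central_store(n, m))
    print("direct:           ", messages_direct(n, m))
```

The difference is exactly N messages per round (the host-to-store hop),
which is why the central node costs slightly more messages while needing
fewer channels.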

> However, it's important to note that in
> this situation, compute nodes do not have to send anything anywhere if
> nothing has changed, which is very likely the case for "full" compute
> nodes, and certainly will save many many redundant messages.

Now that's a fair comment, certainly, and would drastically reduce the
number of messages in the system if we can keep the nodes from updating
just because their free memory has changed by a couple of pages.
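
A minimal sketch of that kind of suppression, assuming a compute node
tracks the last value it reported and only sends again once free memory
has moved by some threshold.  The class name, the field names and the 64
MB default are all hypothetical; this is not how nova does it today.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdReporter:
    """Decides whether a free-memory reading is worth sending."""

    min_delta_mb: int = 64                   # hypothetical threshold
    last_reported_mb: Optional[int] = None   # nothing reported yet

    def should_report(self, free_mb: int) -> bool:
        # Always report the first reading; afterwards report only when the
        # value has drifted at least min_delta_mb from the last one sent.
        if (self.last_reported_mb is None
                or abs(free_mb - self.last_reported_mb) >= self.min_delta_mb):
            self.last_reported_mb = free_mb
            return True
        return False


if __name__ == "__main__":
    reporter = ThresholdReporter()
    for reading in (1000, 1001, 1070, 1070):
        print(reading, reporter.should_report(reading))
```

Comparing against the last *reported* value rather than the previous
reading means slow drift still triggers an update eventually.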

> Forgive me
> if nova already makes this optimization somehow, it didn't seem to when
> I was tinkering a year ago.

Not as far as I know, it doesn't.

> There is also the complexity of designing a scheduler which is fault
> tolerant and scales economically. What we have now will overtax the
> message bus and the database as the number of compute nodes increases.
> We want to get O(1) complexity out of that, but we're getting O(N)
> right now.

O(N) will work providing O is small. ;)

I think our cost currently lies in doing 1 MySQL DB update per node per
minute, and one really quite mad query per schedule.  I agree that ZK would
be less costly for that in both respects, which is really more about
lowering O than N.  I'm wondering if we can do better still, that's all,
but we both agree that this approach would work.
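
As a rough back-of-the-envelope sketch of the update side of that cost,
assuming one DB update per node per reporting interval (60 seconds, per
the figure above) and nothing else hitting the database; the function
name and the node counts are purely illustrative.

```python
def db_updates_per_second(n_nodes: int, interval_s: float = 60.0) -> float:
    """Steady-state write rate if every node updates once per interval."""
    return n_nodes / interval_s


if __name__ == "__main__":
    for nodes in (100, 1000, 10000):
        rate = db_updates_per_second(nodes)
        print(f"{nodes} nodes: {rate:.1f} updates/s")
```

The rate is still linear in the number of nodes; what a cheaper store
changes is the cost of each individual update, i.e. the constant.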