[openstack-dev] Scheduler proposal

Clint Byrum clint at fewbar.com
Tue Oct 13 16:51:15 UTC 2015

Excerpts from Ian Wells's message of 2015-10-13 09:24:42 -0700:
> On 12 October 2015 at 21:18, Clint Byrum <clint at fewbar.com> wrote:
> > We _would_ keep a local cache of the information in the schedulers. The
> > centralized copy of it is to free the schedulers from the complexity of
> > having to keep track of it as state, rather than as a cache. We also don't
> > have to provide a way for on-demand stat fetching to seed scheduler 0.
> >
> I'm not sure that actually changes.  On restart of a scheduler, it wouldn't
> have enough knowledge to schedule, but the other schedulers are unaffected
> and can service requests while it waits for data.  Using ZK, that window is
> shorter because the restarted scheduler can get a brain dump, but in either
> case the system works at (n-1)/n capacity during that window, assuming
> queries are only done in memory.

Yeah, I'd put this as a 3 on the 1-10 scale of optimizations. Not a
reason to do it on its own, but an assessment that it improves the
efficiency of starting new schedulers. It also has the benefit that if
you choose to run just one scheduler, you can start a new one and it
will walk the tree and begin scheduling immediately thereafter.

> Also, you seemed to be touting the ZK option as taking less memory, but it
> seems it would take more.  You can't schedule without a relatively complete
> set of information or some relatively intricate query language, which I
> didn't think ZK was up to (but I'm open to correction there, certainly).
> That implies that when you notify a scheduler of a change to the data
> model, it's going to grab the fresh data and keep it locally.

If I gave that impression, I was being unclear, and I'm sorry for that. I do think
the cache of potential scheduling targets and stats should fit in RAM
easily for even 100,000 nodes, including indexes for fast lookups.
The intermediary is entirely to alleviate the need for complicated sync
protocols to be implemented in the scheduler and compute agent. RAM is
cheap, time is not.
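As a rough back-of-envelope check of the "fits in RAM for 100,000 nodes" claim, here is a sketch using purely assumed per-node record sizes (illustrative numbers, not nova's actual data model):

```python
# Hypothetical sizing of an in-RAM cache of scheduling targets.
# Field sizes below are assumptions for illustration only.
PER_NODE_STATS = 2048   # bytes: CPU/RAM/disk stats, flags, capabilities
INDEX_OVERHEAD = 512    # bytes: per-node share of fast-lookup indexes
NODES = 100_000

total_bytes = NODES * (PER_NODE_STATS + INDEX_OVERHEAD)
print(f"~{total_bytes / 2**20:.0f} MiB")  # a few hundred MiB at most
```

Even with generous per-node overheads this lands in the hundreds-of-MiB range, which is trivial for a scheduler host.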

> > > Also, the notification path here is that the compute host notifies ZK and
> > > ZK notifies many schedulers, assuming they're all capable of handling all
> > > queries.  That is in fact N * (M+1) messages, which is slightly more than
> > > if there's no central node, as it happens.  There are fewer *channels*,
> > > but
> > > more messages.  (I feel like I'm overlooking something here, but I can't
> > > pick out the flaw...)  Yes, RMQ will suck at this - but then let's talk
> > > about better messaging rather than another DB type.
> > >
> >
> > You're calling transactions messages, and that's not really fair to
> > messaging or transactions. :)
> >
> I was actually talking about the number of messages crossing the network.
> Your point is that the transaction with ZK is heavier weight than the
> update processing at the schedulers, I think.  But then removing ZK as a
> nexus removes that transaction, so both the number of messages and the
> number of transactions go down.

Point taken and agreed.
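For concreteness, the message counting from earlier in the thread can be sketched like this (purely illustrative numbers; N compute hosts, M schedulers):

```python
# Compare total network messages per update round for the two topologies
# discussed above. This is just the arithmetic from the thread, not a model
# of any real deployment.

def messages_via_zk(n_hosts: int, m_schedulers: int) -> int:
    # one message host -> ZK, then ZK fans out to every scheduler: N * (M+1)
    return n_hosts * (1 + m_schedulers)

def messages_direct(n_hosts: int, m_schedulers: int) -> int:
    # each host notifies every scheduler itself: N * M
    return n_hosts * m_schedulers

print(messages_via_zk(1000, 8))   # 9000
print(messages_direct(1000, 8))  # 8000
```

The central node costs exactly N extra messages per round, in exchange for fewer channels and no sync protocol in the endpoints.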

> > However, it's important to note that in
> > this situation, compute nodes do not have to send anything anywhere if
> > nothing has changed, which is very likely the case for "full" compute
> > nodes, and certainly will save many many redundant messages.
> Now that's a fair comment, certainly, and would drastically reduce the
> number of messages in the system if we can keep the nodes from updating
> just because their free memory has changed by a couple of pages.

Indeed, an optimization like this is actually orthogonal to the management
of the corpus of state from all hosts. Hosts should in fact be able
to optimize for this already. Of course, then you lose the heartbeat,
which might be more valuable than the savings in communication load.
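A minimal sketch of what such an agent-side filter might look like, keeping a periodic heartbeat so liveness tracking survives the silence (all names and thresholds here are hypothetical, not nova code):

```python
import time

# Hypothetical compute-agent reporting filter: suppress updates when stats
# are effectively unchanged, but still send a periodic heartbeat.
MEM_TOLERANCE_MB = 64        # ignore free-memory jitter of a few pages
HEARTBEAT_INTERVAL = 60.0    # seconds between forced reports

def should_report(prev, curr, last_sent_at, now=None):
    now = time.monotonic() if now is None else now
    if now - last_sent_at >= HEARTBEAT_INTERVAL:
        return True   # heartbeat: prove the host is still alive
    if prev is None:
        return True   # first sample always goes out
    if abs(curr["free_mb"] - prev["free_mb"]) >= MEM_TOLERANCE_MB:
        return True   # a memory change big enough to matter to scheduling
    return curr["vcpus_used"] != prev["vcpus_used"]
```

A "full" compute node whose stats barely move would then send one message per heartbeat interval instead of one per period, which is where the bulk of the redundant traffic goes away.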

> > Forgive me
> > if nova already makes this optimization somehow, it didn't seem to when
> > I was tinkering a year ago.
> >
> Not as far as I know, it doesn't.
> > There is also the complexity of designing a scheduler which is fault
> > tolerant and scales economically. What we have now will overtax the
> > message bus and the database as the number of compute nodes increases.
> > We want to get O(1) complexity out of that, but we're getting O(N)
> > right now.
> >
> O(N) will work provided O is small. ;)
> I think our cost currently lies in doing 1 MySQL DB update per node per
> minute, and one really quite mad query per schedule.  I agree that ZK would
> be less costly for that in both respects, which is really more about
> lowering O than N.  I'm wondering if we can do better still, that's all,
> but we both agree that this approach would work.

Right, I think it is worth an experiment if for no other reason than
MySQL can't really go much faster for this. We could move the mad query
out into RAM, but then we get the problem of how to keep a useful dataset
in RAM and we're back to syncing or polling the database hard.
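To illustrate what "moving the mad query into RAM" could look like, here is a toy in-memory filter over a cached host map (field names are invented for illustration and are not nova's schema):

```python
# Toy example: scheduling candidates selected by filtering a cached dict
# in RAM, replacing a complex SQL query. Data and fields are made up.
hosts = {
    "host1": {"free_mb": 8192, "free_vcpus": 4, "disabled": False},
    "host2": {"free_mb": 2048, "free_vcpus": 0, "disabled": False},
    "host3": {"free_mb": 16384, "free_vcpus": 8, "disabled": True},
}

def candidates(hosts, mem_mb, vcpus):
    # linear scan; real code would keep indexes for fast lookups
    return [name for name, h in hosts.items()
            if not h["disabled"]
            and h["free_mb"] >= mem_mb
            and h["free_vcpus"] >= vcpus]

print(candidates(hosts, mem_mb=4096, vcpus=2))  # ['host1']
```

The hard part, as noted above, is not the filter itself but keeping that dict consistent with reality, which is exactly the sync-or-poll problem the intermediary is meant to absorb.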
