[openstack-dev] Scheduler proposal

Joshua Harlow harlowja at fastmail.com
Tue Oct 13 17:23:06 UTC 2015


Clint Byrum wrote:
> Excerpts from Ian Wells's message of 2015-10-13 09:24:42 -0700:
>> On 12 October 2015 at 21:18, Clint Byrum <clint at fewbar.com> wrote:
>>
>>> We _would_ keep a local cache of the information in the schedulers. The
>>> centralized copy of it is to free the schedulers from the complexity of
>>> having to keep track of it as state, rather than as a cache. We also don't
>>> have to provide a way for on-demand stat fetching to seed scheduler 0.
>>>
>> I'm not sure that actually changes.  On restart, a scheduler wouldn't
>> have enough knowledge to schedule, but the other schedulers are unaffected
>> and can service requests while it waits for data.  Using ZK, that takes
>> fewer seconds because it can get a braindump, but during that window in
>> either case the system works at (n-1)/n capacity, assuming queries are
>> only done in memory.
>>
>
> Yeah, I'd put this as a 3 on the 1-10 scale of optimizations. Not a
> reason to do it, but an assessment that it improves the efficiency of
> starting new schedulers. It also has the benefit that if you do choose
> to just run 1 scheduler, you can just start a new one and it will walk
> the tree and start scheduling immediately thereafter.
>
>> Also, you seemed to suggest that the ZK option would take less memory, but
>> it seems it would take more.  You can't schedule without a relatively complete
>> set of information or some relatively intricate query language, which I
>> didn't think ZK was up to (but I'm open to correction there, certainly).
>> That implies that when you notify a scheduler of a change to the data
>> model, it's going to grab the fresh data and keep it locally.
>>
>
> If I did that, I was being unclear and I'm sorry for that. I do think
> the cache of potential scheduling targets and stats should fit in RAM
> easily for even 100,000 nodes, including indexes for fast lookups.
> The intermediary is entirely to alleviate the need for complicated sync
> protocols to be implemented in the scheduler and compute agent. RAM is
> cheap, time is not.

+1

Servers come with tens or even hundreds of gigabytes of memory nowadays,
and if we cache locally with various levels of indexing (perhaps even
using some other db-like library to help here), then I'd hope we can fit
as many nodes as we desire.
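
As a rough sanity check on the RAM claim: even at an assumed 10 KB per host record (my guess, not a measured figure), 100,000 hosts comes to about 1 GB. Below is a minimal sketch of what a local cache with one index could look like. The names (HostState, HostCache, free_ram_mb and so on) are hypothetical and are not nova's actual data model; a real scheduler would index on many more attributes.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class HostState:
    """Hypothetical per-host record; field names are illustrative only."""
    name: str
    free_ram_mb: int
    free_vcpus: int


class HostCache:
    """Local in-memory cache with a sorted index on free RAM."""

    def __init__(self):
        self._hosts = {}    # name -> HostState
        self._by_ram = []   # sorted list of (free_ram_mb, name)

    def upsert(self, host):
        """Insert a host, or replace its previous state and index entry."""
        old = self._hosts.get(host.name)
        if old is not None:
            # Drop the stale index entry before adding the new one.
            idx = bisect.bisect_left(self._by_ram, (old.free_ram_mb, old.name))
            del self._by_ram[idx]
        self._hosts[host.name] = host
        bisect.insort(self._by_ram, (host.free_ram_mb, host.name))

    def with_free_ram(self, min_mb):
        """Return hosts with at least min_mb free RAM, in ascending order."""
        start = bisect.bisect_left(self._by_ram, (min_mb, ""))
        return [self._hosts[name] for _, name in self._by_ram[start:]]
```

The sorted list keeps range lookups at O(log n) to find the starting point, at the cost of O(n) inserts; a db-like library would presumably do better on the write side.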

>
>>>> Also, the notification path here is that the compute host notifies ZK and
>>>> ZK notifies many schedulers, assuming they're all capable of handling all
>>>> queries.  That is in fact N * (M+1) messages, which is slightly more than
>>>> if there's no central node, as it happens.  There are fewer *channels*, but
>>>> more messages.  (I feel like I'm overlooking something here, but I can't
>>>> pick out the flaw...)  Yes, RMQ will suck at this - but then let's talk
>>>> about better messaging rather than another DB type.
>>>>
>>> You're calling transactions messages, and that's not really fair to
>>> messaging or transactions. :)
>>>
>> I was actually talking about the number of messages crossing the network.
>> Your point is that the transaction with ZK is heavier weight than the
>> update processing at the schedulers, I think.  But then removing ZK as a
>> nexus removes that transaction, so both the number of messages and the
>> number of transactions goes down.
>>
>
> Point taken and agreed.
>
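
To make the message counting above concrete, here is a small sketch. It assumes N compute hosts that each publish one update per round and M schedulers that each need every update. The function names are made up for illustration, and the channel counts are my own reading of the "fewer channels, more messages" remark.

```python
def messages_via_hub(n_hosts, m_schedulers):
    """Each host sends 1 message to the hub; the hub fans out M copies."""
    return n_hosts * (m_schedulers + 1)


def messages_direct(n_hosts, m_schedulers):
    """Each host sends its update straight to every scheduler."""
    return n_hosts * m_schedulers


def channels_via_hub(n_hosts, m_schedulers):
    """Every host and every scheduler holds one connection to the hub."""
    return n_hosts + m_schedulers


def channels_direct(n_hosts, m_schedulers):
    """Every host holds a connection to every scheduler."""
    return n_hosts * m_schedulers
```

With 1,000 hosts and 3 schedulers that is 4,000 messages over 1,003 channels through a hub, versus 3,000 messages over 3,000 channels when sent directly.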
>>> However, it's important to note that in
>>> this situation, compute nodes do not have to send anything anywhere if
>>> nothing has changed, which is very likely the case for "full" compute
>>> nodes, and certainly will save many many redundant messages.
>>
>> Now that's a fair comment, certainly, and would drastically reduce the
>> number of messages in the system if we can keep the nodes from updating
>> just because their free memory has changed by a couple of pages.
>>
>
> Indeed, an optimization like this is actually orthogonal to the management
> of the corpus of state from all hosts. Hosts should in fact be able
> to optimize for this already. Of course, then you lose the heartbeat..
> which might be more valuable than the savings in communication load.
>
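
One way to get both the savings and the heartbeat is to skip an update only when the value has barely moved and a recent update has already gone out. This is a minimal sketch under that assumption. UpdateFilter, min_delta_mb and heartbeat_s are hypothetical names and thresholds, not anything nova does today.

```python
class UpdateFilter:
    """Decide whether a compute host should publish its free-RAM stat."""

    def __init__(self, min_delta_mb=64, heartbeat_s=60.0):
        self.min_delta_mb = min_delta_mb   # ignore changes smaller than this
        self.heartbeat_s = heartbeat_s     # but never stay silent longer than this
        self._last_value = None
        self._last_sent = None

    def should_send(self, free_ram_mb, now):
        """Return True if an update should be sent at time `now` (seconds)."""
        if self._last_sent is None:
            send = True                    # first report always goes out
        elif abs(free_ram_mb - self._last_value) >= self.min_delta_mb:
            send = True                    # significant change
        elif now - self._last_sent >= self.heartbeat_s:
            send = True                    # heartbeat is due
        else:
            send = False
        if send:
            self._last_value = free_ram_mb
            self._last_sent = now
        return send
```

A "full" node whose stats don't change would then send one message per heartbeat interval and nothing in between.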
>>> Forgive me
>>> if nova already makes this optimization somehow, it didn't seem to when
>>> I was tinkering a year ago.
>>>
>> Not as far as I know, it doesn't.
>>
>>> There is also the complexity of designing a scheduler which is fault
>>> tolerant and scales economically. What we have now will overtax the
>>> message bus and the database as the number of compute nodes increases.
>>> We want to get O(1) complexity out of that, but we're getting O(N)
>>> right now.
>>>
>> O(N) will work providing O is small. ;)
>>
>> I think our cost currently lies in doing 1 MySQL DB update per node per
>> minute, and one really quite mad query per schedule.  I agree that ZK would
>> be less costly for that in both respects, which is really more about
>> lowering O than N.  I'm wondering if we can do better still, that's all,
>> but we both agree that this approach would work.
>
> Right, I think it is worth an experiment if for no other reason than
> MySQL can't really go much faster for this. We could move the mad query
> out into RAM, but then we get the problem of how to keep a useful dataset
> in RAM and we're back to syncing or polling the database hard.
>
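
To put a number on the "1 update per node per minute" cost mentioned above, a trivial sketch (the function name is made up, and the 100,000 node count is just the figure used earlier in this thread):

```python
def db_updates_per_second(n_nodes, interval_s=60.0):
    """Steady-state write rate if every node reports once per interval."""
    return n_nodes / interval_s


rate = db_updates_per_second(100_000)
```

That works out to roughly 1,667 writes per second at 100,000 nodes, before counting the per-schedule query.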