[openstack-dev] Compute node stats sent to the scheduler
Dugger, Donald D
donald.d.dugger at intel.com
Tue Jun 18 14:33:35 UTC 2013
Well (not in any kind of priority order), the issues/ideas I've heard are:
1) Don't send data to the scheduler; have the scheduler poll each compute node on each schedule request. I'm pretty sure that, in a large cloud with 100s if not 1000s of compute nodes, this would add too much latency to the schedule request.
2) Fan-out data is too laggy. I don't understand why the fan-out messages should be any more laggy than updating the DB. In both cases a message has to be sent to a remote server (fan-out messages to the scheduler, DB updates to the DB server). Given that the bulk of the lag should be in that message transit, I would expect the lags to be approximately equivalent.
3) Fan-out messages are too infrequent. Currently, the fan-out messages only go out on a periodic basis (right now every 60 seconds), which does lead to stale data. I believe that the DB is being updated on every state change, so I would suggest that we just send a fan-out message on every state change rather than updating the DB (see the sketch after this list). If my point 2 is correct then this should require the same overhead and so shouldn't be a problem.
4) One suggestion was to re-architect the scheduler to `remember' resources between requests. This seems like a major effort that, as pointed out, potentially raises coherency issues, and if we do fan-outs on every state change it is not needed.
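
To illustrate option 3, a minimal sketch of what an event-driven fan-out might look like on the compute node side. The _report_stats() hook and the message shape are illustrative assumptions, not the actual nova code; rpc.fanout_cast is the same helper the periodic task uses today:

    from nova.openstack.common import rpc

    def _report_stats(context, host, stats):
        # Hypothetical hook, called on every resource state change
        # (instance claim, release, resize, ...) instead of only from
        # the 60-second periodic task.
        msg = {
            'method': 'update_service_capabilities',
            'args': {
                'service_name': 'compute',
                'host': host,
                'capabilities': stats,  # free_ram_mb, vcpus_used, ...
            },
        }
        # Same fan-out mechanism as today, just triggered per event.
        rpc.fanout_cast(context, 'scheduler', msg)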
--
Don Dugger
"Censeo Toto nos in Kansa esse decisse." - D. Gale
Ph: 303/443-3786
-----Original Message-----
From: Wang, Shane [mailto:shane.wang at intel.com]
Sent: Tuesday, June 18, 2013 6:13 AM
To: OpenStack Development Mailing List
Subject: Re: [openstack-dev] Compute node stats sent to the scheduler
Hi,
I am new to this area. I have an idea, but I don't know whether it would work.
Fanout_cast is expensive and the DB could be a burden. Can we maintain the stats data on the compute nodes, and have the scheduler proactively ask the nodes for their stats when, and only when, it needs to do any scheduling?
The assumption is that scheduling doesn't happen frequently compared with the frequency of fanout_cast.
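
To make the idea concrete, a minimal sketch of the poll-on-demand approach, assuming a hypothetical per-host 'get_stats' RPC method on the compute topic (not an existing nova API):

    from nova.openstack.common import rpc

    def poll_host_stats(context, hosts):
        # One synchronous round trip per candidate host, on the
        # critical path of every schedule request; this is the latency
        # concern raised above for large clouds.
        stats = {}
        for host in hosts:
            stats[host] = rpc.call(context,
                                   'compute.%s' % host,
                                   {'method': 'get_stats', 'args': {}})
        return stats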
Best Regards.
--
Shane
Brian Elliott wrote on 2013-06-18:
>
> On Jun 17, 2013, at 3:50 PM, Chris Behrens <cbehrens at codestud.com> wrote:
>
>>
>> On Jun 17, 2013, at 7:49 AM, Russell Bryant <rbryant at redhat.com> wrote:
>>
>>> On 06/16/2013 11:25 PM, Dugger, Donald D wrote:
>>>> Looking into the scheduler a bit, there's an issue of duplicated
>>>> effort that is a little puzzling. The database table `compute_nodes'
>>>> is being updated periodically with data about capabilities and
>>>> resources used (memory, vcpus, ...) while at the same time a periodic
>>>> RPC call is being made to the scheduler sending pretty much the same
>>>> data.
>>>>
>>>> Does anyone know why we are updating the same data in two different
>>>> places using two different mechanisms? Also, assuming we were to
>>>> remove one of these updates, which one should go? (I thought at one
>>>> point in time there was a goal to create a database-free compute
>>>> node, which would imply we should remove the DB update.)
>>>
>>> Have you looked around to see if any code is using the data from the db?
>>>
>>> Having schedulers hit the db for the current state of all compute nodes
>>> all of the time would be a large additional db burden that I think we
>>> should avoid. So, it makes sense to keep the rpc fanout_cast of current
>>> stats to schedulers.
>>
>> This is actually what the scheduler uses. :) The fanout messages are
>> too infrequent and can be too laggy, so the scheduler was moved to
>> using the DB a long, long time ago. But it was very inefficient at
>> first, because it looped through all instances, so we added the things
>> we needed into compute_node and compute_node_stats so we only had to
>> look at the hosts. You have to pull the hosts anyway, so we pull the
>> stats at the same time.
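
A minimal sketch of that hosts-plus-stats single query, with illustrative SQLAlchemy models (not the exact nova ones):

    from sqlalchemy import Column, ForeignKey, Integer, String
    from sqlalchemy.orm import declarative_base, joinedload, relationship

    Base = declarative_base()

    class ComputeNode(Base):
        __tablename__ = 'compute_nodes'
        id = Column(Integer, primary_key=True)
        host = Column(String(255))
        free_ram_mb = Column(Integer)
        stats = relationship('ComputeNodeStat')

    class ComputeNodeStat(Base):
        __tablename__ = 'compute_node_stats'
        id = Column(Integer, primary_key=True)
        compute_node_id = Column(Integer, ForeignKey('compute_nodes.id'))
        key = Column(String(255))
        value = Column(String(255))

    def compute_node_get_all(session):
        # One round trip: every host row with its key/value stats
        # eagerly joined, instead of looping over all instances.
        return (session.query(ComputeNode)
                       .options(joinedload(ComputeNode.stats))
                       .all())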
>>
>> The problem is, when we stopped using certain data from the fanout
>> messages, we never removed it. We should AT LEAST do this. But (see
>> below)...
>>
>>>
>>> The scheduler also does a fanout_cast to all compute nodes when it
>>> starts up to trigger the compute nodes to populate the cache in the
>>> scheduler. It would be nice to never fanout_cast to all compute nodes
>>> (given that there may be a *lot* of them). We could replace this with
>>> having the scheduler populate its cache from the database.
>>
>> I think we should audit the remaining things that the scheduler uses
>> from these messages and move them to the DB. I believe it's limited to
>> the hypervisor capabilities to compare against aggregates or some such.
>> I believe it's things that change very rarely, so an alternative could
>> be to only send fanout messages when capabilities change! We could
>> always do that as a first step.
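
A minimal sketch of that first step, fanning out only when capabilities actually change (the names here are illustrative):

    _last_capabilities = None

    def publish_if_changed(context, capabilities, fanout_cast):
        # Remember the last broadcast and skip the fan-out when the
        # (rarely changing) capabilities are identical to it.
        global _last_capabilities
        if capabilities == _last_capabilities:
            return
        _last_capabilities = dict(capabilities)
        fanout_cast(context, 'scheduler', {
            'method': 'update_service_capabilities',
            'args': {'capabilities': capabilities},
        })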
>>
>>>
>>> Removing the db usage completely would be nice if nothing is actually
>>> using it, but we'd have to look into an alternative solution for
>>> removing the scheduler fanout_cast to compute.
>>
>> Relying on anything but the DB for current memory free, etc., is just
>> too laggy, so we need to stick with it, IMO.
>>
>> - Chris
>>
>>
>
> As Chris said, the reason it ended up this way, using the DB, is to get
> up-to-date usage on hosts to the scheduler quickly. I certainly
> understand the point that it's a whole lot of increased load on the DB,
> but the RPC data was quite stale. If there is interest in moving away
> from the DB updates, I think we have to either:
>
> 1) Send RPC updates to scheduler on essentially every state change
> during a build.
>
> or
>
> 2) Change the scheduler architecture so there is some "memory" of
> resources consumed between requests. The scheduler would have to
> remember which hosts recent builds were assigned to. This could be a
> bit of a data synchronization problem if you're talking about using
> multiple scheduler instances.
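
A minimal sketch of option 2's scheduler-side "memory", loosely in the spirit of a host-state cache (illustrative, not the actual scheduler code):

    class HostState(object):
        """Scheduler-local view of one host's free resources."""

        def __init__(self, host, free_ram_mb, free_vcpus):
            self.host = host
            self.free_ram_mb = free_ram_mb
            self.free_vcpus = free_vcpus

        def consume(self, instance):
            # Decrement locally on every placement so back-to-back
            # requests don't oversubscribe a host before the next
            # authoritative update arrives. With multiple scheduler
            # processes, each holds its own copy of this state: the
            # synchronization problem mentioned above.
            self.free_ram_mb -= instance['memory_mb']
            self.free_vcpus -= instance['vcpus']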
>
> Brian
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev at lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev