[openstack-dev] Compute node stats sent to the scheduler
Dugger, Donald D
donald.d.dugger at intel.com
Tue Jun 18 14:33:35 UTC 2013
Well (not in any kind of priority order), the issues/ideas I've heard are:
1) Don't send data to the scheduler; have the scheduler poll each compute node on each schedule request. I'm pretty sure that, in a large cloud with 100s if not 1000s of compute nodes, this would add too much latency to the schedule request.
2) Fan-out data is too laggy. I don't understand why the fan-out messages should be any more laggy than updating the DB. In both cases a message has to be sent to a remote server (fan-out messages to the scheduler, DB updates to the DB server). Given that the bulk of the lag should be in that message transit, I would expect the lags to be approximately equivalent.
3) Fan-out messages are too infrequent. Currently, the fan-out messages only go out on a periodic basis (right now every 60 seconds), which does lead to stale data. I believe that the DB is being updated on every state change, so I would suggest that we just send a fan-out message on every state change rather than updating the DB (see the sketch after this list). If my point 2 is correct then this should require the same overhead and so shouldn't be a problem.
4) One suggestion was to re-architect the scheduler to `remember' resources between requests. This seems like a major effort that, as pointed out, potentially raises coherency issues, and if we do fan-outs on every state change it is not needed.
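
To illustrate option 3, a minimal sketch of what an event-driven fan-out might look like on the compute node side. The _report_stats() hook and the message shape are illustrative assumptions, not the actual nova code; rpc.fanout_cast is the same helper the periodic task uses today:

    from nova.openstack.common import rpc

    def _report_stats(context, host, stats):
        # Hypothetical hook, called on every resource state change
        # (instance claim, release, resize, ...) instead of only from
        # the 60-second periodic task.
        msg = {
            'method': 'update_service_capabilities',
            'args': {
                'service_name': 'compute',
                'host': host,
                'capabilities': stats,  # free_ram_mb, vcpus_used, ...
            },
        }
        # Same fan-out mechanism as today, just triggered per event.
        rpc.fanout_cast(context, 'scheduler', msg)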
--
Don Dugger
"Censeo Toto nos in Kansa esse decisse." - D. Gale
Ph: 303/443-3786
-----Original Message-----
From: Wang, Shane [mailto:shane.wang at intel.com]
Sent: Tuesday, June 18, 2013 6:13 AM
To: OpenStack Development Mailing List
Subject: Re: [openstack-dev] Compute node stats sent to the scheduler
Hi,
I am new to this area. I have an idea, but I don't know whether it would work.
Fanout_cast is expensive and the DB could be a burden. Can we maintain the stats data on the compute nodes, and have the scheduler proactively ask the nodes for their stats when, and only when, it needs to do any scheduling?
The assumption is that scheduling doesn't happen frequently compared with the frequency of fanout_cast.
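
To make the idea concrete, a minimal sketch of the poll-on-demand approach, assuming a hypothetical per-host 'get_stats' RPC method on the compute topic (not an existing nova API):

    from nova.openstack.common import rpc

    def poll_host_stats(context, hosts):
        # One synchronous round trip per candidate host, on the
        # critical path of every schedule request; this is the latency
        # concern raised above for large clouds.
        stats = {}
        for host in hosts:
            stats[host] = rpc.call(context,
                                   'compute.%s' % host,
                                   {'method': 'get_stats', 'args': {}})
        return stats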
Best Regards.
--
Shane
Brian Elliott wrote on 2013-06-18:
>
> On Jun 17, 2013, at 3:50 PM, Chris Behrens <cbehrens at codestud.com> wrote:
>
>>
>> On Jun 17, 2013, at 7:49 AM, Russell Bryant <rbryant at redhat.com> wrote:
>>
>>> On 06/16/2013 11:25 PM, Dugger, Donald D wrote:
>>>> Looking into the scheduler a bit, there's an issue of duplicated
>>>> effort that is a little puzzling. The database table `compute_nodes'
>>>> is being updated periodically with data about capabilities and
>>>> resources used (memory, vcpus, ...) while at the same time a periodic
>>>> RPC call is being made to the scheduler sending pretty much the same
>>>> data.
>>>>
>>>> Does anyone know why we are updating the same data in two different
>>>> places using two different mechanisms? Also, assuming we were to
>>>> remove one of these updates, which one should go? (I thought at one
>>>> point in time there was a goal to create a database-free compute
>>>> node, which would imply we should remove the DB update.)
>>>
>>> Have you looked around to see if any code is using the data from the db?
>>>
>>> Having schedulers hit the db for the current state of all compute nodes
>>> all of the time would be a large additional db burden that I think we
>>> should avoid. So, it makes sense to keep the rpc fanout_cast of current
>>> stats to schedulers.
>>
>> This is actually what the scheduler uses. :) The fanout messages are
>> too infrequent and can be too laggy, so the scheduler was moved to
>> using the DB a long, long time ago. But it was very inefficient at
>> first, because it looped through all instances, so we added the things
>> we needed into compute_node and compute_node_stats so we only had to
>> look at the hosts. You have to pull the hosts anyway, so we pull the
>> stats at the same time.
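
A minimal sketch of that hosts-plus-stats single query, with illustrative SQLAlchemy models (not the exact nova ones):

    from sqlalchemy import Column, ForeignKey, Integer, String
    from sqlalchemy.orm import declarative_base, joinedload, relationship

    Base = declarative_base()

    class ComputeNode(Base):
        __tablename__ = 'compute_nodes'
        id = Column(Integer, primary_key=True)
        host = Column(String(255))
        free_ram_mb = Column(Integer)
        stats = relationship('ComputeNodeStat')

    class ComputeNodeStat(Base):
        __tablename__ = 'compute_node_stats'
        id = Column(Integer, primary_key=True)
        compute_node_id = Column(Integer, ForeignKey('compute_nodes.id'))
        key = Column(String(255))
        value = Column(String(255))

    def compute_node_get_all(session):
        # One round trip: every host row with its key/value stats
        # eagerly joined, instead of looping over all instances.
        return (session.query(ComputeNode)
                       .options(joinedload(ComputeNode.stats))
                       .all())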
>>
>> The problem is, when we stopped using certain data from the fanout
>> messages, we never removed it. We should AT LEAST do this. But (see
>> below)...
>>
>>>
>>> The scheduler also does a fanout_cast to all compute nodes when it
>>> starts up to trigger the compute nodes to populate the cache in the
>>> scheduler. It would be nice to never fanout_cast to all compute nodes
>>> (given that there may be a *lot* of them). We could replace this with
>>> having the scheduler populate its cache from the database.
>>
>> I think we should audit the remaining things that the scheduler uses
>> from these messages and move them to the DB. I believe it's limited to
>> the hypervisor capabilities to compare against aggregates or some such.
>> I believe it's things that change very rarely, so an alternative could
>> be to only send fanout messages when capabilities change! We could
>> always do that as a first step.
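
A minimal sketch of that first step, fanning out only when capabilities actually change (the names here are illustrative):

    _last_capabilities = None

    def publish_if_changed(context, capabilities, fanout_cast):
        # Remember the last broadcast and skip the fan-out when the
        # (rarely changing) capabilities are identical to it.
        global _last_capabilities
        if capabilities == _last_capabilities:
            return
        _last_capabilities = dict(capabilities)
        fanout_cast(context, 'scheduler', {
            'method': 'update_service_capabilities',
            'args': {'capabilities': capabilities},
        })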
>>
>>>
>>> Removing the db usage completely would be nice if nothing is actually
>>> using it, but we'd have to look into an alternative solution for
>>> removing the scheduler fanout_cast to compute.
>>
>> Relying on anything but the DB for current memory free, etc., is just
>> too laggy, so we need to stick with it, IMO.
>>
>> - Chris
>>
>>
>
> As Chris said, the reason it ended up this way, using the DB, is to get
> up-to-date usage on hosts to the scheduler quickly. I certainly
> understand the point that it's a whole lot of increased load on the DB,
> but the RPC data was quite stale. If there is interest in moving away
> from the DB updates, I think we have to either:
>
> 1) Send RPC updates to scheduler on essentially every state change
> during a build.
>
> or
>
> 2) Change the scheduler architecture so there is some "memory" of
> resources consumed between requests. The scheduler would have to
> remember which hosts recent builds were assigned to. This could be a
> bit of a data synchronization problem if you're talking about using
> multiple scheduler instances.
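
A minimal sketch of option 2's scheduler-side "memory", loosely in the spirit of a host-state cache (illustrative, not the actual scheduler code):

    class HostState(object):
        """Scheduler-local view of one host's free resources."""

        def __init__(self, host, free_ram_mb, free_vcpus):
            self.host = host
            self.free_ram_mb = free_ram_mb
            self.free_vcpus = free_vcpus

        def consume(self, instance):
            # Decrement locally on every placement so back-to-back
            # requests don't oversubscribe a host before the next
            # authoritative update arrives. With multiple scheduler
            # processes, each holds its own copy of this state: the
            # synchronization problem mentioned above.
            self.free_ram_mb -= instance['memory_mb']
            self.free_vcpus -= instance['vcpus']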
>
> Brian
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev at lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev