[openstack-dev] Compute node stats sent to the scheduler
rbryant at redhat.com
Wed Jun 19 20:54:31 UTC 2013
On 06/17/2013 05:09 PM, Brian Elliott wrote:
> On Jun 17, 2013, at 3:50 PM, Chris Behrens <cbehrens at codestud.com> wrote:
>> On Jun 17, 2013, at 7:49 AM, Russell Bryant <rbryant at redhat.com> wrote:
>>> On 06/16/2013 11:25 PM, Dugger, Donald D wrote:
>>>> Looking into the scheduler a bit there's an issue of duplicated effort that is a little puzzling. The database table `compute_nodes' is being updated periodically with data about capabilities and resources used (memory, vcpus, ...) while at the same time a periodic RPC call is being made to the scheduler sending pretty much the same data.
>>>> Does anyone know why we are updating the same data in two different places using two different mechanisms? Also, assuming we were to remove one of these updates, which one should go? (I thought at one point in time there was a goal to create a database-free compute node, which would imply we should remove the DB update.)
>>> Have you looked around to see if any code is using the data from the db?
>>> Having schedulers hit the db for the current state of all compute nodes
>>> all of the time would be a large additional db burden that I think we
>>> should avoid. So, it makes sense to keep the rpc fanout_cast of current
>>> stats to schedulers.
>> This is actually what the scheduler uses. :) The fanout messages are too infrequent and can be too laggy. So, the scheduler was moved to using the DB a long, long time ago… but it was very inefficient, at first, because it looped through all instances. So we added things we needed into compute_node and compute_node_stats so we only had to look at the hosts. You have to pull the hosts anyway, so we pull the stats at the same time.
>> The problem is… when we stopped using certain data from the fanout messages… we never removed it. We should AT LEAST do this. But… (see below).
>>> The scheduler also does a fanout_cast to all compute nodes when it
>>> starts up to trigger the compute nodes to populate the cache in the
>>> scheduler. It would be nice to never fanout_cast to all compute nodes
>>> (given that there may be a *lot* of them). We could replace this with
>>> having the scheduler populate its cache from the database.
>> I think we should audit the remaining things that the scheduler uses from these messages and move them to the DB. I believe it's limited to the hypervisor capabilities to compare against aggregates or some such. I believe it's things that change very rarely… so an alternative can be to only send fanout messages when capabilities change! We could always do that as a first step.
>>> Removing the db usage completely would be nice if nothing is actually
>>> using it, but we'd have to look into an alternative solution for
>>> removing the scheduler fanout_cast to compute.
>> Relying on anything but the DB for current memory free, etc, is just too laggy… so we need to stick with it, IMO.
>> - Chris
> As Chris said, the reason it ended up this way using the DB is to quickly get up-to-date usage on hosts to the scheduler. I certainly understand the point that it's a whole lot of increased load on the DB, but the RPC data was quite stale. If there is interest in moving away from the DB updates, I think we have to either:
> 1) Send RPC updates to scheduler on essentially every state change during a build.
> 2) Change the scheduler architecture so there is some "memory" of resources consumed between requests. The scheduler would have to remember which hosts recent builds were assigned to. This could be a bit of a data synchronization problem if you're talking about using multiple scheduler instances.
Thanks for the feedback. Neither of these sounds too attractive to me.
I think Chris' comment to audit the usage of the fanout messages and get
rid of them sounds like the best way forward to clean this up.
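
As a rough illustration of the interim step Chris mentioned (only sending
fanout messages when capabilities change), here is a minimal sketch. It is
not taken from the nova tree: the CapabilityPublisher class, the
maybe_publish method and the send_fanout callable are all hypothetical
names, and send_fanout merely stands in for whatever RPC fanout call the
compute node would actually make.

```python
import hashlib
import json


class CapabilityPublisher:
    """Hypothetical helper: fan out capabilities only when they change."""

    def __init__(self, send_fanout):
        # send_fanout is an assumed stand-in for the real RPC fanout call;
        # it is a plain callable so the sketch stays self-contained.
        self._send_fanout = send_fanout
        self._last_digest = None

    @staticmethod
    def _digest(capabilities):
        # Serialize with sorted keys so dict ordering does not affect
        # the digest of otherwise identical capabilities.
        payload = json.dumps(capabilities, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def maybe_publish(self, capabilities):
        """Send capabilities if they differ from the last ones sent.

        Returns True when a message was sent, False otherwise.
        """
        digest = self._digest(capabilities)
        if digest == self._last_digest:
            return False
        self._send_fanout(capabilities)
        self._last_digest = digest
        return True
```

Called from the existing periodic task, something like this would send one
message the first time and then stay quiet until the capabilities actually
differ. A scheduler that starts later would never see that first message,
so it would still need to populate its cache from the database as
suggested above.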