[openstack-dev] Scheduler proposal

Joshua Harlow harlowja at outlook.com
Thu Oct 8 17:37:50 UTC 2015


Clint Byrum wrote:
> Excerpts from Joshua Harlow's message of 2015-10-08 08:38:57 -0700:
>> Joshua Harlow wrote:
>>> On Thu, 8 Oct 2015 10:43:01 -0400
>>> Monty Taylor <mordred at inaugust.com> wrote:
>>>
>>>> On 10/08/2015 09:01 AM, Thierry Carrez wrote:
>>>>> Maish Saidel-Keesing wrote:
>>>>>> Operational overhead has a cost - maintaining 3 different database
>>>>>> tools, backing them up, providing HA, etc. has operational cost.
>>>>>>
>>>>>> This is not to say that this cannot be overseen, but it should be
>>>>>> taken into consideration.
>>>>>>
>>>>>> And *if* they can be consolidated into an agreed solution across
>>>>>> the whole of OpenStack - that would be highly beneficial (IMHO).
>>>>> Agreed, and that ties into the similar discussion we recently had
>>>>> about picking a common DLM. Ideally we'd only add *one* general
>>>>> dependency and use it for locks / leader election / syncing status
>>>>> around.
>>>>>
>>>> ++
>>>>
>>>> All of the proposed DLM tools can fill this space successfully. There
>>>> is definitely not a need for multiple.
>>> On this point, and just thinking out loud. If we consider saving
>>> compute_node information into say a node in said DLM backend (for
>>> example a znode in zookeeper[1]); this information would be updated
>>> periodically by that compute_node *itself* (it would say contain
>>> information about what VMs are running on it, what their utilization is
>>> and so on).
>>>
>>> For example the following layout could be used:
>>>
>>> /nova/compute_nodes/<hypervisor-hostname>
>>>
>>> <hypervisor-hostname> data could be:
>>>
>>> {
>>>       vms: [],
>>>       memory_free: XYZ,
>>>       cpu_usage: ABC,
>>>       memory_used: MNO,
>>>       ...
>>> }
>>>
>>> Now if we imagine each/all schedulers having watches
>>> on /nova/compute_nodes/ ([2] consul and etcd have equivalent concepts
>>> afaik) then when a compute_node updates that information a push
>>> notification (the watch being triggered) will be sent to the
>>> scheduler(s) and the scheduler(s) could then update a local in-memory
>>> cache of the data about all the hypervisors that can be selected from
>>> for scheduling. This avoids any reading of a large set of data in the
>>> first place (besides an initial read-once on startup to read the
>>> initial list + set up the watches); in a way it's similar to push
>>> notifications. Then when scheduling a VM -> hypervisor there isn't any
>>> need to query anything but the local in-memory representation that the
>>> scheduler is maintaining (and updating as watches are triggered)...
>>>
>>> So this is why I was wondering about what capabilities of cassandra are
>>> being used here; because the above are, I think, unique capabilities of
>>> DLM-like systems (zookeeper, consul, etcd) that could be advantageous
>>> here...
>>>
>>> [1]
>>> https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#sc_zkDataModel_znodes
>>>
>>> [2]
>>> https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkWatches
>>>
>>>
>> And here's a final super-awesomeness,
>>
>> Use the same existence of that znode + information (perhaps using
>> ephemeral znodes or equivalent) to determine if a hypervisor is 'alive'
>> or 'dead', thus removing the need to do queries and periodic writes to
>> the nova database to determine if a hypervisor's nova-compute service is
>> alive or dead (with reads via
>> https://github.com/openstack/nova/blob/master/nova/servicegroup/drivers/db.py#L33
>> and other similar code scattered in nova)...
>>
>
> ^^ THIS is the kind of architectural thinking I'd like to see us do more
> of.
>
> This isn't "hey I have a better database" it is "I have a way to reduce
> the most common operations to O(1) complexity".
>
> Ed, for all of the promise of your experiment, I'd actually rather see
> time spent on Josh's idea above. In fact, I might spend time on Josh's
> idea above. :)

Go for it!
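
To make the compute_node znode idea above a bit more concrete, here's 
roughly what the publishing side could look like (completely untested 
sketch, assuming the kazoo zookeeper client; get_local_stats() and the 
10 second interval are just made up for illustration):

import json
import socket
import time

from kazoo.client import KazooClient

NODE_PATH = '/nova/compute_nodes/%s' % socket.gethostname()


def get_local_stats():
    # Hypothetical helper; in nova this would come from the virt
    # driver / resource tracker.
    return {'vms': [], 'memory_free': 0, 'memory_used': 0,
            'cpu_usage': 0.0}


zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

# Ephemeral znode: it vanishes automatically when this nova-compute's
# session goes away, which doubles as the liveness signal.
zk.create(NODE_PATH, json.dumps(get_local_stats()).encode('utf-8'),
          ephemeral=True, makepath=True)

while True:
    # Each set() fires the data watches the schedulers have registered.
    zk.set(NODE_PATH, json.dumps(get_local_stats()).encode('utf-8'))
    time.sleep(10)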

We (at Yahoo) are also brainstorming this idea (or something like it), 
and as we hit more performance issues pushing 1000+ hypervisors in a 
single cluster (no cells) (one of our many clusters) we will start 
adjusting (and hopefully blogging, upstreaming and all that) whatever 
needs to be fixed/tweaked/altered to continue to push these boundaries.

Collab. and all that is welcome too, of course :)
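
For the scheduler side, the watch-maintained in-memory cache could look 
something like this (same disclaimers, just a sketch using kazoo's 
ChildrenWatch/DataWatch):

import json

from kazoo.client import KazooClient

NODES_PATH = '/nova/compute_nodes'

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()
zk.ensure_path(NODES_PATH)

# Local cache the filters/weighers would read from when placing a VM.
hypervisors = {}
watched = set()


def watch_node(hostname):
    path = '%s/%s' % (NODES_PATH, hostname)

    @zk.DataWatch(path)
    def _on_change(data, stat):
        if data is None:
            # Ephemeral znode is gone -> that nova-compute is dead/gone.
            hypervisors.pop(hostname, None)
        else:
            hypervisors[hostname] = json.loads(data.decode('utf-8'))


@zk.ChildrenWatch(NODES_PATH)
def _on_children(children):
    # Initial read-once at startup + any compute node that shows up later.
    for hostname in children:
        if hostname not in watched:
            watched.add(hostname)
            watch_node(hostname)

Scheduling a VM then only has to look at the local 'hypervisors' dict 
(which the watches keep up to date), instead of re-reading all the 
compute_node records on every request.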

P.S.

The DLM spec @ https://review.openstack.org/#/c/209661/ (rendered nicely 
at 
http://docs-draft.openstack.org/61/209661/29/check/gate-openstack-specs-docs/2ff62fa//doc/build/html/specs/chronicles-of-a-dlm.html) 
mentions 'Such a consensus being built will also influence the future 
functionality and capabilities of OpenStack at large so we need to be 
especially careful, thoughtful, and explicit here.'

This statement was really targeted at cases like this: when we (as a 
community) choose a DLM solution we affect the larger capabilities of 
OpenStack, not just for locking but for scheduling (and likely for other 
functionality I can't even think of/predict...)
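
P.P.S.

And for the liveness part, with the ephemeral znodes above the 
servicegroup 'is this nova-compute alive?' check basically collapses to 
an existence check, something like (reusing the zk client and NODES_PATH 
from the sketches above):

def is_up(hostname):
    # Znode exists -> the nova-compute that created it still holds a
    # live zookeeper session; no heartbeat rows in the nova db needed.
    return zk.exists('%s/%s' % (NODES_PATH, hostname)) is not None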



