[openstack-dev] [Nova] [Gantt][Scheduler-split] Why we need a Smart Placement Engine as a Service! (was: Scheduler split status (updated))

Yathiraj Udupi (yudupi) yudupi at cisco.com
Tue Jul 15 02:25:41 UTC 2014


Hi all,

Adding to the interesting discussion thread regarding the scheduler split and its importance, I would like to offer a couple of thoughts in favor of Gantt. At the Icehouse summit in Hong Kong, I, along with a few others (cc'd), pitched a scheduler design session on Smart Resource Placement (https://etherpad.openstack.org/p/NovaIcehouse-Smart-Resource-Placement), where we argued for a Smart Placement Decision Engine as a Service, with cross-service scheduling as one of the use cases. The idea is that a stand-alone service can act as a smart resource placement engine (see figure: https://docs.google.com/drawings/d/1BgK1q7gl5nkKWy3zLkP1t_SNmjl6nh66S0jHdP0-zbY/edit?pli=1) that uses state data from all the services to make a unified placement decision. We have also proposed a separate blueprint (https://blueprints.launchpad.net/nova/+spec/solver-scheduler, with working code now at https://github.com/CiscoSystems/nova-solver-scheduler) called the Smart (Solver) Scheduler, whose goal is smart resource placement that takes into account complex constraints spanning compute (Nova), storage (Cinder), and network. The existing Filter Scheduler, or a project like the Smart (Solver) Scheduler for the complex-constraint scenarios, could easily fulfill the decision-making side of such a placement engine.
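
To make the "complex constraints" point concrete, here is a toy sketch of how placement can be framed as a joint constraint problem over hosts rather than per-host filtering. The names and data shapes are made up for illustration; the real driver code is in the solver-scheduler repo linked above:

  # Toy sketch: joint placement under resource + affinity constraints.
  # Names and data shapes are illustrative, not the real driver API.

  def place(requests, hosts, volume_host):
      """Greedily assign each request to a host meeting all constraints.

      requests    -- list of dicts, e.g. {'root_gb': 1, 'volume': None}
      hosts       -- dict host_name -> {'free_disk_gb': int}
      volume_host -- dict volume_uuid -> host_name (Cinder-side state)
      """
      placement = {}
      for i, req in enumerate(requests):
          candidates = [h for h, cap in hosts.items()
                        if cap['free_disk_gb'] >= req['root_gb']]
          # Compute-volume affinity: restrict to the host holding the volume.
          if req.get('volume'):
              candidates = [h for h in candidates
                            if h == volume_host.get(req['volume'])]
          if not candidates:
              raise ValueError('no feasible host for request %d' % i)
          chosen = candidates[0]
          hosts[chosen]['free_disk_gb'] -= req['root_gb']  # claim as we go
          placement[i] = chosen
      return placement

A real solver would optimize over all assignments at once (e.g. as an LP), but the point stands either way: constraints from Nova and Cinder are evaluated together, in one decision.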

I believe the Gantt project is the right direction in terms of separating out the placement decision concern and creating a separate scheduler as a service, so that it can freely talk to any of the other services, or use a unified global state repository, and make a unified decision. Projects like the Smart (Solver) Scheduler can easily fit into the Gantt project as pluggable drivers to add the additional smarts required.

We have prototyped the Smart Scheduler as a service providing a RESTful interface, detached from Nova (only loosely connected).
For example, the following request asks for 2 VMs with a 1 GB disk requirement, plus 1 VM of flavor 'm1.tiny' with the additional requirement that it be placed close to the volume with UUID "ef6348300bc511e4bc4cc03fd564d1bc" (a compute-volume affinity constraint):


curl -i -H "Content-Type: application/json" -X POST -d '{"instance_requests": [{"num_instances": 2, "request_properties": {"instance_type": {"root_gb": 1}}}, {"num_instances": 1, "request_properties": {"flavor": "m1.tiny", "volume_affinity": "ef6348300bc511e4bc4cc03fd564d1bc"}}]}' http://<x.x.x.x>/smart-scheduler-as-a-service/v1.0/placement


It returns a placement decision like this:

{
  "result": [
    [
      {
        "host": {
          "host": "Host1",
          "nodename": "Node1"
        },
        "instance_uuid": "VM_ID_0_0"
      },
      {
        "host": {
          "host": "Host2",
          "nodename": "Node2"
        },
        "instance_uuid": "VM_ID_0_1"
      }
    ],
    [
      {
        "host": {
          "host": "Host1",
          "nodename": "Node1"
        },
        "instance_uuid": "VM_ID_1_0"
      }
    ]
  ]
}


This placement result can be used by Nova to proceed and complete the scheduling.
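
For anyone who wants to try the prototype from Python instead of curl, a minimal client sketch could look like this (same endpoint and payload as above; error handling omitted):

  # Minimal client sketch for the prototype placement API shown above.
  import json
  import requests  # third-party HTTP library

  payload = {'instance_requests': [
      {'num_instances': 2,
       'request_properties': {'instance_type': {'root_gb': 1}}},
      {'num_instances': 1,
       'request_properties': {'flavor': 'm1.tiny',
                              'volume_affinity':
                                  'ef6348300bc511e4bc4cc03fd564d1bc'}}]}

  resp = requests.post(
      'http://x.x.x.x/smart-scheduler-as-a-service/v1.0/placement',
      headers={'Content-Type': 'application/json'},
      data=json.dumps(payload))

  # Each inner list corresponds to one instance_request; each entry
  # names the chosen host/node for one instance, which Nova can then
  # use to continue the boot workflow.
  for group in resp.json()['result']:
      for item in group:
          print('%s -> %s' % (item['instance_uuid'], item['host']['host']))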


This is where I see the potential for Gantt: a stand-alone placement decision engine that can easily accommodate pluggable engines (such as the Smart Scheduler, https://blueprints.launchpad.net/nova/+spec/solver-scheduler) to make smart placement decisions.


Pointers:
Smart Resource Placement overview: https://docs.google.com/document/d/1IiPI0sfaWb1bdYiMWzAAx0HYR6UqzOan_Utgml5W1HI/edit?pli=1
Figure: https://docs.google.com/drawings/d/1BgK1q7gl5nkKWy3zLkP1t_SNmjl6nh66S0jHdP0-zbY/edit?pli=1
Nova Design Session Etherpad: https://etherpad.openstack.org/p/NovaIcehouse-Smart-Resource-Placement
https://etherpad.openstack.org/p/IceHouse-Nova-Scheduler-Sessions
Smart Scheduler Blueprint: https://blueprints.launchpad.net/nova/+spec/solver-scheduler
Working code: https://github.com/CiscoSystems/nova-solver-scheduler



Thanks,

Yathi.





On 7/14/14, 1:40 PM, "Murray, Paul (HP Cloud)" <pmurray at hp.com> wrote:

Hi All,

I'm sorry I am so late to this lively discussion – it looks like a good one! Jay has been driving the debate, so most of this is in response to his comments. But please, anyone should chip in.

On extensible resource tracking

Jay, I am surprised to hear you say no one has explained to you why there is an extensible resource tracking blueprint. It's simple: there was a succession of blueprints wanting to add data about this and that to the resource tracker, the scheduler, and the database tables used to communicate between them. These included capabilities, all the stuff in stats, rxtx_factor, the equivalent for CPU (which I think only works on one hypervisor), and pci_stats, and more were coming, including:

https://blueprints.launchpad.net/nova/+spec/network-bandwidth-entitlement
https://blueprints.launchpad.net/nova/+spec/cpu-entitlement

So, in short, your claim that there are no operators asking for additional stuff is simply not true.

Around the Icehouse summit (I think) it was suggested that we should stop the obvious trend and instead make resource tracking extensible, similar to metrics, which had just been added as an extensible way of collecting ongoing usage data (because that was also wanted).
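
To be concrete about what "extensible" means here, a hypothetical plugin shape (names invented for illustration, not taken from the spec) would be something like:

  # Hypothetical sketch of an extensible resource plugin; the class and
  # method names are illustrative, not the ones in the actual spec.

  class ResourcePlugin(object):
      """One pluggable resource tracked per compute node."""

      def report(self, host_state):
          """Return this resource's current usage for the scheduler."""
          raise NotImplementedError()

      def claim(self, request, host_state):
          """Deduct what the request consumes; raise if it cannot fit."""
          raise NotImplementedError()

  class BandwidthResource(ResourcePlugin):
      """The network-bandwidth-entitlement idea, recast as a plugin."""

      def report(self, host_state):
          return {'bw_mbps_free': host_state.get('bw_mbps_free', 0)}

      def claim(self, request, host_state):
          need = request.get('bw_mbps', 0)
          if host_state.get('bw_mbps_free', 0) < need:
              raise ValueError('insufficient bandwidth')
          host_state['bw_mbps_free'] -= need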

The json blob you refer to came out of the bad experience with the compute_node_stats table implemented for stats, which had a particular performance hit because it required an expensive join. This was dealt with by removing the table and adding a string field to hold the data as a json blob: a pure performance optimization. Clearly there is no need to store things this way, and with Nova objects being introduced there is a means to provide strict type checking on the data even if it is stored as json blobs in the database.
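
As a plain-Python sketch of that idea (illustrative only; Nova objects do this more generally):

  # Sketch: strict type checking on top of a json-blob column.
  # Plain Python for illustration; not the actual Nova objects code.
  import json

  class TypedStats(object):
      FIELDS = {'num_instances': int, 'io_workload': int}

      def __init__(self, **kwargs):
          for name, typ in self.FIELDS.items():
              value = kwargs.get(name, typ())
              if not isinstance(value, typ):
                  raise TypeError('%s must be %s' % (name, typ.__name__))
              setattr(self, name, value)

      def to_blob(self):
          # Serialized form matches what the DB string column stores today.
          return json.dumps({n: getattr(self, n) for n in self.FIELDS})

      @classmethod
      def from_blob(cls, blob):
          return cls(**json.loads(blob))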

On scheduler split

I have no particular position on splitting the scheduler. However, there was an interesting reaction to the network bandwidth entitlement blueprint listed above. The nova community felt it was a network thing and so nova should not provide it – neutron should. Of course, in nova, the nova scheduler makes placement decisions… can you see where this is going…? Nova needs to coordinate its placement decision with neutron to decide if a host has sufficient bandwidth available. Similar points are made about cinder – nova has no idea about cinder, but in some environments the location of a volume matters when you come to place an instance.

I should re-iterate that I have no position on splitting out the scheduler, but some way to deal with information from outside nova is certainly desirable. Maybe other services have the same dilemma.

On global resource tracker

I have to say I am inclined to be against the idea of turning the scheduler into a “global resource tracker”. I do see the benefit of obtaining a resource claim up front, we have all seen that the scheduler can make incorrect choices because of the delay in reflecting resource allocation to the database and so to the scheduler – it operates on imperfect information. However, it is best to avoid a global service relying on synchronous interaction with compute nodes during the process of servicing a request. I have looked at your example code for the scheduler (global resource tracker) and it seems to make a choice from local information and then interact with the chosen compute node to obtain a claim and then try again if the claim fails. I get it – I see that it deals with the same list of hosts on the retry. I also see it has no better chance of getting it right.

Your desire to have a claim is borne out by the persistent claims spec (I love the spec, though I really don't see why the claims have to be persistent). I think that is a great idea. Why not let the scheduler make placement suggestions (as a global service) and then allow conductors to obtain the claim and retry if the claim fails? It is a similar process to your code, but the scheduler only does its part, and the conductors scale out the process by acting more locally and with more parallelism. (Of course, you could also be optimistic and allow the compute node to do the claim as part of the create as the degenerate case.)
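
In rough pseudocode, with all names invented for illustration, the flow I am suggesting is:

  # Rough sketch of the suggest-then-claim flow described above.
  MAX_RETRIES = 3

  def conductor_boot(scheduler, compute_api, request_spec):
      """Conductor asks for a suggestion, then claims; retries on failure."""
      excluded = set()
      for _ in range(MAX_RETRIES):
          # The scheduler stays a global service: it only *suggests* a host.
          host = scheduler.select_destination(request_spec, exclude=excluded)
          # The conductor (or, optimistically, the compute node itself)
          # tries to turn the suggestion into a real claim.
          claim = compute_api.try_claim(host, request_spec)
          if claim:
              return compute_api.build_instance(host, claim, request_spec)
          excluded.add(host)  # lost the race on that host; try elsewhere
      raise Exception('no valid host: all claim attempts failed')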

To emphasize the point further: what would a cells scheduler do? Would it also perform a synchronous operation to obtain the claim?

My reaction to the global resource tracker idea has been quite negative. I want to like the idea because I like the thought of knowing I have the resources when I get my answer. It's just that I think the persistent claims (without the persistent part :) ) give us a lot of what we need. But I am still open to being convinced.

Paul



On 07/14/2014 10:16 AM, Sylvain Bauza wrote:
> On 12/07/2014 06:07, Jay Pipes wrote:
>> On 07/11/2014 07:14 AM, John Garbutt wrote:
>>> On 10 July 2014 16:59, Sylvain Bauza <sbauza at redhat.com> wrote:
>>>> On 10/07/2014 15:47, Russell Bryant wrote:
>>>>> On 07/10/2014 05:06 AM, Sylvain Bauza wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> === tl;dr: Now that we agree on waiting for the split
>>>>>> prereqs to be done, we are debating whether ResourceTracker
>>>>>> should be part of the scheduler code and, consequently,
>>>>>> whether the Scheduler should expose ResourceTracker APIs so
>>>>>> that Nova wouldn't own compute node resources. I'm proposing
>>>>>> to first land the RT as a Nova resource in Juno and move the
>>>>>> ResourceTracker into the Scheduler in K, so we at least merge
>>>>>> some patches by Juno. ===
>>>>>>
>>>>>> Some debates occurred recently about the scheduler split, so
>>>>>> I think it's important to loop back with you all to see
>>>>>> where we are and what the discussions are. Again, feel free
>>>>>> to express your opinions; they are welcome.
>>>>> Where did this resource tracker discussion come up?  Do you
>>>>> have any references that I can read to catch up on it?  I
>>>>> would like to see more detail on the proposal for what should
>>>>> stay in Nova vs. be moved.  What is the interface between
>>>>> Nova and the scheduler here?
>>>>
>>>> Oh, I missed the most important question you asked. So, about
>>>> the interface between the scheduler and Nova, the originally
>>>> agreed proposal is in the spec
>>>> https://review.openstack.org/82133 (approved), where the
>>>> Scheduler exposes:
>>>>  - select_destinations(): for querying the scheduler to provide
>>>>    candidates
>>>>  - update_resource_stats(): for updating the scheduler's
>>>>    internal state (i.e. HostState)
>>>>
>>>> Here, update_resource_stats() is called by the
>>>> ResourceTracker; see the implementations (in review) at
>>>> https://review.openstack.org/82778 and
>>>> https://review.openstack.org/104556.
>>>>
>>>> The alternative that was just raised this week is to provide a
>>>> new interface where the ComputeNode claims and frees resources,
>>>> so that all the resources are fully owned by the Scheduler. An
>>>> initial PoC has been posted at
>>>> https://review.openstack.org/103598, and I tried to see what a
>>>> ResourceTracker proxied by a Scheduler client would look like
>>>> at https://review.openstack.org/105747. As the spec hasn't been
>>>> written, the names of the interfaces are not properly defined,
>>>> but my proposal is:
>>>>  - select_destinations(): same as above
>>>>  - usage_claim(): claims a resource amount
>>>>  - usage_update(): updates a resource amount
>>>>  - usage_drop(): frees a resource amount
>>>>
>>>> Again, this is a dummy proposal; a spec has to be written if we
>>>> consider moving the RT.
>>>
>>> While I am not against moving the resource tracker, I feel we
>>> could move this to Gantt after the core scheduling has been
>>> moved.
>>
>> Big -1 from me on this, John.
>>
>> Frankly, I see no urgency whatsoever -- and actually very little
>> benefit -- to moving the scheduler out of Nova. The Gantt project I
>> think is getting ahead of itself by focusing on a split instead of
>> focusing on cleaning up the interfaces between nova-conductor,
>> nova-scheduler, and nova-compute.
>>
>
> -1 on saying there is no urgency. Don't you see the NFV group
> asking at each meeting what the status of the scheduler split is?

Frankly, I don't think a lot of the NFV use cases are well-defined.

Even more frankly, I don't see any benefit from a split-out scheduler
for a single NFV use case.

> Don't you see, at each Summit, the many talks (and the people
> attending them) discussing how OpenStack should look at Pets vs.
> Cattle and saying that the scheduler should be out of Nova?

There have been no concrete benefits discussed to having the scheduler
outside of Nova.

I don't really care how many people say that the scheduler should be out
of Nova unless those same people come to the table with concrete reasons
why. Just saying something is a benefit does not make it a benefit, and
I think I've outlined some of the very real dangers -- in terms of code
and payload complexity -- of breaking the scheduler out of Nova until
the interfaces are cleaned up and the scheduler actually owns the
resources upon which it exercises placement decisions.

> From an operator perspective, people have waited so long for a
> scheduler that does "scheduling" and not only "resource placement".

Could you elaborate a bit here? What operators are begging for the
scheduler to do more than resource placement? And if they are begging
for this, what use cases are they trying to address?

I'm genuinely curious, so looking forward to your reply here! :)

snip...

>> As for the idea that things will get *easier* once scheduler code
>> is broken out of Nova, I go back to my original statement that I
>> don't really see the benefit of the split at this point, and I
>> would just bring up the fact that Neutron/nova-network is a shining
>> example of how things can easily backfire when code is split too
>> early, before interfaces are cleaned up and responsibilities
>> between internal components are clearly agreed upon.
>
> Please, please, don't mix the rationale for the extensible Resource
> Tracker with the current efforts for moving out the Scheduler. Both
> of them aim for an agnostic and heterogeneous scheduler, but the two
> efforts are independent.
>
> The ResourceTracker is something purely Nova. Saying to Gantt "I
> want to store this data" and "I want you to select a destination"
> is agnostic enough that it does not require porting the
> ResourceTracker to the Scheduler.

Sorry, I'm not following you. Who is saying to Gantt "I want to store
this data"?

All I am saying is that the thing that places a resource on some
provider of that resource should be the thing that owns the process of a
requester *claiming* the resources on that provider, and in order to
properly track resources in a race-free way in such a system, the
system needs to contain the resource tracker.
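
To illustrate with a toy sketch (not real Nova code): when the
scheduler contains the tracker, placement and claim can happen
atomically:

  # Toy sketch: when the scheduler owns the resource tracker, placement
  # and claiming happen under one lock, so there is no race window.
  import threading

  class SchedulerWithTracker(object):
      def __init__(self, hosts):
          self._lock = threading.Lock()
          self._free = dict(hosts)  # host -> free units of some resource

      def place_and_claim(self, amount):
          """Pick a host AND claim on it atomically."""
          with self._lock:
              for host, free in self._free.items():
                  if free >= amount:
                      self._free[host] -= amount  # claim is part of placement
                      return host
          raise Exception('NoValidHost')

If placement and claiming live in different services, the gap between
them is exactly where the retries come from.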

> While I agree with defining the interfaces now, there is no reason,
> though, to say we would have to change anything in how Nova does
> that. The role of Gantt is to define the interfaces, draw the line
> between Scheduler and Nova, and forklift the Scheduler into a
> single project. No big bang is needed here.

Yeah, I just don't see the need to split the scheduler at this point,
sorry. :(

Best,
-jay