[openstack-dev] [heat] [nova] How should a holistic scheduler relate to Heat?

Mike Spreitzer mspreitz at us.ibm.com
Fri Apr 4 06:42:29 UTC 2014


Clint Byrum <clint at fewbar.com> wrote on 04/03/2014 07:01:16 PM:

> ... The whole question raises many more
> questions, and I wonder if there's just something you haven't told us
> about this use case. :-P

Yes, I seem to have made a muddle of things by starting in one corner of a
design space.  Let me try to reset this conversation, start from the
beginning, and go slowly.  I have adjusted the email subject line to
describe the overall discussion and to invite Nova people, who should also
participate because this involves the evolution of the Nova API.

Let's start with the simple exercise of designing a resource type for the 
existing server-groups feature of Nova, and then consider how to take one 
evolutionary step forward (from sequential to holistic scheduling).  By 
"scheduling" here I mean simply placement, not a more sophisticated thing 
that includes time as well.

The server-groups feature of Nova (
https://blueprints.launchpad.net/nova/+spec/instance-group-api-extension) 
allows a Nova client to declare a group (just the group as a thing unto 
itself, not listing its members) and associate placement policies with it, 
and include a reference to the group in each Nova API call that creates a 
member of the group --- thereby putting those instances in that group, for 
the purpose of letting the scheduling for those instances take the group's 
policies into account.  The policies currently supported are affinity and 
anti-affinity.  This does what might be called sequential scheduling: when 
an instance is created, its placement decision can take into account its 
group's policies and the placement decisions already made for instances 
previously created, but cannot take into account the issues of placing 
instances that have yet to be created.
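
To make "sequential" concrete, here is a minimal Python sketch.  It is not
Nova's scheduler code: the function name, its arguments, and the first-fit
choice of host are all assumptions made for illustration.  Each instance is
placed using only the group's policy and the placements already made.

```python
# Hypothetical sketch of sequential placement; not Nova's scheduler code.

def place_sequentially(requests, hosts, policy):
    """Place the instances of one group one at a time, in request order.

    requests: list of instance names, all members of the same group.
    hosts:    list of host names; every instance is eligible for all of them.
    policy:   "affinity" or "anti-affinity".
    Returns a dict mapping instance name to host; raises ValueError if an
    instance cannot be placed.
    """
    placement = {}
    for name in requests:
        # Only decisions already made are visible here; nothing is known
        # about the instances that come later in the list.
        used = set(placement.values())
        if policy == "anti-affinity":
            candidates = [h for h in hosts if h not in used]
        elif policy == "affinity":
            candidates = [h for h in hosts if not used or h in used]
        else:
            raise ValueError("unknown policy: %s" % policy)
        if not candidates:
            raise ValueError("no valid host for %s" % name)
        placement[name] = candidates[0]  # first fit, an assumption
    return placement
```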

We can define a Heat resource type for a server-group.  Such a resource 
would include its policy set, and not its members, among its properties. 
In the Heat snippet for an OS::Nova::Server there could be a reference to 
a server-group resource.  This directly reflects the API outlined above, 
the dependencies run in the right direction for that API, and it looks to 
me like a pretty simple and clear design. Do not ask me whether a 
server-group's attributes include its members.

If the only placement policies are anti-affinity policies and all servers
are eligible for the same places, then I think that there is no advantage
in scheduling holistically.  But I am interested in a broader set of
scenarios, and for those, holistic scheduling can get better results than
sequential scheduling in some cases.
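
One such case can be sketched in a few lines of Python, under assumptions
that are mine rather than anything in Nova: servers differ in which hosts
they are eligible for, the only policy is anti-affinity, the sequential
scheduler is a simple first-fit, and the holistic one is a brute-force
search over joint assignments.

```python
from itertools import product

def sequential(eligible):
    """First-fit, one server at a time; returns None if a server is stuck."""
    placement = {}
    for name, hosts in eligible.items():
        used = set(placement.values())
        free = [h for h in hosts if h not in used]
        if not free:
            return None
        placement[name] = free[0]
    return placement

def holistic(eligible):
    """Consider all servers at once; return any anti-affine assignment."""
    names = list(eligible)
    for combo in product(*(eligible[n] for n in names)):
        if len(set(combo)) == len(combo):  # every server on its own host
            return dict(zip(names, combo))
    return None

# s1 may go on either host, s2 only on h1 (hypothetical example data).
eligible = {"s1": ["h1", "h2"], "s2": ["h1"]}
print(sequential(eligible))  # None: s1 took h1, leaving nothing for s2
print(holistic(eligible))    # {'s1': 'h2', 's2': 'h1'}
```

The sequential pass fails here only because it commits s1 before it knows
anything about s2; the brute-force search is exponential and is meant to
show the difference in information available, not a practical algorithm.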

Now let us consider how to evolve the Nova API so that a server-group can 
be scheduled holistically.  That is, we want to enable the scheduler to 
look at both the group's policies and its membership, all at once, and 
make a joint decision about how to place all the servers (instances) in 
the group.  There is no agreed answer here yet, but let me suggest one 
that I hope can move this discussion forward.  The key idea is to first 
associate not just the policies but also a description of the group's 
members with the group, then get the joint scheduling decision made, then 
let the client orchestrate the actual creation of the servers.  This could 
be done with a two-step API: one step creates the group, given its 
policies and member descriptions, and in the second step the client makes 
the calls that cause the individual servers to be made; as before, each 
such call includes a reference to the group --- which is now associated 
(under the covers) with a table that lists the chosen placement for each 
server.  The server descriptions needed in the first step are not as 
extensive as the descriptions needed in the second step.  For example, the 
holistic scheduler would not care about the user_data of a server.  We 
could define a new data structure for member descriptions used in the 
first step (this would probably be a pared-down version of what is used in 
the second step).
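
The two-step shape could look roughly like the following Python sketch.
Everything in it is hypothetical: the class, the method names, the fields
of a member description, and the trivial placement rule stand in for an
API that does not exist yet.

```python
import uuid

class GroupService:
    """Hypothetical stand-in for the suggested two-step API."""

    def __init__(self, hosts):
        self.hosts = list(hosts)
        self.groups = {}

    def create_group(self, policies, members):
        """Step 1: take policies plus pared-down member descriptions and
        make the joint placement decision for the whole group."""
        if policies != ["anti-affinity"]:
            raise NotImplementedError("sketch handles anti-affinity only")
        if len(members) > len(self.hosts):
            raise ValueError("not enough hosts for anti-affinity")
        # Trivial joint decision: one distinct host per member.
        placements = {m["name"]: h for m, h in zip(members, self.hosts)}
        group_id = str(uuid.uuid4())
        self.groups[group_id] = {
            "members": {m["name"]: m for m in members},
            "placements": placements,  # the table kept "under the covers"
        }
        return group_id

    def create_server(self, name, flavor, group_id, user_data=None):
        """Step 2: the full description arrives; placement is looked up."""
        group = self.groups[group_id]
        described = group["members"].get(name)
        # Consistency can only be checked now, on the second description.
        if described is None or described["flavor"] != flavor:
            raise ValueError("server %s does not match its group entry" % name)
        return {"name": name, "flavor": flavor, "user_data": user_data,
                "host": group["placements"][name]}
```

Note that user_data appears only in the second step; the first step sees
just the fields assumed to matter for placement (here, name and flavor).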

Now let us consider how to expose this through Heat.  We could take a 
direct approach: modify our original server-group resource type so that 
its properties include not only the policy set but also the list of member 
descriptions, and the rest remains unchanged.  That would work, but it 
would be awkward for template authors.  They would now have to write two
descriptions of each server --- with no help at authoring time for
ensuring the requisite consistency between the two descriptions.  Of
course, the Nova API is no better regarding consistency: it can (at best)
check for consistency when it sees the second description of a given
server.  But the Nova API is imperative, while a Heat template is intended
to be declarative.  I do not like double description because it adds bulk
and creates additional opportunities for mistakes (compared to single
description).

How can we avoid double-description?  A few ideas come to mind.

One approach involves a change in the Heat engine's framework: allow a 
resource type plugin to navigate the resource graph to look at related 
resources.  Suppose the implementation of a server-group resource can 
navigate to and read the Heat descriptions of its members to compute the 
member description list needed by my hypothesized new Nova API.  That 
allows the template to continue to hold a single description, and the 
nominal dependencies run in the right direction (the members are created 
after the group has made its joint decision).
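
As a rough illustration of that navigation, the Python sketch below walks
a template (held as a plain dict) and derives the pared-down member
descriptions from the single description of each server.  The group's
resource type name, the server_group property, and the choice of which
properties matter for scheduling are assumptions for this example, not
part of Heat today; this is not how a Heat resource plugin is written.

```python
# Properties assumed to matter to a holistic scheduler (hypothetical).
SCHEDULING_KEYS = ("flavor", "availability_zone")

def member_descriptions(template, group_name):
    """Collect pared-down descriptions of the servers that refer to the
    named group, reading them from the servers' own definitions."""
    members = []
    for name, res in sorted(template["resources"].items()):
        if res.get("type") != "OS::Nova::Server":
            continue
        props = res.get("properties", {})
        # "server_group" is a hypothetical property holding the reference.
        if props.get("server_group") != {"get_resource": group_name}:
            continue
        desc = {"name": name}
        desc.update((k, props[k]) for k in SCHEDULING_KEYS if k in props)
        members.append(desc)
    return members

template = {
    "resources": {
        "my_group": {"type": "OS::Nova::ServerGroup",  # hypothetical type
                     "properties": {"policies": ["anti-affinity"]}},
        "web": {"type": "OS::Nova::Server",
                "properties": {"flavor": "m1.small",
                               "user_data": "#!/bin/sh\necho hi",
                               "server_group": {"get_resource": "my_group"}}},
        "db": {"type": "OS::Nova::Server",
               "properties": {"flavor": "m1.large",
                              "server_group": {"get_resource": "my_group"}}},
        "other": {"type": "OS::Nova::Server",
                  "properties": {"flavor": "m1.tiny"}},
    }
}
```

The template author writes each server once; user_data and other
properties irrelevant to placement are simply left out of the derived list.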

Another approach involves a more pervasive change to the Heat engine's
framework (or maybe no change; I am not familiar with initialization) so
that there are two passes over the graph.  Keeping dependencies in the
same direction as before, it could work as follows.  In the first pass,
some representation of the server-group is created, and then a
description of each member is associated with the group.  In the second
pass, the hypothesized new Nova API for creating the server-group given
both its policies and member descriptions is called, and then the
orchestration of the group's members happens (carrying references to the
group).
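
A minimal Python sketch of the two passes follows.  It says nothing about
how the Heat engine is actually structured; the flat resource list and the
create_group and create_server callbacks are hypothetical stand-ins.

```python
def two_pass_create(resources, create_group, create_server):
    """resources: list of (name, kind, spec) tuples in dependency order,
    with each group listed before its members.  kind is "group" or
    "server".  The two callbacks stand in for the hypothesized Nova calls.
    """
    # Pass 1: no API calls.  Represent each group and attach a
    # description of every member to it.
    pending = {}
    for name, kind, spec in resources:
        if kind == "group":
            pending[name] = {"policies": spec["policies"], "members": []}
        elif kind == "server":
            pending[spec["group"]]["members"].append(
                {"name": name, "flavor": spec["flavor"]})

    # Pass 2: create each group with its full membership known, then
    # create the members, each carrying a reference to its group.
    group_ids = {}
    created = []
    for name, kind, spec in resources:
        if kind == "group":
            group_ids[name] = create_group(pending[name]["policies"],
                                           pending[name]["members"])
        elif kind == "server":
            created.append(
                create_server(name, spec, group_ids[spec["group"]]))
    return created
```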

Another approach would be to put the holistic scheduling entirely prior to
Heat: let something else solve the holistic scheduling problem and emit
Heat templates that include or refer to scheduling decisions that have
already (by the time there is a Heat template) been made.  That could
work, if we really restrict our attention to explicit server-groups only.
But we will also want to give server-group behavior to an autoscaling
group.  That is, allow an autoscaling group to be given a set of placement
policies to apply, just as a server-group can be given policies.
Similarly, we will want to allow the workers of a Hadoop cluster in
Sahara to have server-group behavior.  Trying to keep holistic scheduling
prior to Heat will involve abstraction-breaking (maybe it would be best
approached as abstraction factoring) so that the prior scheduler can make
the decisions required for the ASG and Sahara abstractions (and so on).
But as an autoscaling group or Hadoop worker cluster or whatever
autonomously increases its size, the holistic scheduler should be
consulted.  So we will also have holistic scheduling that comes after Heat
as well as before.

Also consider what the input to the holistic scheduler looks like, and
what tools will emit and otherwise process that input.  If it looks
nothing like a Heat template, then a given producer or consumer will be
dedicated to either stuff with holistic scheduling or stuff without it.
But it is pretty natural for the input to a holistic scheduler to look
like a Heat template augmented with policies, because that's pretty much
what is needed; this could unify an otherwise split ecosystem of tools.

What do you think?

Thanks,
Mike