[openstack-dev] [heat] [nova] How should a holistic scheduler relate to Heat?
Mike Spreitzer
mspreitz at us.ibm.com
Fri Apr 4 06:42:29 UTC 2014
Clint Byrum <clint at fewbar.com> wrote on 04/03/2014 07:01:16 PM:
> ... The whole question raises many more
> questions, and I wonder if there's just something you haven't told us
> about this use case. :-P
Yes, I seem to have made a muddle of things by starting in one corner of a
design space. Let me try to reset this conversation and start from the
beginning and go slowly enough. I have adjusted the email subject line to
describe the overall discussion and invite Nova people, who should also
participate because this involves the evolution of the Nova API.
Let's start with the simple exercise of designing a resource type for the
existing server-groups feature of Nova, and then consider how to take one
evolutionary step forward (from sequential to holistic scheduling). By
"scheduling" here I mean simply placement, not a more sophisticated thing
that includes time as well.
The server-groups feature of Nova
(https://blueprints.launchpad.net/nova/+spec/instance-group-api-extension)
allows a Nova client to declare a group (just the group as a thing unto
itself, not listing its members) and associate placement policies with it,
and include a reference to the group in each Nova API call that creates a
member of the group --- thereby putting those instances in that group, for
the purpose of letting the scheduling for those instances take the group's
policies into account. The policies currently supported are affinity and
anti-affinity. This does what might be called sequential scheduling: when
an instance is created, its placement decision can take into account its
group's policies and the placement decisions already made for instances
previously created, but cannot take into account the issues of placing
instances that have yet to be created.
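To make "sequential" concrete, here is a minimal Python sketch. It assumes each server has a known list of eligible hosts and the group has a single policy. The function name place_sequentially and the first-fit choice are illustrative only; this is not Nova's filter scheduler. Each decision sees the policy and the earlier placements, but nothing about later instances.

```python
from typing import Dict, List, Optional


def place_sequentially(
    servers: List[str],
    eligible: Dict[str, List[str]],
    policy: str,
) -> Dict[str, Optional[str]]:
    """Place group members one at a time, in creation order (first fit).

    Each decision sees the group's policy and the hosts already chosen for
    earlier members, but knows nothing about members created later.
    """
    placed: Dict[str, Optional[str]] = {}
    for server in servers:
        used = {host for host in placed.values() if host is not None}
        choice: Optional[str] = None
        for host in eligible[server]:
            if policy == "anti-affinity" and host in used:
                continue  # host already holds a member of the group
            if policy == "affinity" and used and host not in used:
                continue  # must join the host already chosen for the group
            choice = host
            break
        placed[server] = choice  # None: no eligible host satisfies the policy
    return placed


# "a" may go to h1 or h2, "b" only to h1. First fit puts "a" on h1,
# which leaves nowhere legal for "b" under anti-affinity.
print(place_sequentially(["a", "b"], {"a": ["h1", "h2"], "b": ["h1"]}, "anti-affinity"))
# {'a': 'h1', 'b': None}
```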
We can define a Heat resource type for a server-group. Such a resource
would include its policy set, and not its members, among its properties.
In the Heat snippet for an OS::Nova::Server there could be a reference to
a server-group resource. This directly reflects the API outlined above,
the dependencies run in the right direction for that API, and it looks to
me like a pretty simple and clear design. Do not ask me whether a
server-group's attributes include its members.
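As an illustration, here is such a template sketched as a HOT-style Python dict, plus a helper that computes which resources each resource depends on through get_resource. The type name OS::Nova::ServerGroup and carrying the group reference in scheduler_hints are assumptions made for this sketch. The only point is that each server depends on the group, not the other way around.

```python
from typing import Any, Dict, Set

# HOT-style template as a Python dict. "OS::Nova::ServerGroup" and the
# scheduler_hints "group" reference are assumptions for illustration.
TEMPLATE: Dict[str, Any] = {
    "heat_template_version": "2013-05-23",
    "resources": {
        "group": {
            "type": "OS::Nova::ServerGroup",
            # The group lists its policies, not its members.
            "properties": {"policies": ["anti-affinity"]},
        },
        "web1": {
            "type": "OS::Nova::Server",
            "properties": {
                "flavor": "m1.small",
                "image": "fedora-20",
                "scheduler_hints": {"group": {"get_resource": "group"}},
            },
        },
        "web2": {
            "type": "OS::Nova::Server",
            "properties": {
                "flavor": "m1.small",
                "image": "fedora-20",
                "scheduler_hints": {"group": {"get_resource": "group"}},
            },
        },
    },
}


def references(value: Any) -> Set[str]:
    """Names of resources referenced through get_resource anywhere in value."""
    found: Set[str] = set()
    if isinstance(value, dict):
        if "get_resource" in value:
            found.add(value["get_resource"])
        for item in value.values():
            found |= references(item)
    elif isinstance(value, list):
        for item in value:
            found |= references(item)
    return found


def dependencies(template: Dict[str, Any]) -> Dict[str, Set[str]]:
    """Map each resource to the resources it must wait for."""
    return {
        name: references(resource.get("properties", {}))
        for name, resource in template["resources"].items()
    }


print(dependencies(TEMPLATE))
# {'group': set(), 'web1': {'group'}, 'web2': {'group'}}
```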
If the only placement policies are anti-affinity policies and all servers
are eligible for the same places, then I think there is no advantage
in scheduling holistically. But I am interested in a broader set of
scenarios, and for those holistic scheduling can get better results than
sequential scheduling in some cases.
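Here is a minimal sketch of one such case. It assumes per-server eligible host lists and a single affinity or anti-affinity policy. When "b" can only go to h1, a scheduler that considers both members at once puts "a" on h2. Deciding "a" first by first fit would take h1 and strand "b". Exhaustive search is used only for clarity. It is exponential in group size, and a real holistic scheduler would presumably use a proper solver.

```python
from itertools import product
from typing import Dict, List, Optional


def satisfies(assignment: Dict[str, str], policy: str) -> bool:
    """Check a complete assignment (member -> host) against one group policy."""
    hosts = list(assignment.values())
    if policy == "anti-affinity":
        return len(set(hosts)) == len(hosts)  # every member on its own host
    if policy == "affinity":
        return len(set(hosts)) <= 1  # all members share one host
    raise ValueError(f"unknown policy: {policy}")


def place_holistically(
    members: List[str],
    eligible: Dict[str, List[str]],
    policy: str,
) -> Optional[Dict[str, str]]:
    """Choose hosts for all members jointly; None if no assignment works."""
    for combo in product(*(eligible[m] for m in members)):
        assignment = dict(zip(members, combo))
        if satisfies(assignment, policy):
            return assignment
    return None


print(place_holistically(["a", "b"], {"a": ["h1", "h2"], "b": ["h1"]}, "anti-affinity"))
# {'a': 'h2', 'b': 'h1'}
```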
Now let us consider how to evolve the Nova API so that a server-group can
be scheduled holistically. That is, we want to enable the scheduler to
look at both the group's policies and its membership, all at once, and
make a joint decision about how to place all the servers (instances) in
the group. There is no agreed answer here yet, but let me suggest one
that I hope can move this discussion forward. The key idea is to first
associate not just the policies but also a description of the group's
members with the group, then get the joint scheduling decision made, then
let the client orchestrate the actual creation of the servers. This could
be done with a two-step API: one step creates the group, given its
policies and member descriptions, and in the second step the client makes
the calls that cause the individual servers to be made; as before, each
such call includes a reference to the group --- which is now associated
(under the covers) with a table that lists the chosen placement for each
server. The server descriptions needed in the first step are not as
extensive as the descriptions needed in the second step. For example, the
holistic scheduler would not care about the user_data of a server. We
could define a new data structure for member descriptions used in the
first step (this would probably be a pared-down version of what is used in
the second step).
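Here is a rough Python sketch of the two-step shape, and it is purely hypothetical. The names create_group, create_server and MemberDescription, and the placement table itself, are invented for illustration; none of them is a proposed Nova API. The joint decision is a brute-force search that handles only affinity and anti-affinity. Note that the step-one description omits user_data, which appears only in step two.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MemberDescription:
    """Pared-down member description for step one (hypothetical)."""
    name: str
    flavor: str
    eligible_hosts: Tuple[str, ...]  # stand-in for whatever constrains placement


@dataclass
class ServerGroup:
    policies: List[str]
    placement: Dict[str, str] = field(default_factory=dict)  # member -> host


def create_group(policies: List[str], members: List[MemberDescription]) -> ServerGroup:
    """Step one: see the policies and all members at once, decide jointly."""
    for combo in product(*(m.eligible_hosts for m in members)):
        if "anti-affinity" in policies and len(set(combo)) < len(combo):
            continue
        if "affinity" in policies and len(set(combo)) > 1:
            continue
        placement = {m.name: host for m, host in zip(members, combo)}
        return ServerGroup(policies=list(policies), placement=placement)
    raise RuntimeError("no placement satisfies the group's policies")


def create_server(group: ServerGroup, name: str, flavor: str, user_data: str) -> str:
    """Step two: the usual create call, carrying the group reference.

    The full description (flavor, user_data, ...) arrives only now, but it
    is not used for placement; the host is looked up in the step-one table.
    """
    if name not in group.placement:
        raise KeyError(f"{name} was not described when the group was created")
    return group.placement[name]


group = create_group(
    ["anti-affinity"],
    [MemberDescription("a", "m1.small", ("h1", "h2")),
     MemberDescription("b", "m1.small", ("h1",))],
)
print(create_server(group, "b", "m1.small", user_data="#!/bin/sh\necho hi"))  # h1
```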
Now let us consider how to expose this through Heat. We could take a
direct approach: modify our original server-group resource type so that
its properties include not only the policy set but also the list of member
descriptions, and the rest remains unchanged. That would work, but it
would be awkward for template authors. They now have to write two
descriptions of each server --- with no help at authoring time for
ensuring the requisite consistency between the two descriptions. Of
course, the Nova API is no better regarding consistency; it can (at best)
check for consistency when it sees the second description of a given
server. But the Nova API is imperative, while a Heat template is intended
to be declarative. I do not like double description because it adds bulk
and creates additional opportunities for mistakes (compared to single
description).
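The receiver of the second description can catch mismatches only after the fact. Here is a minimal sketch of such a check. It assumes both descriptions are flat dicts and that every key in the pared-down one must agree with the full one. The function name is hypothetical.

```python
from typing import Dict, List


def inconsistencies(
    first: Dict[str, Dict[str, str]],
    name: str,
    second: Dict[str, str],
) -> List[str]:
    """Compare a server's full (second) description with the pared-down
    (first) one recorded for the group, returning human-readable problems."""
    if name not in first:
        return [f"{name}: not among the group's member descriptions"]
    return [
        f"{name}: {key} is {second.get(key)!r}, but the group says {value!r}"
        for key, value in sorted(first[name].items())
        if second.get(key) != value
    ]


members = {"web1": {"flavor": "m1.small", "image": "fedora-20"}}
print(inconsistencies(members, "web1",
                      {"flavor": "m1.large", "image": "fedora-20", "user_data": "..."}))
# ["web1: flavor is 'm1.large', but the group says 'm1.small'"]
```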
How can we avoid double-description? A few ideas come to mind.
One approach involves a change in the Heat engine's framework: allow a
resource type plugin to navigate the resource graph to look at related
resources. Suppose the implementation of a server-group resource can
navigate to and read the Heat descriptions of its members to compute the
member description list needed by my hypothesized new Nova API. That
allows the template to continue to hold a single description, and the
nominal dependencies run in the right direction (the members are created
after the group has made its joint decision).
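Here is a sketch of what such a plugin might compute, assuming the framework change described here lets it see the whole template as a dict. It finds the servers whose scheduler_hints refer to the group and pares each one down to the fields a holistic scheduler is assumed to need. SCHEDULING_KEYS and the scheduler_hints convention are assumptions. This is not an existing Heat plugin interface.

```python
from typing import Any, Dict, List

# Assumption: the fields a holistic scheduler needs; user_data is left out.
SCHEDULING_KEYS = ("flavor", "image", "availability_zone")


def member_descriptions(template: Dict[str, Any], group: str) -> List[Dict[str, Any]]:
    """Derive the group's member list from the servers that reference it."""
    members = []
    for name, resource in sorted(template["resources"].items()):
        if resource.get("type") != "OS::Nova::Server":
            continue
        props = resource.get("properties", {})
        if props.get("scheduler_hints", {}).get("group") != {"get_resource": group}:
            continue  # not a member of this group
        description = {k: props[k] for k in SCHEDULING_KEYS if k in props}
        description["name"] = name
        members.append(description)
    return members


template = {
    "resources": {
        "grp": {"type": "OS::Nova::ServerGroup",
                "properties": {"policies": ["anti-affinity"]}},
        "web1": {"type": "OS::Nova::Server",
                 "properties": {"flavor": "m1.small", "image": "fedora-20",
                                "user_data": "#!/bin/sh",
                                "scheduler_hints": {"group": {"get_resource": "grp"}}}},
        "db": {"type": "OS::Nova::Server",
               "properties": {"flavor": "m1.large", "image": "fedora-20"}},
    },
}
print(member_descriptions(template, "grp"))
# [{'flavor': 'm1.small', 'image': 'fedora-20', 'name': 'web1'}]
```

The template still holds a single description of web1. The member list is computed from it rather than written a second time.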
Another approach involves a more pervasive change to the Heat engine's
framework (or maybe no change; I am not familiar with how initialization works there) so
that there are two passes over the graph. Keeping dependencies in the
same direction as before, it could work as follows. In the first pass:
there is first some representation of the server-group created, and then a
description of each member is associated with the group. In the second
pass: the hypothesized new Nova API for creating the server-group given
both its policies and member descriptions is called, then the
orchestration of the group's members happens (carrying references to the
group).
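Here is a toy sketch of the two-pass idea. It assumes the resources are already listed in dependency order (group before members). A pluggable schedule function stands in for the hypothesized Nova call. The name two_pass_create and the resource dict layout are invented for illustration. Pass one only records descriptions. Pass two makes the joint decision before any member is created.

```python
from typing import Callable, Dict, List

Schedule = Callable[[List[str], Dict[str, dict]], Dict[str, str]]


def two_pass_create(resources: Dict[str, dict], order: List[str], schedule: Schedule) -> List[str]:
    groups: Dict[str, dict] = {}

    # Pass 1: same dependency order; record the group and attach each
    # member's description to it, without creating any server.
    for name in order:
        res = resources[name]
        if res["type"] == "group":
            groups[name] = {"policies": res["policies"], "members": {}}
        else:
            groups[res["group"]]["members"][name] = res["desc"]

    # Pass 2: the group now knows all its members, so it is scheduled
    # holistically before any of them is created.
    log: List[str] = []
    for name in order:
        res = resources[name]
        if res["type"] == "group":
            g = groups[name]
            g["placement"] = schedule(g["policies"], g["members"])
            log.append(f"scheduled {name}")
        else:
            host = groups[res["group"]]["placement"][name]
            log.append(f"created {name} on {host}")
    return log


def spread(policies: List[str], members: Dict[str, dict]) -> Dict[str, str]:
    """Placeholder scheduler: one host per member, in name order."""
    return {m: f"h{i + 1}" for i, m in enumerate(sorted(members))}


resources = {
    "grp": {"type": "group", "policies": ["anti-affinity"]},
    "web1": {"type": "server", "group": "grp", "desc": {"flavor": "m1.small"}},
    "web2": {"type": "server", "group": "grp", "desc": {"flavor": "m1.small"}},
}
for line in two_pass_create(resources, ["grp", "web1", "web2"], spread):
    print(line)
# scheduled grp
# created web1 on h1
# created web2 on h2
```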
Another approach would be to put the holistic scheduling entirely prior to
Heat; let something else solve the holistic scheduling problem and emit
Heat templates that include or refer to scheduling decisions that have
already (by the time there is a Heat template) been made. That could
work, if we really restrict our attention to explicit server-groups only.
But we will also want to give server-group behavior to an autoscaling
group. That is, allow an autoscaling group to be given a set of placement
policies to apply just like a server-group can be given policies.
Similarly, we will want to allow the workers of a Hadoop cluster in
Sahara to have server-group behavior. Trying to keep holistic scheduling
prior to Heat will involve abstraction-breaking (maybe it would be best
approached as abstraction factoring) so that the prior scheduler can make
the decisions required for the ASG and Sahara abstractions (and so on).
But as an autoscaling group or Hadoop worker cluster or whatever
autonomously increases its size, the holistic scheduler should be
consulted. So we will also have holistic scheduling that comes after Heat
as well as before. Also consider: what does the input to the holistic
scheduler look like, and what tools will emit and otherwise process that
input? If it looks nothing like a Heat template, then a given producer or
consumer will be dedicated to either stuff with holistic scheduling or
stuff without it. But it is pretty natural for the input to a holistic
scheduler to look like a Heat template augmented with policies, because
that's pretty much what is needed; this could unify an otherwise split
ecosystem of tools.
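To make "a Heat template augmented with policies" concrete, here is a minimal sketch. It assumes a hypothetical placement_policies property on an OS::Heat::AutoScalingGroup. The same extraction produces the scheduler's input when the group is first created and again when it grows. The property name, PLACEMENT_KEYS and to_scheduler_input are assumptions, not existing Heat features.

```python
from typing import Any, Dict

# HOT-style template as a dict; "placement_policies" is a hypothetical property.
TEMPLATE: Dict[str, Any] = {
    "resources": {
        "workers": {
            "type": "OS::Heat::AutoScalingGroup",
            "properties": {
                "min_size": 3,
                "max_size": 10,
                "placement_policies": ["anti-affinity"],
                "resource": {
                    "type": "OS::Nova::Server",
                    "properties": {"flavor": "m1.large", "image": "hadoop-worker",
                                   "user_data": "#!/bin/sh"},
                },
            },
        },
    },
}

# Assumption: the only member fields the scheduler needs.
PLACEMENT_KEYS = ("flavor", "image")


def to_scheduler_input(template: Dict[str, Any], group: str, size: int) -> Dict[str, Any]:
    """Build the holistic scheduler's input for `size` members of `group`."""
    props = template["resources"][group]["properties"]
    member_props = props["resource"]["properties"]
    member = {k: member_props[k] for k in PLACEMENT_KEYS if k in member_props}
    return {
        "policies": list(props["placement_policies"]),
        "members": [dict(member, name=f"{group}-{i}") for i in range(size)],
    }


min_size = TEMPLATE["resources"]["workers"]["properties"]["min_size"]
initial = to_scheduler_input(TEMPLATE, "workers", min_size)
grown = to_scheduler_input(TEMPLATE, "workers", 4)  # consulted again on scale-up
print(len(initial["members"]), len(grown["members"]))  # 3 4
```

On scale-up, a real scheduler would presumably also need to know where the existing members already are. That part is omitted here.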
What do you think?
Thanks,
Mike