<tt><font size=2>Clint Byrum &lt;clint@fewbar.com&gt; wrote on 04/03/2014

07:01:16 PM:<br>

<br>

> ... The whole question raises many more<br>

> questions, and I wonder if there's just something you haven't told

us<br>

> about this use case. :-P<br>

</font></tt>

<br><tt><font size=2>Yes, I seem to have made a muddle of things by starting

in one corner of a design space.  Let me try to reset this conversation

and start from the beginning, going one step at a time.  I have adjusted

the email subject line to describe the overall discussion and invite Nova

people, who should also participate because this involves the evolution

of the Nova API.</font></tt>

<br>

<br><tt><font size=2>Let's start with the simple exercise of designing

a resource type for the existing server-groups feature of Nova, and then

consider how to take one evolutionary step forward (from sequential to

holistic scheduling).  By "scheduling" here I mean simply

placement, not a more sophisticated thing that includes time as well.</font></tt>

<br>

<br><tt><font size=2>The server-groups feature of Nova (</font></tt><a href="https://blueprints.launchpad.net/nova/+spec/instance-group-api-extension"><tt><font size=2>https://blueprints.launchpad.net/nova/+spec/instance-group-api-extension</font></tt></a><tt><font size=2>)

allows a Nova client to declare a group (just the group as a thing unto

itself, not listing its members) and associate placement policies with

it, and include a reference to the group in each Nova API call that creates

a member of the group --- thereby putting those instances in that group,

for the purpose of letting the scheduling for those instances take the

group's policies into account.  The policies currently supported are

affinity and anti-affinity.  This does what might be called sequential

scheduling: when an instance is created, its placement decision can take

into account its group's policies and the placement decisions already made

for instances previously created, but cannot take into account the issues

of placing instances that have yet to be created.</font></tt>
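
<br>

<br><tt><font size=2>As a rough illustration of sequential scheduling (a toy sketch, not Nova's scheduler; the function name, host names and first-fit rule are assumptions made for the example), each placement below looks only at the group's policy and at the members already placed:</font></tt>

<pre>
```python
# Illustrative sketch only: a toy sequential placer, not Nova's scheduler.
from typing import Dict, List, Optional


def place_sequentially(policy: str, hosts: List[str],
                       placed: Dict[str, str], server: str) -> Optional[str]:
    """Pick a host for one new server, looking only at servers already placed.

    policy is "affinity" or "anti-affinity"; placed maps server name to host
    for group members created earlier. Returns None if no host qualifies.
    """
    used = set(placed.values())
    for host in hosts:
        if policy == "affinity" and used and host not in used:
            continue  # must share a host with the existing members
        if policy == "anti-affinity" and host in used:
            continue  # must avoid every host an existing member occupies
        placed[server] = host  # first fit: take the first acceptable host
        return host
    return None


# Members are created one call at a time; later members are unknown to
# the decision made for earlier ones.
group_placement: Dict[str, str] = {}
for name in ["web-1", "web-2", "web-3"]:
    place_sequentially("anti-affinity", ["h1", "h2", "h3"], group_placement, name)
print(group_placement)
```
</pre>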

<br>

<br><tt><font size=2>We can define a Heat resource type for a server-group.

 Such a resource would include its policy set, and not its members,

among its properties.  In the Heat snippet for an OS::Nova::Server

there could be a reference to a server-group resource.  This directly

reflects the API outlined above, the dependencies run in the right direction

for that API, and it looks to me like a pretty simple and clear design.

I am not taking a position here on whether a server-group's attributes include its members.</font></tt>
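
<br>

<br><tt><font size=2>To make the dependency direction concrete, here is a minimal sketch that models such a template as a Python dict and derives a creation order from the servers' references to the group.  The type name OS::Nova::ServerGroup, the "group" property and the resource names are assumptions for illustration, and this is not how the Heat engine itself computes dependencies.</font></tt>

<pre>
```python
# Illustrative sketch only: template layout and property names are assumed.
from graphlib import TopologicalSorter
from typing import Any, Dict, List

TEMPLATE: Dict[str, Dict[str, Any]] = {
    # The group carries its policy set, and not its members.
    "my_group": {"type": "OS::Nova::ServerGroup",
                 "properties": {"policies": ["anti-affinity"]}},
    # Each server carries a reference to the group.
    "server_a": {"type": "OS::Nova::Server",
                 "properties": {"group": {"get_resource": "my_group"}}},
    "server_b": {"type": "OS::Nova::Server",
                 "properties": {"group": {"get_resource": "my_group"}}},
}


def creation_order(template: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return resource names so that a referenced group precedes its servers."""
    deps = {}
    for name, resource in template.items():
        ref = resource["properties"].get("group")
        # A server depends on the group it references; the group on nothing.
        deps[name] = {ref["get_resource"]} if ref else set()
    return list(TopologicalSorter(deps).static_order())


order = creation_order(TEMPLATE)
print(order)
```
</pre>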

<br>

<br><tt><font size=2>If the only placement policies are anti-affinity policies

and all servers are eligible for the same places then I think that there

is no advantage in scheduling holistically.  But I am interested in

a broader set of scenarios, and for those holistic scheduling can get better

results than sequential scheduling in some cases.</font></tt>
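
<br>

<br><tt><font size=2>A small worked case, with made-up servers and hosts: under anti-affinity, suppose server "a" is eligible for hosts h1 and h2 while server "b" is eligible only for h1.  The sketch below uses first-fit for the sequential case and brute-force search for the holistic case; both are stand-ins chosen for brevity, not proposals for a real scheduler.</font></tt>

<pre>
```python
# Illustrative sketch only: brute-force joint placement for a tiny group.
from itertools import product
from typing import Dict, List, Optional

# Hypothetical eligibility: which hosts each server may use.
ELIGIBLE: Dict[str, List[str]] = {"a": ["h1", "h2"], "b": ["h1"]}


def sequential(order: List[str]) -> Optional[Dict[str, str]]:
    """Anti-affinity, first fit, one server at a time; None if one fails."""
    placed: Dict[str, str] = {}
    for server in order:
        free = [h for h in ELIGIBLE[server] if h not in placed.values()]
        if not free:
            return None
        placed[server] = free[0]
    return placed


def holistic() -> Optional[Dict[str, str]]:
    """Anti-affinity, considering all members at once."""
    servers = list(ELIGIBLE)
    for hosts in product(*(ELIGIBLE[s] for s in servers)):
        if len(set(hosts)) == len(hosts):  # every server on a distinct host
            return dict(zip(servers, hosts))
    return None


print(sequential(["a", "b"]))  # "a" takes h1 first, leaving nothing for "b"
print(holistic())              # finds a placement that fits both servers
```
</pre>

<tt><font size=2>Creating "b" before "a" would also succeed sequentially, but that relies on the client happening to pick a good creation order.</font></tt>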

<br>

<br><tt><font size=2>Now let us consider how to evolve the Nova API so

that a server-group can be scheduled holistically.  That is, we want

to enable the scheduler to look at both the group's policies and its membership,

all at once, and make a joint decision about how to place all the servers

(instances) in the group.  There is no agreed answer here yet, but

let me suggest one that I hope can move this discussion forward.  The

key idea is to first associate not just the policies but also a description

of the group's members with the group, then get the joint scheduling decision

made, then let the client orchestrate the actual creation of the servers.

 This could be done with a two-step API: one step creates the group,

given its policies and member descriptions, and in the second step the

client makes the calls that cause the individual servers to be made; as

before, each such call includes a reference to the group --- which is now

associated (under the covers) with a table that lists the chosen placement

for each server.  The server descriptions needed in the first step

are not as extensive as the descriptions needed in the second step.  For

example, the holistic scheduler would not care about the user_data of a

server.  We could define a new data structure for member descriptions

used in the first step (this would probably be a pared-down version of

what is used in the second step).</font></tt>
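
<br>

<br><tt><font size=2>The two-step shape described above might be sketched as follows.  Every name here (MemberDescription, ServerGroup, create_group, create_server) is hypothetical and none of it is an existing Nova API; the joint decision is a trivial stand-in that handles only anti-affinity.</font></tt>

<pre>
```python
# Illustrative sketch only: every name here is hypothetical, not a Nova API.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MemberDescription:
    """Pared-down description: only what placement needs (no user_data)."""
    name: str
    flavor: str


@dataclass
class ServerGroup:
    policies: List[str]
    placement: Dict[str, str] = field(default_factory=dict)  # name to host


def create_group(policies: List[str], members: List[MemberDescription],
                 hosts: List[str]) -> ServerGroup:
    """Step 1: record policies and members, and decide all placements jointly."""
    if policies != ["anti-affinity"]:
        raise NotImplementedError("stand-in handles only anti-affinity")
    if len(members) > len(hosts):
        raise ValueError("not enough hosts for anti-affinity")
    group = ServerGroup(policies=list(policies))
    for member, host in zip(members, hosts):
        group.placement[member.name] = host  # one distinct host per member
    return group


def create_server(name: str, flavor: str, user_data: str,
                  group: ServerGroup) -> Dict[str, str]:
    """Step 2: create one server; the group supplies the chosen host."""
    if name not in group.placement:
        raise KeyError(f"{name} was not described when the group was created")
    return {"name": name, "flavor": flavor, "user_data": user_data,
            "host": group.placement[name]}


group = create_group(["anti-affinity"],
                     [MemberDescription("db-1", "m1.large"),
                      MemberDescription("db-2", "m1.large")],
                     hosts=["h1", "h2", "h3"])
server = create_server("db-2", "m1.large", "#!/bin/sh", group)
print(server["host"])
```
</pre>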

<br>

<br><tt><font size=2>Now let us consider how to expose this through Heat.

 We could take a direct approach: modify our original server-group

resource type so that its properties include not only the policy set but

also the list of member descriptions, and the rest remains unchanged.  That

would work, but it would be awkward for template authors.  They now

have to write two descriptions of each server --- with no help at authoring

time for ensuring the requisite consistency between the two descriptions.

 Of course, the Nova API is no better regarding consistency: it can

(at best) check for consistency when it sees the second description of

a given server.  But the Nova API is imperative, while a Heat template

is intended to be declarative.  I do not like double description because

it adds bulk and creates additional opportunities for mistakes (compared

to single description).</font></tt>

<br>

<br><tt><font size=2>How can we avoid double-description?  A few ideas

come to mind.</font></tt>

<br>

<br><tt><font size=2>One approach involves a change in the Heat engine's

framework: allow a resource type plugin to navigate the resource graph

to look at related resources.  Suppose the implementation of a server-group

resource can navigate to and read the Heat descriptions of its members

to compute the member description list needed by my hypothesized new Nova

API.  That allows the template to continue to hold a single description,

and the nominal dependencies run in the right direction (the members are

created after the group has made its joint decision).</font></tt>
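
<br>

<br><tt><font size=2>A sketch of what that navigation could compute, assuming the template is available as a plain dict.  The "group" property, the OS::Nova::ServerGroup type name and the choice of which properties matter for placement are all assumptions for illustration; this is not Heat's plugin interface.</font></tt>

<pre>
```python
# Illustrative sketch only: the template layout and the "group" property
# are assumptions, and this is not Heat's plugin interface.
from typing import Any, Dict, List

TEMPLATE: Dict[str, Dict[str, Any]] = {
    "grp": {"type": "OS::Nova::ServerGroup",  # assumed type name
            "properties": {"policies": ["anti-affinity"]}},
    "s1": {"type": "OS::Nova::Server",
           "properties": {"flavor": "m1.small", "user_data": "#!/bin/sh",
                          "group": {"get_resource": "grp"}}},
    "s2": {"type": "OS::Nova::Server",
           "properties": {"flavor": "m1.large", "user_data": "#!/bin/sh",
                          "group": {"get_resource": "grp"}}},
    "other": {"type": "OS::Nova::Server",
              "properties": {"flavor": "m1.small"}},
}

# Properties assumed to matter for placement; user_data is left out.
PLACEMENT_KEYS = ("flavor",)


def member_descriptions(template: Dict[str, Dict[str, Any]],
                        group_name: str) -> List[Dict[str, Any]]:
    """Derive pared-down member descriptions from servers that reference the group."""
    members = []
    for name, resource in sorted(template.items()):
        props = resource.get("properties", {})
        if props.get("group") == {"get_resource": group_name}:
            desc = {"name": name}
            desc.update({k: props[k] for k in PLACEMENT_KEYS if k in props})
            members.append(desc)
    return members


descriptions = member_descriptions(TEMPLATE, "grp")
print(descriptions)
```
</pre>

<tt><font size=2>The template author writes each server once; the pared-down list is derived rather than written by hand.</font></tt>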

<br>

<br><tt><font size=2>Another approach involves a more pervasive change

to the Heat engine's framework (or maybe no change; I am not familiar with

initialization) so that there are two passes over the graph.  Keeping

dependencies in the same direction as before, it could work as follows.

 In the first pass: some representation of the server-group is created

first, and then a description of each member is associated with the group.

 In the second pass: the hypothesized new Nova API for creating the

server-group given both its policies and member descriptions is called,

then the orchestration of the group's members happens (carrying references

to the group).</font></tt>
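
<br>

<br><tt><font size=2>The two passes could look roughly like this.  The resource list, the "kind" field and the log messages are invented for the example, the joint decision is again a stand-in that assumes anti-affinity and enough hosts, and none of this is Heat engine code.</font></tt>

<pre>
```python
# Illustrative sketch only: hypothetical two-pass walk, not Heat engine code.
from typing import Any, Dict, List, Tuple

RESOURCES: List[Tuple[str, Dict[str, Any]]] = [  # already in dependency order
    ("grp", {"kind": "group", "policies": ["anti-affinity"]}),
    ("s1", {"kind": "server", "group": "grp", "flavor": "m1.small"}),
    ("s2", {"kind": "server", "group": "grp", "flavor": "m1.small"}),
]


def two_pass(resources: List[Tuple[str, Dict[str, Any]]],
             hosts: List[str]) -> List[str]:
    log: List[str] = []
    # Pass 1: build a local group record, then attach member descriptions.
    groups: Dict[str, Dict[str, Any]] = {}
    for name, res in resources:
        if res["kind"] == "group":
            groups[name] = {"policies": res["policies"], "members": []}
        else:
            groups[res["group"]]["members"].append(
                {"name": name, "flavor": res["flavor"]})
    # Pass 2: make the (hypothetical) joint scheduling call for each group,
    # then create the members, each carrying a reference to its group.
    placement: Dict[str, Dict[str, str]] = {}
    for name, res in resources:
        if res["kind"] == "group":
            members = groups[name]["members"]
            # Stand-in joint decision: one distinct host per member.
            placement[name] = {m["name"]: h for m, h in zip(members, hosts)}
            log.append(f"create group {name} with {len(members)} members")
        else:
            host = placement[res["group"]][name]
            log.append(f"create server {name} on {host}")
    return log


steps = two_pass(RESOURCES, ["h1", "h2"])
print("\n".join(steps))
```
</pre>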

<br>

<br><tt><font size=2>Another approach would be to put the holistic scheduling

entirely prior to Heat; let something else solve the holistic scheduling

problem and emit Heat templates that include or refer to scheduling decisions

that have already (by the time there is a Heat template) been made.  That

could work, if we really restrict our attention to explicit server-groups

only.  But we will also want to give server-group behavior to an autoscaling

group.  That is, allow an autoscaling group to be given a set of placement

policies to apply just like a server-group can be given policies.  Similarly,

we will want to allow the workers of a Hadoop cluster in Sahara to have

server-group behavior.  Trying to keep holistic scheduling prior to

Heat will involve abstraction-breaking (maybe it would be best approached

as abstraction factoring) so that the prior scheduler can make the decisions

required for the ASG and Sahara abstractions (and so on).  But as

an autoscaling group or Hadoop worker cluster or whatever autonomously

increases its size, the holistic scheduler should be consulted.  So

we will also have holistic scheduling that comes after Heat as well as

before.  Also consider: what does the input to the holistic scheduler

look like, and what tools will emit and otherwise process that input?  If

it looks nothing like a Heat template, then a given producer or consumer

will be dedicated to either stuff with holistic scheduling or stuff without

it.  But it is pretty natural for the input to a holistic scheduler

to look like a Heat template augmented with policies, because that's pretty

much what is needed; this could unify an otherwise split ecosystem of tools.</font></tt>

<br>

<br><tt><font size=2>What do you think?</font></tt>

<br>

<br><tt><font size=2>Thanks,</font></tt>

<br><tt><font size=2>Mike</font></tt>