<font size=2 face="sans-serif">There has been a lot of discussion around

these issues, let me see if I can break it down into pieces, hopefully

in a way that allows some progress on one of them first.</font>

<br>

<br><font size=2 face="sans-serif">I continue to focus on the timeless

version of the problem, in which the placement question is simply where

can we put some guest(s) now without making plans for whether/how things

may change in the future.  There is also a rich discussion to have

about placement over time (in other communities this is what "scheduling"

means), but that is not my goal here.</font>

<br>

<br><font size=2 face="sans-serif">I see essentially three issues: (i1)

more sophisticated placement criteria, (i2) joint vs sequential decision

making, and (i3) single-service vs cross service.  I think we could

make a roadmap for addressing them in this order.  Do you think the

issues are separable and attackable in this way?</font>

<br>

<br><font size=2 face="sans-serif">Objections have been raised regarding

"complexity", and I would like to be clear about what complexity

we are discussing.  For each of these three issues, I can think of

five kinds of complexity to discuss: the development complexity faced by

us OpenStack developers and/or a cloud provider who wants to add such things,

the runtime costs, the API complexity for allowing the issue to be addressed,

the development complexity faced by the cloud users in writing their placement

problems, and the size of the written problems.</font>

<br>

<br><font size=2 face="sans-serif">(i1) More sophisticated placement criteria.

 For example, I think cloud providers would like to have a way to

express a preference for consolidation, to help them conserve energy.  That's

a fairly rich topic in itself, and I hope the details need not concern

us here.  OTOH, cloud users may want to have a way to express a preference

for spreading their guests out --- for example, spreading around within

an availability zone.  This may seem relatively boring for stateless

web servers, but if you think about things like Zookeeper or Cassandra

servers you see a serious preference for getting as much failure independence

as you can get within a given AZ.  Other examples of what cloud users

may want cross services, like placing VMs and storage volumes in close

proximity (by some measure).</font>

<br>

<br><font size=2 face="sans-serif">Let's consider a couple of concrete

examples from the record.  One is referenced in the current revision

of </font><a href=https://review.openstack.org/#/c/96543/><font size=2 face="sans-serif">https://review.openstack.org/#/c/96543/</font></a><font size=2 face="sans-serif">

- namely, </font><a href=http://goo.gl/vuHcz2><font size=2 face="sans-serif">http://goo.gl/vuHcz2</font></a><font size=2 face="sans-serif">

.  There Yathi exhibits a binary integer programming problem for placing

a group of guests so as to minimize the sum of (host's free RAM before

any members of this group are placed).  Minimizing that objective

has a roughly consolidating effect in many situations, even though it does

not directly express exactly consolidation (exactly expressing consolidation

requires a non-linear objective).  Similarly, maximizing that objective

has something of a tendency to spread things out, even though the objective

does not exactly express spreading.  (BTW, I do not think we should

assume that all hosts in an AZ have the same capacity vector.)</font>

<br>

<br><font size=2 face="sans-serif">Another example is at </font><a href="https://wiki.openstack.org/wiki/Heat/PolicyExtension#LLMN_Anti-Collocation"><font size=2 face="sans-serif">https://wiki.openstack.org/wiki/Heat/PolicyExtension#LLMN_Anti-Collocation</font></a><font size=2 face="sans-serif">

.  There we see a criterion that is a precise spreading condition,

for spreading a group of guests to a certain precise degree.  (BTW,

I tend to use "collocation" and "anti-collocation"

when speaking of affinity stated in terms of location --- as opposed, for

example, to network proximity.)</font>

<br>

<br><font size=2 face="sans-serif">For any collection of more or less sophisticated

placement criteria, if we consider the question of placing a single guest

--- supposing it is stipulated that some guests have already had their

placement determined and others will be determined later --- the question

boils down to the exactly the terms that we have in Nova's FilterScheduler

today: (a) which hosts can host this guest and (b) ranking the acceptable

hosts.  Of course, that is not the only way to address a problem of

placing several guests --- but it is one valid way, even if the placement

problem is stated as a mixed integer programming problem.  Group placement

problems are NP hard, so solving them exactly is out of the question.  The

only question is how hard are we going to work at making a good approximation;

this is what (i2) is about, and we will get to that below.</font>

<br>

<br><font size=2 face="sans-serif">Let us review the two concrete examples

with an eye towards the five kinds of complexity.  Let us start with

Yathi's RAM minimization binary integer programming example.  There

is a difficulty if we try to work that example in today's framework.  The

objective is a function of the amount of RAM that was free on each host

*before any group members are placed*.  Today's FilterScheduler works

on one guest at a time, updating host states as it goes along; it does

not offer earlier host states for use in later decisions.  However,

today we can solve a slightly different problem --- which happens to be

an even better approximation of a consolidation problem.  That is

to minimize the sum of (host's available RAM just before the guest is placed),

rather than the sum of (host's available RAM before any guests in the group

are placed).  Today's Nova already has a weigher that the cloud provider

can configure to do the revised sum.  The code complexity of that

weigher is code that evaluates the chosen weight function for a given guest

on a given host.  The runtime cost is the cost of doing that evaluation

for every guest in the group, for every acceptable host, once.  The

additional API complexity is nil, since no changes are required.  There

are no complexities for cloud users in this case, since it is a cloud provider

matter.</font>

<br>

<br><font size=2 face="sans-serif">Let us consider the LLMN anti-collocation

statement applied to a group --- let's say a Nova server group.  This

could be handled in today's FilterScheduler by a filter that, when there

is an LLMN anti-collocation statement in the scheduler hints, tests whether

a proposed placement is consistent with the criterion and the placements

already chosen for that group.  This could be similar to the existing

sort of logic we see in today's various affinity filters.  That sort

of logic evaluates the condition directly, without using anything cached

for the group; this makes the runtime cost higher than necessary.  These

runtime costs are those of a triply nested loop over all group members,

over all otherwise-acceptable hosts, and over all group members already

placed --- that is, about 0.5*H*G^2 iterations of the innermost loop body

if H is the average number of otherwise-acceptable hosts and G is the number

of guests in the group.  If the code framework were expanded so that

the filter could know the decisions made by the FilterScheduler and cache

the number of group members already placed in a given L1 zone, the runtime

complexity could be reduced to H*G iterations of a different innermost

loop body.  The additional API complexity is nil if we leave the framework

alone, otherwise it is the aforementioned expansion of the SPI for filters.

 The development complexity for a cloud user consists of picking the

L1, L2, and N parameters for a given group, and the size of the statement

is the size of those parameters and their binding to the group.</font>

<br>

<br>

<br><font size=2 face="sans-serif">(i2) Joint (AKA simultaneous) vs. sequential

decision-making.  The change here is about changing the placement

framework so that it can, in one invocation, choose the placement for each

member of a general (i.e., not necessarily homogenous) group of guests.

 The motivation for this is the observation that sequential decision

making can paint itself into a corner, since it has no view of where it

is going.  You might think that this is a non-issue if your AZ has

a comfortable amount of spare capacity, but that's not quite right ---

it is individual hosts, rather than an AZ as a whole, that run out of capacity.

 Consider an example in which the cloud provider has a preference

for consolidation --- but it is of secondary importance, the cloud provider

wants to put the cloud users' criteria first.  Suppose a cloud user

has a group of 10 guests that he wants collocated on the same host.  Suppose

the cloud has several hosts that are full, one that has spare capacity

for 5 such guests, and some hosts that are empty and have capacity for

20 such guests each.  Sequential decision making is going to put the

first 5 guests on that partially full host, and then be in a tight spot

(live migration has costs, and is disallowed in some situations).  Joint

decision-making, OTOH, will put all 10 guests on the same formerly empty

host.</font>

<br>

<br><font size=2 face="sans-serif">The development complexity of switching

to joint decision making is illustrated by the existing change for that

--- </font><a href=https://review.openstack.org/#/c/46588/><font size=2 face="sans-serif">https://review.openstack.org/#/c/46588/</font></a><font size=2 face="sans-serif">

.</font>

<br>

<br><font size=2 face="sans-serif">The runtime costs are dependent on input

patterns, the way input is translated into an optimization problem, and

the solver used.  Typically we use H*G binary decision variables,

which puts a lower bound of H*G on runtime costs.  Even for a linear

objective, runtime complexity is exponential in the worst case (the problem

is NP hard, remember) --- but runtime complexity is often much lower in,

for example, the Simplex method.  I hope Yathi can chime in here with

some more detailed statement about what he has seen.</font>

<br>

<br><font size=2 face="sans-serif">My own group has been working with non-linear

objectives and a very flexible solver based on biased sampling.  Its

complexity is O(G*(H+K)) where G and H are as above and K is the number

of pairwise constraints (the only kind understood by this solver); the

leading coefficient is more than one.  In a not-so recent paper (more

recent results are better), my colleague Asser Tantawi was getting decent

results from runs that evaluate between 10^2 and 10^3 samples; here a sample

is a candidate joint placement (containing a candidate host for each guest

in the group).</font>

<br>

<br><font size=2 face="sans-serif">The client API complexity of switching

to joint decision making is centered on the need to present a whole group

placement problem at once.  If we leave orchestration to the clients

(I detect a strong preference for this in the community, although I personally

think alternatives could be viable), this requires introducing introductory

phases in which the group members and their placement criteria are sufficiently

described and then the joint decision is made.  To properly cope with

the fact that stuff can go wrong during orchestration, this would benefit

from also adding a final confirmation phase in which the client declares

that it has done all it is going to do about the group at hand --- and

this, in turn, requires an orchestration leasing mechanism to make the

final phase eventually happen even if the client aborts.  The API

in question does not have to be the Nova API.  There is current work

on separating out scheduling.  But regardless of whether it is part

of Nova or something else, there would need to be an API along these lines.</font>

<br>

<br><font size=2 face="sans-serif">Part of the API complexity is about

providing some way for the chosen placement decisions to get from the introductory

phase into the orchestration phase (where the guests are actually put in

the chosen places).  I can see two approaches.  One is for the

introductory phase to return placement decisions to the client, and the

client pass to them into the guest create/update operations in the orchestration

phase.  Another is based on using a unique identifier for a group

member in both the introductory and orchestration phase, so that the placement

decision can travel under the covers while the client deals with only the

identifier.</font>

<br>

<br><font size=2 face="sans-serif">There are also SPI changes needed, so

that the joint problem is preserved all the way from the API into the scheduler.

 Additionally this needs a way to make a joint claim on host capacity

(there is a potential connection to Blazar here).</font>

<br>

<br><font size=2 face="sans-serif">The additional complexity for developers

of clients is the complexity of: writing code that creates the introductory

description of group members and their placement criteria; modifying the

orchestration code to refer to the results of the introductory phases and

to keep the orchestration lease alive; and writing code that does the final

confirmation.</font>

<br>

<br><font size=2 face="sans-serif">The additional input size is the size

of the needed descriptions of group members (i.e., what capacities they

need).  If this is step 2 in a roadmap then the placement criteria

are already being described (no new/different input is needed for that

--- which is an advantage of this roadmap) once; depending on details the

new API would take one or two copies of each placement criterion.</font>

<br>

<br><font size=2 face="sans-serif">Note that a user who is using Heat has

already written such a description --- a heat template already contains

descriptions of group members and their placement criteria.  A Heat

user would also escape the development burden of using the expanded API,

if Heat covers that (either in upstream code or in provider plugins).</font>

<br>

<br>

<br><font size=2 face="sans-serif">(i3) Single service vs. cross service.

 Issues (i1) and (i2) appear even when you consider only one service.

 Those issues get more intense when you want placement criteria that

involve multiple services (e.g., Cinder and Nova).  I think we already

have a well established agreement that (i3) belongs last on the roadmap.</font>

<br>

<br>

<br><font size=2 face="sans-serif">Thanks,<br>

Mike</font>