[openstack-dev] Separating the issues around smarter/solver/joint/holistic placement

Mike Spreitzer mspreitz at us.ibm.com
Wed Aug 20 05:21:24 UTC 2014


There has been a lot of discussion around these issues; let me see if I 
can break it down into pieces, hopefully in a way that allows some 
progress on one of them first.

I continue to focus on the timeless version of the problem, in which the 
placement question is simply where can we put some guest(s) now without 
making plans for whether/how things may change in the future.  There is 
also a rich discussion to have about placement over time (in other 
communities this is what "scheduling" means), but that is not my goal 
here.

I see essentially three issues: (i1) more sophisticated placement 
criteria, (i2) joint vs sequential decision making, and (i3) 
single-service vs cross service.  I think we could make a roadmap for 
addressing them in this order.  Do you think the issues are separable and 
attackable in this way?

Objections have been raised regarding "complexity", and I would like to be 
clear about what complexity we are discussing.  For each of these three 
issues, I can think of five kinds of complexity to discuss: the 
development complexity faced by us OpenStack developers and/or a cloud 
provider who wants to add such things; the runtime costs; the API 
complexity for allowing the issue to be addressed; the development 
complexity faced by cloud users in writing their placement problems; 
and the size of the written problems.

(i1) More sophisticated placement criteria.  For example, I think cloud 
providers would like to have a way to express a preference for 
consolidation, to help them conserve energy.  That's a fairly rich topic 
in itself, and I hope the details need not concern us here.  OTOH, cloud 
users may want to have a way to express a preference for spreading their 
guests out --- for example, spreading around within an availability zone. 
This may seem relatively boring for stateless web servers, but if you 
think about things like Zookeeper or Cassandra servers you see a serious 
preference for getting as much failure independence as you can get within 
a given AZ.  Other examples of what cloud users may want cut across 
services, like placing VMs and storage volumes in close proximity (by some 
measure).

Let's consider a couple of concrete examples from the record.  One is 
referenced in the current revision of 
https://review.openstack.org/#/c/96543/ - namely, http://goo.gl/vuHcz2 . 
There Yathi exhibits a binary integer programming problem for placing a 
group of guests so as to minimize the sum of (host's free RAM before any 
members of this group are placed).  Minimizing that objective has a 
roughly consolidating effect in many situations, even though it does not 
express consolidation exactly (expressing consolidation exactly requires 
a non-linear objective).  Similarly, maximizing that objective has 
something of a tendency to spread things out, even though it does not 
express spreading exactly.  (BTW, I do not think we should assume that 
all hosts in an AZ have the same capacity vector.)
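
To make the shape of Yathi's problem concrete, here is a sketch in 
Python using the PuLP library.  The data values are invented, and the 
formulation is my reading of the example rather than a copy of it:

    # x[g][h] = 1 iff guest g is placed on host h; the objective sums,
    # over all placed guests, the pre-placement free RAM of the chosen host.
    import pulp

    free_ram = {"h1": 512, "h2": 2048, "h3": 4096}  # hypothetical free RAM (MB) before placement
    guest_ram = {"g1": 256, "g2": 256, "g3": 512}   # hypothetical RAM demands (MB)
    guests, hosts = list(guest_ram), list(free_ram)

    prob = pulp.LpProblem("consolidate", pulp.LpMinimize)
    x = pulp.LpVariable.dicts("x", (guests, hosts), cat="Binary")

    # Objective: each placed guest contributes its host's pre-placement free RAM.
    prob += pulp.lpSum(free_ram[h] * x[g][h] for g in guests for h in hosts)

    for g in guests:  # every guest is placed exactly once
        prob += pulp.lpSum(x[g][h] for h in hosts) == 1
    for h in hosts:   # host capacity is not exceeded
        prob += pulp.lpSum(guest_ram[g] * x[g][h] for g in guests) <= free_ram[h]

    prob.solve()
    print({g: next(h for h in hosts if x[g][h].value() == 1) for g in guests})
    # guests pile onto the lowest-free-RAM hosts that can fit them

Note that there are H*G binary variables here, a point I will return to 
under (i2).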

Another example is at 
https://wiki.openstack.org/wiki/Heat/PolicyExtension#LLMN_Anti-Collocation 
.  There we see a criterion that is a precise spreading condition: it 
spreads a group of guests to a stated degree.  (BTW, I tend to use 
"collocation" and "anti-collocation" when speaking of affinity stated 
in terms of location --- as opposed, for example, to network proximity.)

For any collection of more or less sophisticated placement criteria, if we 
consider the question of placing a single guest --- supposing it is 
stipulated that some guests have already had their placement determined 
and others will be determined later --- the question boils down to 
exactly the terms that we have in Nova's FilterScheduler today: (a) which 
hosts can host this guest and (b) ranking the acceptable hosts.  Of
course, that is not the only way to address a problem of placing several 
guests --- but it is one valid way, even if the placement problem is 
stated as a mixed integer programming problem.  Group placement problems 
are NP-hard, so solving them exactly is out of the question.  The only 
question is how hard we are going to work at making a good approximation; 
this is what (i2) is about, and we will get to that below.
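
For concreteness, here is a minimal sketch of that per-guest 
decomposition; the function and parameter names are mine, not Nova's 
actual plugin classes:

    # (a) filters prune the hosts that can host this guest;
    # (b) a weigher ranks the acceptable hosts and the best one wins.
    def place_one(guest, hosts, host_passes, weigh):
        acceptable = [h for h in hosts if host_passes(guest, h)]  # (a)
        if not acceptable:
            raise RuntimeError("no valid host found")
        return max(acceptable, key=lambda h: weigh(guest, h))     # (b)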

Let us review the two concrete examples with an eye towards the five kinds 
of complexity.  Let us start with Yathi's RAM minimization binary integer 
programming example.  There is a difficulty if we try to work that example 
in today's framework.  The objective is a function of the amount of RAM 
that was free on each host *before any group members are placed*.  Today's 
FilterScheduler works on one guest at a time, updating host states as it 
goes along; it does not offer earlier host states for use in later 
decisions.  However, today we can solve a slightly different problem --- 
which happens to be an even better approximation of a consolidation 
problem.  That is to minimize the sum of (host's available RAM just before 
the guest is placed), rather than the sum of (host's available RAM before 
any guests in the group are placed).  Today's Nova already has a weigher 
that the cloud provider can configure to do the revised sum.  The 
development complexity of that weigher amounts to code that evaluates the 
chosen weight function for a given guest on a given host.  The runtime 
cost is that of doing the evaluation once for every guest in the group, 
for every acceptable host.  The additional API complexity is nil, since 
no changes are required.  There are no complexities for cloud users in 
this case, since it is a cloud provider matter.
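
Here is a sketch of the weigher logic just described.  The names are 
hypothetical (this is not Nova's exact plugin API), but it shows how 
weighing free RAM just before each placement, with a negative 
multiplier, realizes the revised sum:

    # With multiplier < 0, fuller hosts score higher, which consolidates.
    def ram_weight(host, multiplier=-1.0):
        return multiplier * host["free_ram_mb"]

    def place_group_sequentially(guests, hosts):
        for guest in guests:
            fits = [h for h in hosts if h["free_ram_mb"] >= guest["ram_mb"]]
            best = max(fits, key=ram_weight)        # fullest acceptable host wins
            best["free_ram_mb"] -= guest["ram_mb"]  # host state updated for later decisions
            yield guest["name"], best["name"]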

Let us consider the LLMN anti-collocation statement applied to a group --- 
let's say a Nova server group.  This could be handled in today's 
FilterScheduler by a filter that, when there is an LLMN anti-collocation 
statement in the scheduler hints, tests whether a proposed placement is 
consistent with the criterion and the placements already chosen for that 
group.  This could be similar to the existing sort of logic we see in 
today's various affinity filters.  That sort of logic evaluates the 
condition directly, without using anything cached for the group; this 
makes the runtime cost higher than necessary.  These runtime costs are 
those of a triply nested loop over all group members, over all 
otherwise-acceptable hosts, and over all group members already placed --- 
that is, about 0.5*H*G^2 iterations of the innermost loop body if H is the 
average number of otherwise-acceptable hosts and G is the number of guests 
in the group.  If the code framework were expanded so that the filter 
could know the decisions made by the FilterScheduler and cache the number 
of group members already placed in a given L1 zone, the runtime complexity 
could be reduced to H*G iterations of a different innermost loop body. 
The additional API complexity is nil if we leave the framework alone; 
otherwise it is the aforementioned expansion of the SPI for filters.  The
development complexity for a cloud user consists of picking the L1, L2, 
and N parameters for a given group, and the size of the statement is the 
size of those parameters and their binding to the group.
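
Here is a sketch of the cached variant.  The names are hypothetical, 
and I am assuming a simplified reading of the LLMN criterion in which 
no more than N group members may land in any single L1 zone:

    from collections import defaultdict

    class CachedAntiCollocationFilter:
        def __init__(self, n_max):
            self.n_max = n_max
            self.members_in_zone = defaultdict(int)  # L1 zone -> members already placed

        def host_passes(self, host_l1_zone):
            # O(1) per (guest, host) pair, hence H*G iterations overall.
            return self.members_in_zone[host_l1_zone] < self.n_max

        def note_decision(self, host_l1_zone):
            # The SPI expansion: the scheduler reports its choice so the
            # cache stays current.
            self.members_in_zone[host_l1_zone] += 1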


(i2) Joint (AKA simultaneous) vs. sequential decision-making.  The change 
here is to the placement framework: let it, in one invocation, choose the 
placement for each member of a general (i.e., not necessarily 
homogeneous) group of guests.  The motivation for this is the
observation that sequential decision making can paint itself into a 
corner, since it has no view of where it is going.  You might think that 
this is a non-issue if your AZ has a comfortable amount of spare capacity, 
but that's not quite right --- it is individual hosts, rather than an AZ 
as a whole, that run out of capacity.  Consider an example in which the 
cloud provider has a preference for consolidation --- but it is of 
secondary importance, because the cloud provider wants to put the cloud 
users' criteria first.  Suppose a cloud user has a group of 10 guests that he
wants collocated on the same host.  Suppose the cloud has several hosts 
that are full, one that has spare capacity for 5 such guests, and some 
hosts that are empty and have capacity for 20 such guests each. Sequential 
decision making is going to put the first 5 guests on that partially full 
host, and then be in a tight spot (live migration has costs, and is 
disallowed in some situations).  Joint decision-making, OTOH, will put all 
10 guests on the same formerly empty host.
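
A toy run of that example, with the sequential decisions driven by a 
consolidation preference (the numbers are the ones from the story 
above):

    capacity = {"partial": 5, "empty1": 20, "empty2": 20}  # room, in guests
    placed = {}
    for member in range(10):
        fits = [h for h, room in capacity.items() if room >= 1]
        host = min(fits, key=lambda h: capacity[h])  # consolidate: least room first
        capacity[host] -= 1
        placed[host] = placed.get(host, 0) + 1
    print(placed)  # {'partial': 5, 'empty1': 5}: the same-host preference is lost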

The development complexity of switching to joint decision making is 
illustrated by the existing change for that --- 
https://review.openstack.org/#/c/46588/ .

The runtime costs are dependent on input patterns, the way input is 
translated into an optimization problem, and the solver used.  Typically 
we use H*G binary decision variables, which puts a lower bound of H*G on 
runtime costs.  Even for a linear objective, runtime complexity is 
exponential in the worst case (the problem is NP-hard, remember) --- but 
runtimes are often much lower in practice with, for example, the Simplex 
method.  I hope Yathi can chime in here with a more detailed statement
about what he has seen.

My own group has been working with non-linear objectives and a very 
flexible solver based on biased sampling.  Its complexity is O(G*(H+K)) 
where G and H are as above and K is the number of pairwise constraints 
(the only kind understood by this solver); the leading coefficient is more 
than one.  In a not-so-recent paper (more recent results are better), my 
colleague Asser Tantawi was getting decent results from runs that evaluate 
between 10^2 and 10^3 samples; here a sample is a candidate joint 
placement (containing a candidate host for each guest in the group).
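
For illustration only, here is a rough sketch of sampling-based search 
over joint placements.  This is my own simplification, not Asser's 
actual algorithm, and it omits the biasing of the sampling distribution 
that gives the real solver its name:

    import random

    def sample_solve(guests, hosts, score, samples=500):
        # Each sample is a candidate joint placement: one host per guest.
        # score() evaluates the objective plus penalties for violated
        # pairwise constraints; lower is better.
        best, best_score = None, float("inf")
        for _ in range(samples):
            candidate = {g: random.choice(hosts) for g in guests}
            s = score(candidate)
            if s < best_score:
                best, best_score = candidate, s
        return best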

The client API complexity of switching to joint decision making is 
centered on the need to present a whole group placement problem at once. 
If we leave orchestration to the clients (I detect a strong preference for 
this in the community, although I personally think alternatives could be 
viable), this requires adding introductory phases in which the group 
members and their placement criteria are fully described and then 
the joint decision is made.  To properly cope with the fact that stuff can
go wrong during orchestration, this would benefit from also adding a final 
confirmation phase in which the client declares that it has done all it is 
going to do about the group at hand --- and this, in turn, requires an 
orchestration leasing mechanism to make the final phase eventually happen 
even if the client aborts.  The API in question does not have to be the 
Nova API.  There is current work on separating out scheduling.  But 
regardless of whether it is part of Nova or something else, there would 
need to be an API along these lines.
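
To make this concrete, here is a hypothetical shape for such a flow. 
Every name is invented for illustration; nothing like this exists in 
Nova, or in the separated scheduler work, today:

    def place_group(placement_api, nova_api, members, criteria):
        group = placement_api.describe_group(members, criteria)  # introductory phase
        lease = placement_api.decide(group)                      # joint decision, with a lease
        for member_id, host in lease.placements.items():
            nova_api.create_guest(member_id, host)               # orchestration phase
            placement_api.renew(lease)                           # keep the lease alive
        placement_api.confirm(group)                             # final confirmation phase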

Part of the API complexity is about providing some way for the chosen 
placement decisions to get from the introductory phase into the 
orchestration phase (where the guests are actually put in the chosen 
places).  I can see two approaches.  One is for the introductory phase to 
return placement decisions to the client, which then passes them into 
the guest create/update operations in the orchestration phase.  Another is
based on using a unique identifier for a group member in both the 
introductory and orchestration phase, so that the placement decision can 
travel under the covers while the client deals with only the identifier.

There are also SPI changes needed, so that the joint problem is preserved 
all the way from the API into the scheduler.  Additionally, this needs a 
way to make a joint claim on host capacity (there is a potential 
connection to Blazar here).

The additional complexity for developers of clients is the complexity of: 
writing code that creates the introductory description of group members 
and their placement criteria; modifying the orchestration code to refer to 
the results of the introductory phases and to keep the orchestration lease 
alive; and writing code that does the final confirmation.

The additional input size is the size of the needed descriptions of group 
members (i.e., what capacities they need).  If this is step 2 in a 
roadmap, then the placement criteria are already being described once (no 
new/different input is needed for that --- which is an advantage of this 
roadmap); depending on details, the new API would take one or two copies 
of each placement criterion.

Note that a user who is using Heat has already written such a description 
--- a Heat template already contains descriptions of group members and 
their placement criteria.  A Heat user would also escape the development
burden of using the expanded API, if Heat covers that (either in upstream 
code or in provider plugins).


(i3) Single service vs. cross service.  Issues (i1) and (i2) appear even 
when you consider only one service.  Those issues get more intense when 
you want placement criteria that involve multiple services (e.g., Cinder 
and Nova).  I think we already have a well established agreement that (i3) 
belongs last on the roadmap.


Thanks,
Mike

