[openstack-dev] Separating the issues around smarter/solver/joint/holistic placement
Mike Spreitzer
mspreitz at us.ibm.com
Wed Aug 20 05:21:24 UTC 2014
There has been a lot of discussion around these issues; let me see if I
can break it down into pieces, hopefully in a way that allows some
progress on one of them first.
I continue to focus on the timeless version of the problem, in which the
placement question is simply where can we put some guest(s) now without
making plans for whether/how things may change in the future. There is
also a rich discussion to have about placement over time (in other
communities this is what "scheduling" means), but that is not my goal
here.
I see essentially three issues: (i1) more sophisticated placement
criteria, (i2) joint vs sequential decision making, and (i3)
single-service vs cross service. I think we could make a roadmap for
addressing them in this order. Do you think the issues are separable and
attackable in this way?
Objections have been raised regarding "complexity", and I would like to be
clear about what complexity we are discussing. For each of these three
issues, I can think of five kinds of complexity to discuss:
(c1) the development complexity faced by us OpenStack developers and/or a
cloud provider who wants to add such things;
(c2) the runtime costs;
(c3) the API complexity for allowing the issue to be addressed;
(c4) the development complexity faced by the cloud users in writing their
placement problems; and
(c5) the size of the written problems.
(i1) More sophisticated placement criteria. For example, I think cloud
providers would like to have a way to express a preference for
consolidation, to help them conserve energy. That's a fairly rich topic
in itself, and I hope the details need not concern us here. OTOH, cloud
users may want to have a way to express a preference for spreading their
guests out --- for example, spreading around within an availability zone.
This may seem relatively boring for stateless web servers, but if you
think about things like Zookeeper or Cassandra servers you see a serious
preference for getting as much failure independence as you can get within
a given AZ. Other examples of what cloud users may want cut across
services, like placing VMs and storage volumes in close proximity (by some
measure).
Let's consider a couple of concrete examples from the record. One is
referenced in the current revision of
https://review.openstack.org/#/c/96543/ - namely, http://goo.gl/vuHcz2 .
There Yathi exhibits a binary integer programming problem for placing a
group of guests so as to minimize the sum of (host's free RAM before any
members of this group are placed). Minimizing that objective has a
roughly consolidating effect in many situations, even though it does not
directly express exactly consolidation (exactly expressing consolidation
requires a non-linear objective). Similarly, maximizing that objective
has something of a tendency to spread things out, even though the
objective does not exactly express spreading. (BTW, I do not think we
should assume that all hosts in an AZ have the same capacity vector.)
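To make that concrete, here is one plausible rendering of such a binary
integer program, in my own notation (a sketch; the formulation in Yathi's
document may differ in detail). Let x_{g,h} = 1 iff guest g goes on host
h, f_h = host h's free RAM before any group member is placed, and r_g =
guest g's RAM demand:

    \begin{align*}
    \min_{x} \quad & \sum_{g}\sum_{h} f_h\, x_{g,h} \\
    \text{s.t.} \quad & \sum_{h} x_{g,h} = 1 \quad \forall g
        \quad \text{(each guest placed exactly once)} \\
    & \sum_{g} r_g\, x_{g,h} \le f_h \quad \forall h
        \quad \text{(host RAM capacity)} \\
    & x_{g,h} \in \{0,1\}.
    \end{align*}

Since each guest contributes the f_h of its chosen host to the objective,
minimizing it steers guests toward already-full hosts, which is the
roughly consolidating effect described above.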
Another example is at
https://wiki.openstack.org/wiki/Heat/PolicyExtension#LLMN_Anti-Collocation
. There we see a criterion that is a precise spreading condition, for
spreading a group of guests to a certain precise degree. (BTW, I tend to
use "collocation" and "anti-collocation" when speaking of affinity stated
in terms of location --- as opposed, for example, to network proximity.)
For any collection of more or less sophisticated placement criteria, if we
consider the question of placing a single guest --- supposing it is
stipulated that some guests have already had their placement determined
and others will be determined later --- the question boils down to
exactly the terms that we have in Nova's FilterScheduler today: (a) which
hosts can host this guest and (b) ranking the acceptable hosts. Of
course, that is not the only way to address a problem of placing several
guests --- but it is one valid way, even if the placement problem is
stated as a mixed integer programming problem. Group placement problems
are NP-hard, so solving them exactly is out of the question. The only
question is how hard we are going to work at making a good approximation;
this is what (i2) is about, and we will get to that below.
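In schematic Python, the per-guest shape --- a sketch of the essential
(a)-then-(b) structure, not Nova's actual code --- looks like this:

    class NoValidHost(Exception):
        pass

    def place_one(guest, hosts, filters, weighers):
        # (a) which hosts can host this guest
        acceptable = [h for h in hosts
                      if all(f.host_passes(h, guest) for f in filters)]
        if not acceptable:
            raise NoValidHost(guest)
        # (b) rank the acceptable hosts; highest total weight wins
        best = max(acceptable,
                   key=lambda h: sum(w.weigh(h, guest) for w in weighers))
        best.consume(guest)  # update host state for later decisions
        return best

Running this once per guest, in some order, is exactly the sequential
decision making whose limits (i2) is about.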
Let us review the two concrete examples with an eye towards the five kinds
of complexity. Let us start with Yathi's RAM minimization binary integer
programming example. There is a difficulty if we try to work that example
in today's framework. The objective is a function of the amount of RAM
that was free on each host *before any group members are placed*. Today's
FilterScheduler works on one guest at a time, updating host states as it
goes along; it does not offer earlier host states for use in later
decisions. However, today we can solve a slightly different problem ---
which happens to be an even better approximation of a consolidation
problem. That is to minimize the sum of (host's available RAM just before
the guest is placed), rather than the sum of (host's available RAM before
any guests in the group are placed). Today's Nova already has a weigher
that the cloud provider can configure to do the revised sum. The
development complexity of that weigher amounts to code that evaluates the
chosen weight function for a given guest on a given host. The runtime cost
is that of doing the evaluation once for every guest in the group, for
every acceptable host. The additional API complexity is nil, since no
changes are required. There are no complexities for cloud users in this
case, since it is a cloud provider matter.
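For reference, a schematic version of that weigher (modeled on Nova's RAM
weigher; names simplified, not exact):

    RAM_WEIGHT_MULTIPLIER = -1.0  # negative => prefer fuller hosts (stack)

    class RAMWeigher(object):
        def weight_multiplier(self):
            # the cloud provider flips the sign to choose spreading
            # (positive) vs. consolidation (negative)
            return RAM_WEIGHT_MULTIPLIER

        def _weigh_object(self, host_state, weight_properties):
            # free_ram_mb reflects guests already placed earlier in the
            # request, which is what yields the revised sum described above
            return host_state.free_ram_mb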
Let us consider the LLMN anti-collocation statement applied to a group ---
let's say a Nova server group. This could be handled in today's
FilterScheduler by a filter that, when there is an LLMN anti-collocation
statement in the scheduler hints, tests whether a proposed placement is
consistent with the criterion and the placements already chosen for that
group. This could be similar to the existing sort of logic we see in
today's various affinity filters. That sort of logic evaluates the
condition directly, without using anything cached for the group; this
makes the runtime cost higher than necessary. These runtime costs are
those of a triply nested loop over all group members, over all
otherwise-acceptable hosts, and over all group members already placed ---
that is, about 0.5*H*G^2 iterations of the innermost loop body if H is the
average number of otherwise-acceptable hosts and G is the number of guests
in the group. If the code framework were expanded so that the filter
could know the decisions made by the FilterScheduler and cache the number
of group members already placed in a given L1 zone, the runtime complexity
could be reduced to H*G iterations of a different innermost loop body. The
additional API complexity is nil if we leave the framework alone,
otherwise it is the aforementioned expansion of the SPI for filters. The
development complexity for a cloud user consists of picking the L1, L2,
and N parameters for a given group, and the size of the statement is the
size of those parameters and their binding to the group.
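A sketch of the cached variant (hypothetical names, and with the semantics
simplified for illustration to "at most N group members per L1 zone"):

    from collections import defaultdict

    class L1AntiCollocationFilter(object):
        def __init__(self, max_per_zone):
            self.max_per_zone = max_per_zone   # the N parameter
            self.placed = defaultdict(int)     # L1 zone -> members placed

        def host_passes(self, host):
            # one dict lookup per candidate host: H*G innermost
            # iterations over the whole group instead of ~0.5*H*G^2
            return self.placed[host.l1_zone] < self.max_per_zone

        def note_placement(self, host):
            # the framework would call this after each decision; this
            # callback is the SPI expansion mentioned above
            self.placed[host.l1_zone] += 1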
(i2) Joint (AKA simultaneous) vs. sequential decision-making. The change
here is about changing the placement framework so that it can, in one
invocation, choose the placement for each member of a general (i.e., not
necessarily homogeneous) group of guests. The motivation for this is the
observation that sequential decision making can paint itself into a
corner, since it has no view of where it is going. You might think that
this is a non-issue if your AZ has a comfortable amount of spare capacity,
but that's not quite right --- it is individual hosts, rather than an AZ
as a whole, that run out of capacity. Consider an example in which the
cloud provider has a preference for consolidation --- but it is of
secondary importance, the cloud provider wants to put the cloud users'
criteria first. Suppose a cloud user has a group of 10 guests that he
wants collocated on the same host. Suppose the cloud has several hosts
that are full, one that has spare capacity for 5 such guests, and some
hosts that are empty and have capacity for 20 such guests each. Sequential
decision making is going to put the first 5 guests on that partially full
host, and then be in a tight spot (live migration has costs, and is
disallowed in some situations). Joint decision-making, OTOH, will put all
10 guests on the same formerly empty host.
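To make the arithmetic concrete, here is a toy rendering of that example
(hypothetical code; all 10 guests must share one host):

    hosts = {"partial": 5, "empty1": 20, "empty2": 20}  # spare capacity

    def sequential_collocated(hosts, group_size=10):
        # consolidation preference: the first guest goes to the fullest
        # host that fits; collocation then pins the rest to that host
        chosen = min((h for h, c in hosts.items() if c >= 1),
                     key=lambda h: hosts[h])            # -> "partial"
        placed = 0
        while placed < group_size and hosts[chosen] >= 1:
            hosts[chosen] -= 1
            placed += 1
        return chosen, placed                           # ("partial", 5)

Sequential placement strands the group at 5 of 10; a joint decision sees
the whole group at once and picks an empty host with capacity 20,
satisfying the user's collocation criterion first and the provider's
consolidation preference second.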
The development complexity of switching to joint decision making is
illustrated by the existing change for that ---
https://review.openstack.org/#/c/46588/ .
The runtime costs are dependent on input patterns, the way input is
translated into an optimization problem, and the solver used. Typically
we use H*G binary decision variables, which puts a lower bound of H*G on
runtime costs. Even for a linear objective, runtime complexity is
exponential in the worst case (the problem is NP-hard, remember) --- but
runtime complexity is often much lower in, for example, the Simplex
method. I hope Yathi can chime in here with some more detailed statement
about what he has seen.
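For a feel of the shape of such a formulation, here is a minimal sketch
using the PuLP library and toy numbers of my own (not Yathi's code or
data):

    from pulp import LpProblem, LpMinimize, LpVariable, lpSum

    guests = ["g1", "g2"]
    hosts = ["h1", "h2", "h3"]
    free_ram = {"h1": 2048, "h2": 8192, "h3": 4096}  # before placement
    demand = {"g1": 1024, "g2": 2048}

    prob = LpProblem("placement", LpMinimize)
    x = LpVariable.dicts("x", (guests, hosts), cat="Binary")  # H*G vars

    # objective: each guest contributes its host's pre-placement free RAM
    prob += lpSum(free_ram[h] * x[g][h] for g in guests for h in hosts)
    for g in guests:  # each guest placed exactly once
        prob += lpSum(x[g][h] for h in hosts) == 1
    for h in hosts:   # respect RAM capacity
        prob += lpSum(demand[g] * x[g][h] for g in guests) <= free_ram[h]

    prob.solve()  # branch-and-cut over the LP relaxation by default

The H*G binary variables are explicit in the x dictionary, and the
solver's worst-case behavior is exponential even though typical instances
are far cheaper.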
My own group has been working with non-linear objectives and a very
flexible solver based on biased sampling. Its complexity is O(G*(H+K))
where G and H are as above and K is the number of pairwise constraints
(the only kind understood by this solver); the leading coefficient is more
than one. In a not-so-recent paper (more recent results are better), my
colleague Asser Tantawi was getting decent results from runs that evaluate
between 10^2 and 10^3 samples; here a sample is a candidate joint
placement (containing a candidate host for each guest in the group).
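A deliberately naive sketch of the idea (uniform rather than biased
sampling; the actual solver's biasing scheme is not reproduced here):

    import random

    def sample_search(guests, hosts, score, samples=500):
        # draw candidate joint placements, keep the best of ~10^2..10^3
        best, best_score = None, float("inf")
        for _ in range(samples):
            candidate = {g: random.choice(hosts) for g in guests}
            s = score(candidate)  # nonlinear objective plus penalties
            if s < best_score:
                best, best_score = candidate, s
        return best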
The client API complexity of switching to joint decision making is
centered on the need to present a whole group placement problem at once.
If we leave orchestration to the clients (I detect a strong preference for
this in the community, although I personally think alternatives could be
viable), this requires new introductory phases in which the group
members and their placement criteria are sufficiently described and then
the joint decision is made. To properly cope with the fact that stuff can
go wrong during orchestration, this would benefit from also adding a final
confirmation phase in which the client declares that it has done all it is
going to do about the group at hand --- and this, in turn, requires an
orchestration leasing mechanism to make the final phase eventually happen
even if the client aborts. The API in question does not have to be the
Nova API. There is current work on separating out scheduling. But
regardless of whether it is part of Nova or something else, there would
need to be an API along these lines.
Part of the API complexity is about providing some way for the chosen
placement decisions to get from the introductory phase into the
orchestration phase (where the guests are actually put in the chosen
places). I can see two approaches. One is for the introductory phase to
return placement decisions to the client, which then passes them into
the guest create/update operations in the orchestration phase. Another is
based on using a unique identifier for a group member in both the
introductory and orchestration phase, so that the placement decision can
travel under the covers while the client deals with only the identifier.
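To make the shape of such a flow concrete, here is a purely hypothetical
client sketch (none of these endpoints or names exist today; it happens
to use the first of the two approaches just described):

    # introductory phase: describe the group, get a joint decision
    problem = scheduler.create_placement_problem(
        members=[{"id": "db1", "ram_mb": 4096},
                 {"id": "db2", "ram_mb": 4096}],
        criteria=[{"type": "anti-collocation", "level": "rack", "max": 1}])
    decisions = problem.solve()
    lease = problem.lease(ttl_seconds=600)  # hold claims while orchestrating

    # orchestration phase: pass each decision into the create call
    for member in decisions:
        nova.create_server(member.id, hints={"placement": member.decision})
        lease.renew()

    # final phase: declare the group done, releasing the lease
    problem.confirm()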
There are also SPI changes needed, so that the joint problem is preserved
all the way from the API into the scheduler. Additionally this needs a
way to make a joint claim on host capacity (there is a potential
connection to Blazar here).
The additional complexity for developers of clients is the complexity of:
writing code that creates the introductory description of group members
and their placement criteria; modifying the orchestration code to refer to
the results of the introductory phases and to keep the orchestration lease
alive; and writing code that does the final confirmation.
The additional input size is the size of the needed descriptions of group
members (i.e., what capacities they need). If this is step 2 in a roadmap,
then the placement criteria are already being described once (no
new/different input is needed for that --- which is an advantage of this
roadmap); depending on details, the new API would take one or two copies
of each placement criterion.
Note that a user who is using Heat has already written such a description
--- a Heat template already contains descriptions of group members and
their placement criteria. A Heat user would also escape the development
burden of using the expanded API, if Heat covers that (either in upstream
code or in provider plugins).
(i3) Single service vs. cross service. Issues (i1) and (i2) appear even
when you consider only one service. Those issues get more intense when
you want placement criteria that involve multiple services (e.g., Cinder
and Nova). I think we already have a well established agreement that (i3)
belongs last on the roadmap.
Thanks,
Mike