[openstack-dev] [nova] Separating the issues around smarter/solver/joint/holistic placement
Mike Spreitzer
mspreitz at us.ibm.com
Wed Aug 20 05:27:51 UTC 2014
This is primarily an issue for Nova.
Mike Spreitzer/Watson/IBM at IBMUS wrote on 08/20/2014 01:21:24 AM:
> From: Mike Spreitzer/Watson/IBM at IBMUS
> To: "OpenStack Development Mailing List"
<openstack-dev at lists.openstack.org>,
> Date: 08/20/2014 01:24 AM
> Subject: [openstack-dev] Separating the issues around smarter/
> solver/joint/holistic placement
>
> There has been a lot of discussion around these issues, let me see
> if I can break it down into pieces, hopefully in a way that allows
> some progress on one of them first.
>
> I continue to focus on the timeless version of the problem, in which
> the placement question is simply where can we put some guest(s) now
> without making plans for whether/how things may change in the
> future. There is also a rich discussion to have about placement
> over time (in other communities this is what "scheduling" means),
> but that is not my goal here.
>
> I see essentially three issues: (i1) more sophisticated placement
> criteria, (i2) joint vs sequential decision making, and (i3) single-
> service vs cross service. I think we could make a roadmap for
> addressing them in this order. Do you think the issues are
> separable and attackable in this way?
>
> Objections have been raised regarding "complexity", and I would like
> to be clear about what complexity we are discussing. For each of
> these three issues, I can think of five kinds of complexity to
> discuss: (c1) the development complexity faced by us OpenStack
> developers and/or a cloud provider who wants to add such things;
> (c2) the runtime costs; (c3) the API complexity for allowing the
> issue to be addressed; (c4) the development complexity faced by
> cloud users in writing their placement problems; and (c5) the size
> of the written problems.
>
> (i1) More sophisticated placement criteria. For example, I think
> cloud providers would like to have a way to express a preference for
> consolidation, to help them conserve energy. That's a fairly rich
> topic in itself, and I hope the details need not concern us here.
> OTOH, cloud users may want to have a way to express a preference for
> spreading their guests out --- for example, spreading around within
> an availability zone. This may seem relatively boring for stateless
> web servers, but if you think about things like Zookeeper or
> Cassandra servers you see a serious preference for getting as much
> failure independence as you can get within a given AZ. Other
> examples of what cloud users may want cut across services, like
> placing VMs and storage volumes in close proximity (by some measure).
>
> Let's consider a couple of concrete examples from the record. One
> is referenced in the current revision of
> https://review.openstack.org/#/c/96543/ --- namely, http://goo.gl/vuHcz2 .
> There Yathi exhibits a binary integer programming problem for
> placing a group of guests so as to minimize the sum of (host's free
> RAM before any members of this group are placed). Minimizing that
> objective has a roughly consolidating effect in many situations,
> even though it does not exactly express consolidation (expressing
> consolidation exactly requires a non-linear objective).
> Similarly, maximizing that objective has something of a tendency to
> spread things out, even though the objective does not exactly
> express spreading. (BTW, I do not think we should assume that all
> hosts in an AZ have the same capacity vector.)
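>
> In symbols (my reading of that formulation; here f_h is host h's
> free RAM before any group member is placed, r_g is guest g's RAM
> demand, and x_{g,h} = 1 iff guest g goes on host h):
>
>     \min    \sum_{g,h} f_h x_{g,h}
>     s.t.    \sum_h x_{g,h} = 1              \forall g
>             \sum_g r_g x_{g,h} \le f_h      \forall h
>             x_{g,h} \in \{0,1\}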
>
> Another example is at
> https://wiki.openstack.org/wiki/Heat/PolicyExtension#LLMN_Anti-Collocation .
> There we see a criterion
> that is a precise spreading condition, for spreading a group of
> guests to a certain precise degree. (BTW, I tend to use
> "collocation" and "anti-collocation" when speaking of affinity
> stated in terms of location --- as opposed, for example, to network
> proximity.)
>
> For any collection of more or less sophisticated placement criteria,
> if we consider the question of placing a single guest --- supposing
> it is stipulated that some guests have already had their placement
> determined and others will be determined later --- the question
> boils down to exactly the terms that we have in Nova's
> FilterScheduler today: (a) which hosts can host this guest and (b)
> ranking the acceptable hosts. Of course, that is not the only way
> to address a problem of placing several guests --- but it is one
> valid way, even if the placement problem is stated as a mixed
> integer programming problem. Group placement problems are NP-hard,
> so solving them exactly is out of the question. The only question
> is how hard are we going to work at making a good approximation;
> this is what (i2) is about, and we will get to that below.
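>
> To make that per-guest baseline concrete, here is a minimal sketch
> of the filter-then-weigh loop (schematic only, not Nova's actual
> code; host_passes, weigh, and consume are stand-ins for the real
> filter/weigher/host-state interfaces):
>
>     # Schematic of sequential, one-guest-at-a-time placement in the
>     # style of Nova's FilterScheduler; all names are illustrative.
>     def place_sequentially(guests, hosts, filters, weighers):
>         placement = {}
>         for guest in guests:
>             # (a) which hosts can host this guest
>             acceptable = [h for h in hosts
>                           if all(f.host_passes(h, guest) for f in filters)]
>             if not acceptable:
>                 raise Exception('no valid host for %s' % guest)
>             # (b) ranking the acceptable hosts
>             best = max(acceptable,
>                        key=lambda h: sum(w.weigh(h, guest) for w in weighers))
>             placement[guest] = best
>             best.consume(guest)  # update host state for later decisions
>         return placement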
>
> Let us review the two concrete examples with an eye towards the five
> kinds of complexity. Let us start with Yathi's RAM minimization
> binary integer programming example. There is a difficulty if we try
> to work that example in today's framework. The objective is a
> function of the amount of RAM that was free on each host *before any
> group members are placed*. Today's FilterScheduler works on one
> guest at a time, updating host states as it goes along; it does not
> offer earlier host states for use in later decisions. However,
> today we can solve a slightly different problem --- which happens to
> be an even better approximation of a consolidation problem. That is
> to minimize the sum of (host's available RAM just before the guest
> is placed), rather than the sum of (host's available RAM before any
> guests in the group are placed). Today's Nova already has a weigher
> that the cloud provider can configure to do the revised sum. The
> development complexity of that weigher amounts to the code that
> evaluates the chosen weight function for a given guest on a given
> host. The runtime cost
> is the cost of doing that evaluation for every guest in the group,
> for every acceptable host, once. The additional API complexity is
> nil, since no changes are required. There are no complexities for
> cloud users in this case, since it is a cloud provider matter.
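>
> For reference, the shape of such a weigher (a sketch modeled on
> Nova's RAMWeigher in nova/scheduler/weights/ram.py; a negative
> ram_weight_multiplier turns the usual spreading into stacking,
> i.e. consolidation):
>
>     from oslo.config import cfg
>     from nova.scheduler import weights
>
>     CONF = cfg.CONF
>
>     class RAMWeigher(weights.BaseHostWeigher):
>         def weight_multiplier(self):
>             # provider-configured; the sign selects spread vs. stack
>             return CONF.ram_weight_multiplier
>
>         def _weigh_object(self, host_state, weight_properties):
>             # free RAM on this host just before the guest is placed
>             return host_state.free_ram_mb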
>
> Let us consider the LLMN anti-collocation statement applied to a
> group --- let's say a Nova server group. This could be handled in
> today's FilterScheduler by a filter that, when there is an LLMN
> anti-collocation statement in the scheduler hints, tests whether a
> proposed placement is consistent with the criterion and the
> placements already chosen for that group. This could be similar to
> the existing sort of logic we see in today's various affinity
> filters. That sort of logic evaluates the condition directly,
> without using anything cached for the group; this makes the runtime
> cost higher than necessary. These runtime costs are those of a
> triply nested loop over all group members, over all otherwise-
> acceptable hosts, and over all group members already placed --- that
> is, about 0.5*H*G^2 iterations of the innermost loop body if H is
> the average number of otherwise-acceptable hosts and G is the number
> of guests in the group. If the code framework were expanded so that
> the filter could know the decisions made by the FilterScheduler and
> cache the number of group members already placed in a given L1 zone,
> the runtime complexity could be reduced to H*G iterations of a
> different innermost loop body. The additional API complexity is nil
> if we leave the framework alone, otherwise it is the aforementioned
> expansion of the SPI for filters. The development complexity for a
> cloud user consists of picking the L1, L2, and N parameters for a
> given group, and the size of the statement is the size of those
> parameters and their binding to the group.
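>
> A sketch of the cheaper, cached variant (hypothetical code --- I am
> assuming here that the criterion caps a group at N members per
> L1-level location; group_counts is the cache described above, and
> record_decision is the hypothetical SPI expansion):
>
>     class AntiCollocationFilter(object):
>         def __init__(self, n_max):
>             self.n_max = n_max       # the user's N parameter
>             self.group_counts = {}   # (group_id, l1_zone) -> members placed
>
>         def host_passes(self, host_state, guest):
>             # O(1) per (guest, host) pair: H*G iterations overall
>             key = (guest.group_id, host_state.l1_zone)
>             return self.group_counts.get(key, 0) < self.n_max
>
>         def record_decision(self, host_state, guest):
>             # called by the expanded framework after each decision
>             key = (guest.group_id, host_state.l1_zone)
>             self.group_counts[key] = self.group_counts.get(key, 0) + 1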
>
>
> (i2) Joint (AKA simultaneous) vs. sequential decision-making. The
> change here is about changing the placement framework so that it
> can, in one invocation, choose the placement for each member of a
> general (i.e., not necessarily homogeneous) group of guests. The
> motivation for this is the observation that sequential decision
> making can paint itself into a corner, since it has no view of where
> it is going. You might think that this is a non-issue if your AZ
> has a comfortable amount of spare capacity, but that's not quite
> right --- it is individual hosts, rather than an AZ as a whole, that
> run out of capacity. Consider an example in which the cloud
> provider has a preference for consolidation --- but it is of
> secondary importance because the cloud provider wants to put the cloud
> users' criteria first. Suppose a cloud user has a group of 10
> guests that he wants collocated on the same host. Suppose the cloud
> has several hosts that are full, one that has spare capacity for 5
> such guests, and some hosts that are empty and have capacity for 20
> such guests each. Sequential decision making is going to put the
> first 5 guests on that partially full host, and then be in a tight
> spot (live migration has costs, and is disallowed in some
> situations). Joint decision-making, OTOH, will put all 10 guests on
> the same formerly empty host.
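>
> A toy rendition of that corner (illustrative numbers only; guests
> are unit-sized, collocation is the primary criterion, and the
> consolidating weigher breaks ties by picking the fullest host that
> still fits):
>
>     free = {'fullish': 5, 'empty1': 20, 'empty2': 20}
>
>     def sequential(group_size):
>         chosen = None
>         for i in range(group_size):
>             if chosen is None:
>                 # first member: collocation constrains nothing yet, so
>                 # consolidation picks the fullest host with room
>                 chosen = min((h for h in free if free[h] >= 1),
>                              key=lambda h: free[h])
>             elif free[chosen] < 1:
>                 return 'stuck at member %d: %s is full' % (i, chosen)
>             free[chosen] -= 1
>         return 'all %d on %s' % (group_size, chosen)
>
>     print(sequential(10))  # -> stuck at member 5: fullish is full
>     # Joint decision-making sees all 10 members up front, so it can
>     # reject 'fullish' outright and put the whole group on 'empty1'.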
>
> The development complexity of switching to joint decision making is
> illustrated by the existing change for that ---
> https://review.openstack.org/#/c/46588/ .
>
> The runtime costs are dependent on input patterns, the way input is
> translated into an optimization problem, and the solver used.
> Typically we use H*G binary decision variables, which puts a lower
> bound of H*G on runtime costs. Even for a linear objective, runtime
> complexity is exponential in the worst case (the problem is NP-hard,
> remember) --- but runtimes are often much lower in practice with,
> for example, the Simplex method. I hope Yathi can chime in here with
> some more detailed statement about what he has seen.
>
> My own group has been working with non-linear objectives and a very
> flexible solver based on biased sampling. Its complexity is
> O(G*(H+K)), where G and H are as above and K is the number of
> pairwise constraints (the only kind understood by this solver); the
> leading coefficient is more than one. In a not-so-recent paper (more recent
> results are better), my colleague Asser Tantawi was getting decent
> results from runs that evaluate between 10^2 and 10^3 samples; here
> a sample is a candidate joint placement (containing a candidate host
> for each guest in the group).
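>
> Schematically, that solver works like this (the general shape of
> biased sampling only --- not Asser's actual algorithm; score() is a
> stand-in for the non-linear objective plus penalties for violated
> pairwise constraints):
>
>     import random
>
>     def solve_by_sampling(guests, hosts, score, samples=1000):
>         best, best_score = None, float('-inf')
>         for _ in range(samples):
>             # one sample = a candidate joint placement: a host per guest
>             # (the real solver biases this draw toward promising hosts;
>             # a uniform draw is used here only for brevity)
>             candidate = dict((g, random.choice(hosts)) for g in guests)
>             s = score(candidate)  # roughly O(G + K) to evaluate
>             if s > best_score:
>                 best, best_score = candidate, s
>         return best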
>
> The client API complexity of switching to joint decision making is
> centered on the need to present a whole group placement problem at
> once. If we leave orchestration to the clients (I detect a strong
> preference for this in the community, although I personally think
> alternatives could be viable), this requires adding introductory
> phases in which the group members and their placement
> criteria are sufficiently described and then the joint decision is
> made. To properly cope with the fact that stuff can go wrong during
> orchestration, this would benefit from also adding a final
> confirmation phase in which the client declares that it has done all
> it is going to do about the group at hand --- and this, in turn,
> requires an orchestration leasing mechanism to make the final phase
> eventually happen even if the client aborts. The API in question
> does not have to be the Nova API. There is current work on
> separating out scheduling. But regardless of whether it is part of
> Nova or something else, there would need to be an API along these lines.
>
> Part of the API complexity is about providing some way for the
> chosen placement decisions to get from the introductory phase into
> the orchestration phase (where the guests are actually put in the
> chosen places). I can see two approaches. One is for the
> introductory phase to return placement decisions to the client, and
> the client passes them into the guest create/update operations in
> the orchestration phase. Another is based on using a unique
> identifier for a group member in both the introductory and
> orchestration phase, so that the placement decision can travel under
> the covers while the client deals with only the identifier.
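>
> In pseudo-client terms, the flow I have in mind looks like this
> (every name here is hypothetical --- this is the shape of an API,
> not a proposal for specific Nova or scheduler-service endpoints):
>
>     group = placement_api.create_group()             # introductory phase:
>     for m in members:                                # describe members and
>         group.add_member(m.id, m.flavor, m.criteria) # their criteria
>     decisions = group.decide()                       # joint decision, once
>     lease = group.lease(ttl=600)                     # orchestration lease
>
>     for m in members:                                # orchestration phase
>         nova.servers.create(m.spec,
>                             scheduler_hints={'placement': decisions[m.id]})
>         lease.renew()
>
>     group.confirm()   # final confirmation: nothing more coming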
>
> There are also SPI changes needed, so that the joint problem is
> preserved all the way from the API into the scheduler. Additionally
> this needs a way to make a joint claim on host capacity (there is a
> potential connection to Blazar here).
>
> The additional complexity for developers of clients is the
> complexity of: writing code that creates the introductory
> description of group members and their placement criteria; modifying
> the orchestration code to refer to the results of the introductory
> phases and to keep the orchestration lease alive; and writing code
> that does the final confirmation.
>
> The additional input size is the size of the needed descriptions of
> group members (i.e., what capacities they need). If this is step 2
> in a roadmap, then the placement criteria are already being
> described once (no new/different input is needed for that --- which
> is an advantage of this roadmap); depending on the details, the new
> API would take one or two copies of each placement criterion.
>
> Note that a user who is using Heat has already written such a
> description --- a Heat template already contains descriptions of
> group members and their placement criteria. A Heat user would also
> escape the development burden of using the expanded API, if Heat
> covers that (either in upstream code or in provider plugins).
>
>
> (i3) Single service vs. cross service. Issues (i1) and (i2) appear
> even when you consider only one service. Those issues get more
> intense when you want placement criteria that involve multiple
> services (e.g., Cinder and Nova). I think we already have a well
> established agreement that (i3) belongs last on the roadmap.
>
>
> Thanks,
> Mike