[openstack-dev] [nova] Separating the issues around smarter/solver/joint/holistic placement

Mike Spreitzer mspreitz at us.ibm.com
Wed Aug 20 05:27:51 UTC 2014


This is primarily an issue for Nova.

Mike Spreitzer/Watson/IBM at IBMUS wrote on 08/20/2014 01:21:24 AM:

> From: Mike Spreitzer/Watson/IBM at IBMUS
> To: "OpenStack Development Mailing List" 
<openstack-dev at lists.openstack.org>, 
> Date: 08/20/2014 01:24 AM
> Subject: [openstack-dev] Separating the issues around smarter/
> solver/joint/holistic placement
> 
> There has been a lot of discussion around these issues, let me see 
> if I can break it down into pieces, hopefully in a way that allows 
> some progress on one of them first. 
> 
> I continue to focus on the timeless version of the problem, in which
> the placement question is simply where can we put some guest(s) now 
> without making plans for whether/how things may change in the 
> future.  There is also a rich discussion to have about placement 
> over time (in other communities this is what "scheduling" means), 
> but that is not my goal here. 
> 
> I see essentially three issues: (i1) more sophisticated placement 
> criteria, (i2) joint vs sequential decision making, and (i3) single-
> service vs cross service.  I think we could make a roadmap for 
> addressing them in this order.  Do you think the issues are 
> separable and attackable in this way? 
> 
> Objections have been raised regarding "complexity", and I would like
> to be clear about what complexity we are discussing.  For each of 
> these three issues, I can think of five kinds of complexity to 
> discuss: the development complexity faced by us OpenStack developers
> and/or a cloud provider who wants to add such things, the runtime 
> costs, the API complexity for allowing the issue to be addressed, 
> the development complexity faced by the cloud users in writing their
> placement problems, and the size of the written problems. 
> 
> (i1) More sophisticated placement criteria.  For example, I think 
> cloud providers would like to have a way to express a preference for
> consolidation, to help them conserve energy.  That's a fairly rich 
> topic in itself, and I hope the details need not concern us here. 
> OTOH, cloud users may want to have a way to express a preference for
> spreading their guests out --- for example, spreading around within 
> an availability zone.  This may seem relatively boring for stateless
> web servers, but if you think about things like Zookeeper or 
> Cassandra servers you see a serious preference for getting as much 
> failure independence as you can get within a given AZ.  Other 
> examples of what cloud users may want cross services, like placing 
> VMs and storage volumes in close proximity (by some measure). 
> 
> Let's consider a couple of concrete examples from the record.  One 
> is referenced in the current revision of https://
> review.openstack.org/#/c/96543/ - namely, http://goo.gl/vuHcz2 . 
> There Yathi exhibits a binary integer programming problem for 
> placing a group of guests so as to minimize the sum of (host's free 
> RAM before any members of this group are placed).  Minimizing that 
> objective has a roughly consolidating effect in many situations, 
> even though it does not directly express exactly consolidation 
> (exactly expressing consolidation requires a non-linear objective). 
> Similarly, maximizing that objective has something of a tendency to 
> spread things out, even though the objective does not exactly 
> express spreading.  (BTW, I do not think we should assume that all 
> hosts in an AZ have the same capacity vector.) 
> 
> Another example is at https://wiki.openstack.org/wiki/Heat/
> PolicyExtension#LLMN_Anti-Collocation .  There we see a criterion 
> that is a precise spreading condition, for spreading a group of 
> guests to a certain precise degree.  (BTW, I tend to use 
> "collocation" and "anti-collocation" when speaking of affinity 
> stated in terms of location --- as opposed, for example, to network 
> proximity.) 
> 
> For any collection of more or less sophisticated placement criteria,
> if we consider the question of placing a single guest --- supposing 
> it is stipulated that some guests have already had their placement 
> determined and others will be determined later --- the question 
> boils down to the exactly the terms that we have in Nova's 
> FilterScheduler today: (a) which hosts can host this guest and (b) 
> ranking the acceptable hosts.  Of course, that is not the only way 
> to address a problem of placing several guests --- but it is one 
> valid way, even if the placement problem is stated as a mixed 
> integer programming problem.  Group placement problems are NP hard, 
> so solving them exactly is out of the question.  The only question 
> is how hard are we going to work at making a good approximation; 
> this is what (i2) is about, and we will get to that below. 
> 
> Let us review the two concrete examples with an eye towards the five
> kinds of complexity.  Let us start with Yathi's RAM minimization 
> binary integer programming example.  There is a difficulty if we try
> to work that example in today's framework.  The objective is a 
> function of the amount of RAM that was free on each host *before any
> group members are placed*.  Today's FilterScheduler works on one 
> guest at a time, updating host states as it goes along; it does not 
> offer earlier host states for use in later decisions.  However, 
> today we can solve a slightly different problem --- which happens to
> be an even better approximation of a consolidation problem.  That is
> to minimize the sum of (host's available RAM just before the guest 
> is placed), rather than the sum of (host's available RAM before any 
> guests in the group are placed).  Today's Nova already has a weigher
> that the cloud provider can configure to do the revised sum.  The 
> code complexity of that weigher is code that evaluates the chosen 
> weight function for a given guest on a given host.  The runtime cost
> is the cost of doing that evaluation for every guest in the group, 
> for every acceptable host, once.  The additional API complexity is 
> nil, since no changes are required.  There are no complexities for 
> cloud users in this case, since it is a cloud provider matter. 
> 
> Let us consider the LLMN anti-collocation statement applied to a 
> group --- let's say a Nova server group.  This could be handled in 
> today's FilterScheduler by a filter that, when there is an LLMN 
> anti-collocation statement in the scheduler hints, tests whether a 
> proposed placement is consistent with the criterion and the 
> placements already chosen for that group.  This could be similar to 
> the existing sort of logic we see in today's various affinity 
> filters.  That sort of logic evaluates the condition directly, 
> without using anything cached for the group; this makes the runtime 
> cost higher than necessary.  These runtime costs are those of a 
> triply nested loop over all group members, over all otherwise-
> acceptable hosts, and over all group members already placed --- that
> is, about 0.5*H*G^2 iterations of the innermost loop body if H is 
> the average number of otherwise-acceptable hosts and G is the number
> of guests in the group.  If the code framework were expanded so that
> the filter could know the decisions made by the FilterScheduler and 
> cache the number of group members already placed in a given L1 zone,
> the runtime complexity could be reduced to H*G iterations of a 
> different innermost loop body.  The additional API complexity is nil
> if we leave the framework alone, otherwise it is the aforementioned 
> expansion of the SPI for filters.  The development complexity for a 
> cloud user consists of picking the L1, L2, and N parameters for a 
> given group, and the size of the statement is the size of those 
> parameters and their binding to the group. 
> 
> 
> (i2) Joint (AKA simultaneous) vs. sequential decision-making.  The 
> change here is about changing the placement framework so that it 
> can, in one invocation, choose the placement for each member of a 
> general (i.e., not necessarily homogenous) group of guests.  The 
> motivation for this is the observation that sequential decision 
> making can paint itself into a corner, since it has no view of where
> it is going.  You might think that this is a non-issue if your AZ 
> has a comfortable amount of spare capacity, but that's not quite 
> right --- it is individual hosts, rather than an AZ as a whole, that
> run out of capacity.  Consider an example in which the cloud 
> provider has a preference for consolidation --- but it is of 
> secondary importance, the cloud provider wants to put the cloud 
> users' criteria first.  Suppose a cloud user has a group of 10 
> guests that he wants collocated on the same host.  Suppose the cloud
> has several hosts that are full, one that has spare capacity for 5 
> such guests, and some hosts that are empty and have capacity for 20 
> such guests each.  Sequential decision making is going to put the 
> first 5 guests on that partially full host, and then be in a tight 
> spot (live migration has costs, and is disallowed in some 
> situations).  Joint decision-making, OTOH, will put all 10 guests on
> the same formerly empty host. 
> 
> The development complexity of switching to joint decision making is 
> illustrated by the existing change for that --- https://
> review.openstack.org/#/c/46588/ . 
> 
> The runtime costs are dependent on input patterns, the way input is 
> translated into an optimization problem, and the solver used. 
> Typically we use H*G binary decision variables, which puts a lower 
> bound of H*G on runtime costs.  Even for a linear objective, runtime
> complexity is exponential in the worst case (the problem is NP hard,
> remember) --- but runtime complexity is often much lower in, for 
> example, the Simplex method.  I hope Yathi can chime in here with 
> some more detailed statement about what he has seen. 
> 
> My own group has been working with non-linear objectives and a very 
> flexible solver based on biased sampling.  Its complexity is O(G*(H
> +K)) where G and H are as above and K is the number of pairwise 
> constraints (the only kind understood by this solver); the leading 
> coefficient is more than one.  In a not-so recent paper (more recent
> results are better), my colleague Asser Tantawi was getting decent 
> results from runs that evaluate between 10^2 and 10^3 samples; here 
> a sample is a candidate joint placement (containing a candidate host
> for each guest in the group). 
> 
> The client API complexity of switching to joint decision making is 
> centered on the need to present a whole group placement problem at 
> once.  If we leave orchestration to the clients (I detect a strong 
> preference for this in the community, although I personally think 
> alternatives could be viable), this requires introducing 
> introductory phases in which the group members and their placement 
> criteria are sufficiently described and then the joint decision is 
> made.  To properly cope with the fact that stuff can go wrong during
> orchestration, this would benefit from also adding a final 
> confirmation phase in which the client declares that it has done all
> it is going to do about the group at hand --- and this, in turn, 
> requires an orchestration leasing mechanism to make the final phase 
> eventually happen even if the client aborts.  The API in question 
> does not have to be the Nova API.  There is current work on 
> separating out scheduling.  But regardless of whether it is part of 
> Nova or something else, there would need to be an API along these lines. 

> 
> Part of the API complexity is about providing some way for the 
> chosen placement decisions to get from the introductory phase into 
> the orchestration phase (where the guests are actually put in the 
> chosen places).  I can see two approaches.  One is for the 
> introductory phase to return placement decisions to the client, and 
> the client pass to them into the guest create/update operations in 
> the orchestration phase.  Another is based on using a unique 
> identifier for a group member in both the introductory and 
> orchestration phase, so that the placement decision can travel under
> the covers while the client deals with only the identifier. 
> 
> There are also SPI changes needed, so that the joint problem is 
> preserved all the way from the API into the scheduler.  Additionally
> this needs a way to make a joint claim on host capacity (there is a 
> potential connection to Blazar here). 
> 
> The additional complexity for developers of clients is the 
> complexity of: writing code that creates the introductory 
> description of group members and their placement criteria; modifying
> the orchestration code to refer to the results of the introductory 
> phases and to keep the orchestration lease alive; and writing code 
> that does the final confirmation. 
> 
> The additional input size is the size of the needed descriptions of 
> group members (i.e., what capacities they need).  If this is step 2 
> in a roadmap then the placement criteria are already being described
> (no new/different input is needed for that --- which is an advantage
> of this roadmap) once; depending on details the new API would take 
> one or two copies of each placement criterion. 
> 
> Note that a user who is using Heat has already written such a 
> description --- a heat template already contains descriptions of 
> group members and their placement criteria.  A Heat user would also 
> escape the development burden of using the expanded API, if Heat 
> covers that (either in upstream code or in provider plugins). 
> 
> 
> (i3) Single service vs. cross service.  Issues (i1) and (i2) appear 
> even when you consider only one service.  Those issues get more 
> intense when you want placement criteria that involve multiple 
> services (e.g., Cinder and Nova).  I think we already have a well 
> established agreement that (i3) belongs last on the roadmap. 
> 
> 
> Thanks,
> Mike_______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openstack.org/pipermail/openstack-dev/attachments/20140820/f0a517c8/attachment.html>


More information about the OpenStack-dev mailing list