<font size=2 face="sans-serif">This is primarily an issue for Nova.</font>

<br>

<br><tt><font size=2>Mike Spreitzer/Watson/IBM@IBMUS wrote on 08/20/2014

01:21:24 AM:<br>

<br>

> From: Mike Spreitzer/Watson/IBM@IBMUS</font></tt>

<br><tt><font size=2>> To: "OpenStack Development Mailing List"

<openstack-dev@lists.openstack.org>, </font></tt>

<br><tt><font size=2>> Date: 08/20/2014 01:24 AM</font></tt>

<br><tt><font size=2>> Subject: [openstack-dev] Separating the issues

around smarter/<br>

> solver/joint/holistic placement</font></tt>

<br><tt><font size=2>> <br>

> There has been a lot of discussion around these issues, let me see

<br>

> if I can break it down into pieces, hopefully in a way that allows

<br>

> some progress on one of them first. <br>

> <br>

> I continue to focus on the timeless version of the problem, in which<br>

> the placement question is simply where can we put some guest(s) now

<br>

> without making plans for whether/how things may change in the <br>

> future.  There is also a rich discussion to have about placement

<br>

> over time (in other communities this is what "scheduling"

means), <br>

> but that is not my goal here. <br>

> <br>

> I see essentially three issues: (i1) more sophisticated placement

<br>

> criteria, (i2) joint vs sequential decision making, and (i3) single-<br>

> service vs cross service.  I think we could make a roadmap for

<br>

> addressing them in this order.  Do you think the issues are <br>

> separable and attackable in this way? <br>

> <br>

> Objections have been raised regarding "complexity", and

I would like<br>

> to be clear about what complexity we are discussing.  For each

of <br>

> these three issues, I can think of five kinds of complexity to <br>

> discuss: the development complexity faced by us OpenStack developers<br>

> and/or a cloud provider who wants to add such things, the runtime

<br>

> costs, the API complexity for allowing the issue to be addressed,

<br>

> the development complexity faced by the cloud users in writing their<br>

> placement problems, and the size of the written problems. <br>

> <br>

> (i1) More sophisticated placement criteria.  For example, I think

<br>

> cloud providers would like to have a way to express a preference for<br>

> consolidation, to help them conserve energy.  That's a fairly

rich <br>

> topic in itself, and I hope the details need not concern us here.

 <br>

> OTOH, cloud users may want to have a way to express a preference for<br>

> spreading their guests out --- for example, spreading around within

<br>

> an availability zone.  This may seem relatively boring for stateless<br>

> web servers, but if you think about things like Zookeeper or <br>

> Cassandra servers you see a serious preference for getting as much

<br>

> failure independence as you can get within a given AZ.  Other

<br>

> examples of what cloud users may want cross services, like placing

<br>

> VMs and storage volumes in close proximity (by some measure). <br>

> <br>

> Let's consider a couple of concrete examples from the record.  One

<br>

> is referenced in the current revision of https://<br>

> review.openstack.org/#/c/96543/ - namely, </font></tt><a href=http://goo.gl/vuHcz2><tt><font size=2>http://goo.gl/vuHcz2</font></tt></a><tt><font size=2>

.  <br>

> There Yathi exhibits a binary integer programming problem for <br>

> placing a group of guests so as to minimize the sum of (host's free

<br>

> RAM before any members of this group are placed).  Minimizing

that <br>

> objective has a roughly consolidating effect in many situations, <br>

> even though it does not directly express exactly consolidation <br>

> (exactly expressing consolidation requires a non-linear objective).

<br>

> Similarly, maximizing that objective has something of a tendency to

<br>

> spread things out, even though the objective does not exactly <br>

> express spreading.  (BTW, I do not think we should assume that

all <br>

> hosts in an AZ have the same capacity vector.) <br>

> <br>

> Another example is at </font></tt><a href=https://wiki.openstack.org/wiki/Heat/><tt><font size=2>https://wiki.openstack.org/wiki/Heat/</font></tt></a><tt><font size=2><br>

> PolicyExtension#LLMN_Anti-Collocation .  There we see a criterion

<br>

> that is a precise spreading condition, for spreading a group of <br>

> guests to a certain precise degree.  (BTW, I tend to use <br>

> "collocation" and "anti-collocation" when speaking

of affinity <br>

> stated in terms of location --- as opposed, for example, to network

<br>

> proximity.) <br>

> <br>

> For any collection of more or less sophisticated placement criteria,<br>

> if we consider the question of placing a single guest --- supposing

<br>

> it is stipulated that some guests have already had their placement

<br>

> determined and others will be determined later --- the question <br>

> boils down to the exactly the terms that we have in Nova's <br>

> FilterScheduler today: (a) which hosts can host this guest and (b)

<br>

> ranking the acceptable hosts.  Of course, that is not the only

way <br>

> to address a problem of placing several guests --- but it is one <br>

> valid way, even if the placement problem is stated as a mixed <br>

> integer programming problem.  Group placement problems are NP

hard, <br>

> so solving them exactly is out of the question.  The only question

<br>

> is how hard are we going to work at making a good approximation; <br>

> this is what (i2) is about, and we will get to that below. <br>

> <br>

> Let us review the two concrete examples with an eye towards the five<br>

> kinds of complexity.  Let us start with Yathi's RAM minimization

<br>

> binary integer programming example.  There is a difficulty if

we try<br>

> to work that example in today's framework.  The objective is

a <br>

> function of the amount of RAM that was free on each host *before any<br>

> group members are placed*.  Today's FilterScheduler works on

one <br>

> guest at a time, updating host states as it goes along; it does not

<br>

> offer earlier host states for use in later decisions.  However,

<br>

> today we can solve a slightly different problem --- which happens

to<br>

> be an even better approximation of a consolidation problem.  That

is<br>

> to minimize the sum of (host's available RAM just before the guest

<br>

> is placed), rather than the sum of (host's available RAM before any

<br>

> guests in the group are placed).  Today's Nova already has a

weigher<br>

> that the cloud provider can configure to do the revised sum.  The

<br>

> code complexity of that weigher is code that evaluates the chosen

<br>

> weight function for a given guest on a given host.  The runtime

cost<br>

> is the cost of doing that evaluation for every guest in the group,

<br>

> for every acceptable host, once.  The additional API complexity

is <br>

> nil, since no changes are required.  There are no complexities

for <br>

> cloud users in this case, since it is a cloud provider matter. <br>

> <br>

> Let us consider the LLMN anti-collocation statement applied to a <br>

> group --- let's say a Nova server group.  This could be handled

in <br>

> today's FilterScheduler by a filter that, when there is an LLMN <br>

> anti-collocation statement in the scheduler hints, tests whether a

<br>

> proposed placement is consistent with the criterion and the <br>

> placements already chosen for that group.  This could be similar

to <br>

> the existing sort of logic we see in today's various affinity <br>

> filters.  That sort of logic evaluates the condition directly,

<br>

> without using anything cached for the group; this makes the runtime

<br>

> cost higher than necessary.  These runtime costs are those of

a <br>

> triply nested loop over all group members, over all otherwise-<br>

> acceptable hosts, and over all group members already placed --- that<br>

> is, about 0.5*H*G^2 iterations of the innermost loop body if H is

<br>

> the average number of otherwise-acceptable hosts and G is the number<br>

> of guests in the group.  If the code framework were expanded

so that<br>

> the filter could know the decisions made by the FilterScheduler and

<br>

> cache the number of group members already placed in a given L1 zone,<br>

> the runtime complexity could be reduced to H*G iterations of a <br>

> different innermost loop body.  The additional API complexity

is nil<br>

> if we leave the framework alone, otherwise it is the aforementioned

<br>

> expansion of the SPI for filters.  The development complexity

for a <br>

> cloud user consists of picking the L1, L2, and N parameters for a

<br>

> given group, and the size of the statement is the size of those <br>

> parameters and their binding to the group. <br>

> <br>

> <br>

> (i2) Joint (AKA simultaneous) vs. sequential decision-making.  The

<br>

> change here is about changing the placement framework so that it <br>

> can, in one invocation, choose the placement for each member of a

<br>

> general (i.e., not necessarily homogenous) group of guests.  The

<br>

> motivation for this is the observation that sequential decision <br>

> making can paint itself into a corner, since it has no view of where<br>

> it is going.  You might think that this is a non-issue if your

AZ <br>

> has a comfortable amount of spare capacity, but that's not quite <br>

> right --- it is individual hosts, rather than an AZ as a whole, that<br>

> run out of capacity.  Consider an example in which the cloud

<br>

> provider has a preference for consolidation --- but it is of <br>

> secondary importance, the cloud provider wants to put the cloud <br>

> users' criteria first.  Suppose a cloud user has a group of 10

<br>

> guests that he wants collocated on the same host.  Suppose the

cloud<br>

> has several hosts that are full, one that has spare capacity for 5

<br>

> such guests, and some hosts that are empty and have capacity for 20

<br>

> such guests each.  Sequential decision making is going to put

the <br>

> first 5 guests on that partially full host, and then be in a tight

<br>

> spot (live migration has costs, and is disallowed in some <br>

> situations).  Joint decision-making, OTOH, will put all 10 guests

on<br>

> the same formerly empty host. <br>

> <br>

> The development complexity of switching to joint decision making is

<br>

> illustrated by the existing change for that --- https://<br>

> review.openstack.org/#/c/46588/ . <br>

> <br>

> The runtime costs are dependent on input patterns, the way input is

<br>

> translated into an optimization problem, and the solver used.  <br>

> Typically we use H*G binary decision variables, which puts a lower

<br>

> bound of H*G on runtime costs.  Even for a linear objective,

runtime<br>

> complexity is exponential in the worst case (the problem is NP hard,<br>

> remember) --- but runtime complexity is often much lower in, for <br>

> example, the Simplex method.  I hope Yathi can chime in here

with <br>

> some more detailed statement about what he has seen. <br>

> <br>

> My own group has been working with non-linear objectives and a very

<br>

> flexible solver based on biased sampling.  Its complexity is

O(G*(H<br>

> +K)) where G and H are as above and K is the number of pairwise <br>

> constraints (the only kind understood by this solver); the leading

<br>

> coefficient is more than one.  In a not-so recent paper (more

recent<br>

> results are better), my colleague Asser Tantawi was getting decent

<br>

> results from runs that evaluate between 10^2 and 10^3 samples; here

<br>

> a sample is a candidate joint placement (containing a candidate host<br>

> for each guest in the group). <br>

> <br>

> The client API complexity of switching to joint decision making is

<br>

> centered on the need to present a whole group placement problem at

<br>

> once.  If we leave orchestration to the clients (I detect a strong

<br>

> preference for this in the community, although I personally think

<br>

> alternatives could be viable), this requires introducing <br>

> introductory phases in which the group members and their placement

<br>

> criteria are sufficiently described and then the joint decision is

<br>

> made.  To properly cope with the fact that stuff can go wrong

during<br>

> orchestration, this would benefit from also adding a final <br>

> confirmation phase in which the client declares that it has done all<br>

> it is going to do about the group at hand --- and this, in turn, <br>

> requires an orchestration leasing mechanism to make the final phase

<br>

> eventually happen even if the client aborts.  The API in question

<br>

> does not have to be the Nova API.  There is current work on <br>

> separating out scheduling.  But regardless of whether it is part

of <br>

> Nova or something else, there would need to be an API along these

lines. <br>

> <br>

> Part of the API complexity is about providing some way for the <br>

> chosen placement decisions to get from the introductory phase into

<br>

> the orchestration phase (where the guests are actually put in the

<br>

> chosen places).  I can see two approaches.  One is for the

<br>

> introductory phase to return placement decisions to the client, and

<br>

> the client pass to them into the guest create/update operations in

<br>

> the orchestration phase.  Another is based on using a unique

<br>

> identifier for a group member in both the introductory and <br>

> orchestration phase, so that the placement decision can travel under<br>

> the covers while the client deals with only the identifier. <br>

> <br>

> There are also SPI changes needed, so that the joint problem is <br>

> preserved all the way from the API into the scheduler.  Additionally<br>

> this needs a way to make a joint claim on host capacity (there is

a <br>

> potential connection to Blazar here). <br>

> <br>

> The additional complexity for developers of clients is the <br>

> complexity of: writing code that creates the introductory <br>

> description of group members and their placement criteria; modifying<br>

> the orchestration code to refer to the results of the introductory

<br>

> phases and to keep the orchestration lease alive; and writing code

<br>

> that does the final confirmation. <br>

> <br>

> The additional input size is the size of the needed descriptions of

<br>

> group members (i.e., what capacities they need).  If this is

step 2 <br>

> in a roadmap then the placement criteria are already being described<br>

> (no new/different input is needed for that --- which is an advantage<br>

> of this roadmap) once; depending on details the new API would take

<br>

> one or two copies of each placement criterion. <br>

> <br>

> Note that a user who is using Heat has already written such a <br>

> description --- a heat template already contains descriptions of <br>

> group members and their placement criteria.  A Heat user would

also <br>

> escape the development burden of using the expanded API, if Heat <br>

> covers that (either in upstream code or in provider plugins). <br>

> <br>

> <br>

> (i3) Single service vs. cross service.  Issues (i1) and (i2)

appear <br>

> even when you consider only one service.  Those issues get more

<br>

> intense when you want placement criteria that involve multiple <br>

> services (e.g., Cinder and Nova).  I think we already have a

well <br>

> established agreement that (i3) belongs last on the roadmap. <br>

> <br>

> <br>

> Thanks,<br>

> Mike_______________________________________________<br>

> OpenStack-dev mailing list<br>

> OpenStack-dev@lists.openstack.org<br>

> </font></tt><a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev"><tt><font size=2>http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</font></tt></a><tt><font size=2><br>

</font></tt>