[openstack-dev] [scheduler] [heat] Policy specifics

Mike Spreitzer mspreitz at us.ibm.com
Fri Sep 27 18:51:20 UTC 2013


Clint Byrum <clint at fewbar.com> wrote on 09/27/2013 11:58:16 AM:

> From: Clint Byrum <clint at fewbar.com>
> To: openstack-dev <openstack-dev at lists.openstack.org>, 
> Date: 09/27/2013 12:01 PM
> Subject: Re: [openstack-dev] [scheduler] [heat] Policy specifics
> 
...
> > Mike,
> > These are not the kinds of specifics that are of any help at all in 
> > figuring out how (or, indeed, whether) to incorporate holistic 
> > scheduling into OpenStack.
> 
> I agree that the things in that page are a wet dream of logical deployment
> fun. However, I think one can target just a few of the basic ones,
> and see a real achievable case forming. I think I grasp Mike's ideas,
> so I'll respond to your concerns with what I think. Note that it is
> highly likely I've gotten some of this wrong.

It remains to be seen whether those things can be anything more than a wet 
dream for OpenStack, but they are running code elsewhere, so I have hope. 
What I wrote is pretty much a dump of what we have.  The exception is the 
network bandwidth stuff, which our holistic infrastructure scheduler 
currently ignores because we do not have a way to get the relevant 
capacity information from the physical infrastructure.  Part of the agenda 
here is to nudge Neutron to improve in that way.

> > - What would a holistic scheduling service look like? A standalone 
> > service? Part of heat-engine?
> 
> I see it as a preprocessor of sorts for the current infrastructure engine.
> It would take the logical expression of the cluster and either turn
> it into actual deployment instructions or respond to the user that it
> cannot succeed. Ideally it would just extend the same Heat API.

My own expectation is that it would be its own service, preceding 
infrastructure orchestration in the flow.  Alternatively, we could bundle 
holistic infrastructure scheduling, infrastructure orchestration, and 
software orchestration preparation together under one API but still 
maintained as fairly separate modules of functionality.  Or various ideas 
in between.  I do not yet have a strong reason for one choice over 
another.  I have been looking to gain cluefulness from discussion with you 
folks.
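
To sketch the flow I have in mind for the first option (the stage names 
are just shorthand for what I described above, and the ordering of 
software orchestration preparation reflects my current thinking):

  template --> software orchestration preparation --> holistic 
  infrastructure scheduling --> infrastructure orchestration (today's 
  heat engine)

The bundled alternative would put the last three stages behind a single 
API while keeping them as fairly separate modules inside.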

> > - How will the scheduling service reserve slots for resources in advance
> > of them being created? How will those reservations be accounted for and
> > billed?
> > - In the event that slots are reserved but those reservations are not 
> > taken up, what will happen?
> 
> I don't see the word "reserve" in Mike's proposal, and I don't think this
> is necessary for the more basic models like Collocation and Anti-Collocation.
> 
> Reservations would of course make the scheduling decisions more likely to
> succeed, but it isn't necessary if we do things optimistically. If the
> stack create or update fails, we can retry with better parameters.

The raw truth of the matter is that even Nova has this problem already. 
The real ground truth of resource usage is in the hypervisor, not Nova. 
When Nova makes a decision, it really is provisional until confirmed by 
the hypervisor.  I have heard of cases, in different cloud software, where 
the thing making the placement decisions does not have a truly accurate 
picture of the resource usage.  These are typically caused by corner cases 
in failure scenarios, where the decision maker thinks something did not 
happen or was successfully deleted but in reality there is a zombie left 
over consuming some resources in the hypervisor.  I am guessing there are 
cases where this can happen in OpenStack too.  Also, OpenStack 
does not prevent someone from going around Nova and directly asking a 
hypervisor to do something.

> > - Once scheduled, how will resources be created in their proper slots as
> > part of a Heat template?
> 
> In goes a Heat template (sorry for not using HOT.. still learning it. ;)
> 
> Resources:
>   ServerTemplate:
>     Type: Some::Defined::ProviderType
>   HAThing1:
>     Type: OS::Heat::HACluster
>     Properties:
>       ClusterSize: 3
>       MaxPerAZ: 1
>       PlacementStrategy: anti-collocation
>       Resources: [ ServerTemplate ]
> 
> And if we have at least 2 AZ's available, it feeds to the heat engine:
> 
> Resources:
>   HAThing1-0:
>     Type: Some::Defined::ProviderType
>     Parameters:
>       availability-zone: zone-A
>   HAThing1-1:
>     Type: Some::Defined::ProviderType
>     Parameters:
>       availability-zone: zone-B
>   HAThing1-2:
>     Type: Some::Defined::ProviderType
>     Parameters:
>       availability-zone: zone-A
> 
> If not, holistic scheduler says back "I don't have enough AZ's to
> satisfy MaxPerAZ".

Actually, I was thinking something even simpler (in the simple cases :-). 
By simple cases I mean where the holistic infrastructure scheduler makes 
all the placement decisions.  In that case, it only needs to get Nova to 
implement the decisions already made.  So the API call or template 
fragment for a VM instance would include an AZ parameter that specifies 
the particular host already chosen for that VM instance.  Similarly for 
Cinder, except that its handling of AZ has been broken.  But I hear that 
is or will be fixed.  In the meantime it is possible to abuse volume types 
to get this job done.
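
To make the Nova side concrete in the style of your example (the host 
names are made up, and I am assuming the provider type simply passes its 
availability-zone value through to Nova, which as I understand it accepts 
a "zone:host" form for forcing a particular host), the scheduler's output 
might look like:

Resources:
  HAThing1-0:
    Type: Some::Defined::ProviderType
    Parameters:
      availability-zone: nova:compute-host-07
  HAThing1-1:
    Type: Some::Defined::ProviderType
    Parameters:
      availability-zone: nova:compute-host-12
  HAThing1-2:
    Type: Some::Defined::ProviderType
    Parameters:
      availability-zone: nova:compute-host-21

By the time today's heat engine sees this, there is nothing left for 
Nova's scheduler to decide.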

> Now, if Nova grows anti-affinity under the covers that it can manage
> directly, a later version can just spit out:
> 
> Resources:
>   HAThing1-0:
>     Type: Some::Defined::ProviderType
>     Parameters:
>       instance-group: 0
>       affinity-type: anti
>   HAThing1-1:
>     Type: Some::Defined::ProviderType
>     Parameters:
>       instance-group: 1
>       affinity-type: anti
>   HAThing1-2:
>     Type: Some::Defined::ProviderType
>     Parameters:
>       instance-group: 0
>       affinity-type: anti

Yes, if there are no strong interactions between the placement of certain 
VMs and other non-Nova placement decisions, then the placement 
decision-making for those VMs can be deferred to Nova (provided Nova has a 
rich enough interface).  I do not follow exactly your thinking in your 
example, but I think we are agreed on the principle.

> The point is that the user cares about their servers not being in the
> same failure domain, not how that happens.

Right.  As you point out later, that is kind of the big picture here.

> > - What about when the user calls the APIs directly? (i.e. does their own
> > orchestration - either hand-rolled or using their own standalone Heat.)
> 
> This has come up with autoscaling too. "Undefined" - that's not your stack.

This may be another face of the concern behind "reservation".  There is a 
significant issue to discuss around how a holistic infrastructure 
scheduler (indeed, any scheduler really) interacts with something that 
goes around it to the underlying resources.  It is tempting to suggest 
multiple independent managers can somehow cooperate in managing a common 
pool of resources, but this rarely works out well.  I think the practical 
solution is to focus on one manager for any given resource.  But that 
manager must cope somewhat gracefully with surprises because they can 
happen (as I mentioned above).

> > - How and from where will the scheduling service obtain the utilisation 
> > data needed to perform the scheduling? What mechanism will segregate 
> > this information from the end user?
> 
> I do think this is a big missing piece. Right now it is spread out
> all over the place. Keystone at least has regions, so that could be
> incorporated now. I briefly dug through the other API's and don't see
> a way to enumerate AZ's or cells. Perhaps it is hiding in extensions?
> 
> I don't think this must be segregated from end users. An API for "show
> me the placement decisions I can make" seems useful for anybody trying
> to automate deployments. Anyway, probably best to keep it decentralized
> and just make it so that each service can respond with lists of arguments
> to their API that are likely to succeed.

First, I think "utilization" is not the best word for what matters here. 
CPU utilization, for example, is something that fluctuates fairly quickly. 
VM placement should be based on long-term allocation decisions.  Those 
might be informed by utilization information, but they are distinct 
things.

Isn't it true today that Nova packs VMs onto a hypervisor by comparing 
virtual CPUs with real CPUs (multiplied by some configured overcommitment 
factor)?  That is an example of the sort of allocation-based decision 
making I am talking about.  It does not require new utilization 
information from anyone; it requires the scheduler to keep track of the 
allocations it has made --- and the allocations it has discovered someone 
else has made too.  For the latter, I think two mechanisms are good. 
First, the underlying resource service should be able to report all the 
allocations (e.g., the nova compute agents should be able to report what 
VM instances are already running, regardless of who started them). Second, 
recognize that any software can get confused, including the resource 
service; the scheduler should be able to formulate and keep track of an 
adjustment to what the resource service is saying.  This second point is a 
fine detail; there is no need to worry about it at first if Nova is not 
already doing such a thing.
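
To make the arithmetic concrete with made-up numbers: if a host has 8 
physical cores and the configured CPU overcommitment ratio is 16, the 
scheduler treats the host as having 8 * 16 = 128 schedulable vCPUs.  If 
the VMs already placed there add up to 120 vCPUs, an 8-vCPU instance 
still fits but a 16-vCPU one does not.  None of that requires measuring 
how busy those CPUs actually are; it only requires keeping the books on 
what has been allocated.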

> > - How will users communicate their scheduling constraints to OpenStack?
> > (Through which API and in what form?)
> 
> See above. Via the Heat API, a Heat-ish template that is turned into
> concrete Heat instructions.

Yes, this seems like a repeat of the discussion above.

> > - What value does this provide (over and above non-holistic scheduler 
> > hints passed in individual API calls) to end users? Public cloud 
> > operators? Private cloud operators? How might the value be shared 
> > between users and operators, and how would that be accounted for?
> 
> See above, logically expressing what you actually want means the tool
> can improve its response to that. Always expressing things concretely
> means that improvements on the backend are harder to realize.
> 
> Ultimately, it is an end-user tool, but the benefit to a cloud operator
> could be significant.  If one AZ is getting flooded, one can stop
> responding to it, or add hints in the API ranking the AZ lower than
> the others in preference. Users using the holistic scheduler will begin
> using the new AZ without having to be educated about it.

Yes.  Even though a public cloud would not expose all the information that 
a holistic infrastructure scheduler needs to do its job (rather, the 
holistic infrastructure scheduler would be part of the public cloud's 
service), it can accept templates that involve holistic infrastructure 
scheduling.  It is a separation of concerns play: the template author says 
what he wants without getting overly specific about how it gets done.

> > - Does this fit within the scope of an existing OpenStack program? Which
> > one? Why?
> 
> Heat. You want users to use holistic scheduling when it can work for them,
> so having it just be a tweak to their templates is a win.

I think this is actually a pretty interesting question.  If we recognize 
that the heat program has a bigger view (all kinds of orchestration) than 
today's heat engine (infrastructure orchestration), this can partly help 
untangle the heat/not-heat debate.  Holistic infrastructure scheduling is a 
form of scheduling, and the nova scheduler group has some interest in it, 
but it is inherently not limited to nova.  I think it fits best between 
software orchestration preparation and infrastructure orchestration --- in 
the middle of the interests of the heat program. I think we may want to 
recognize that the best flow of processing does not necessarily intersect 
the interests of a given program in only one contiguous region.

> > - What changes are required to existing services to accommodate this 
> > functionality?
> > 
> 
> More exposure of what details can be exploited.

Yes, and the level of control needed to do that exploitation.  The 
meta-model in that policy document identifies the level of detail 
involved.

Regards,
Mike