[openstack-dev] [heat] [scheduler] Bringing things together for Icehouse

Mike Spreitzer mspreitz at us.ibm.com
Sun Sep 15 07:19:48 UTC 2013


I've read up on recent goings-on in the scheduler subgroup, and have some 
thoughts to contribute.

But first I must admit that I am still a newbie to OpenStack and am still 
missing some important clues.  One thing that mystifies me is this: I see 
essentially the same thing, which I have generally taken to calling 
holistic scheduling, discussed in two mostly separate contexts: (1) the 
(nova) scheduler context, and (2) the ambitions for heat.  What am I 
missing?

I have read the Unified Resource Placement Module document (at 
https://docs.google.com/document/d/1cR3Fw9QPDVnqp4pMSusMwqNuB_6t-t_neFqgXA98-Ls/edit?pli=1#
) and NovaSchedulerPerspective document (at 
https://docs.google.com/document/d/1_DRv7it_mwalEZzLy5WO92TJcummpmWL4NWsWf0UWiQ/edit?pli=1#heading=h.6ixj0ctv4rwu
).  My group already has running code along these lines, and thoughts for 
future improvements, so I'll mention some salient characteristics.  I have 
read the etherpad at 
https://etherpad.openstack.org/IceHouse-Nova-Scheduler-Sessions - and I 
hope my remarks will help fit these topics together.

Our current code uses one long-lived process to make placement decisions. 
The information it needs to do this job is pro-actively maintained in its 
memory.  We are planning to try replacing this one process with a set of 
equivalent processes, though we are not yet sure how well that will work 
out (we are a research group).

We make a distinction between desired state, target state, and observed 
state.  The desired state comes in through REST requests, each giving a 
full virtual resource topology (VRT).  A VRT includes constraints that 
affect placement, but does not include actual placement decisions.  Those 
are made by what we call the placement agent.  Yes, it is separate from 
orchestration (even in the first architecture figure in the u-rpm document 
the orchestration is separate --- the enclosing box does not negate their 
essential separateness).  In our architecture, orchestration is downstream 
from placement (as in u-rpm).  The placement agent produces target state, 
which is essentially desired state augmented by placement decisions. 
Observed state is what comes from the lower layers (Software Defined 
Compute, Storage, and Network).  We mainly use OpenStack APIs for the 
lower layers, and have added a few local extensions to make the whole 
story work.
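
To make that three-way split concrete, here is a rough Python sketch of 
the data model.  All class and field names are invented for illustration; 
this is not our actual code:

    from dataclasses import dataclass, field

    @dataclass
    class VirtualResource:
        name: str
        demand: dict                    # e.g. {"vcpus": 2, "memory_mb": 4096}
        constraints: list = field(default_factory=list)

    @dataclass
    class VRT:
        """Desired state: resources and placement-affecting constraints,
        but no placement decisions."""
        resources: list

    @dataclass
    class TargetState:
        """Desired state augmented with the agent's placement decisions."""
        vrt: VRT
        placements: dict = field(default_factory=dict)  # resource name -> host

    # Observed state is whatever the underlying cloud (Nova, Cinder,
    # Neutron) currently reports, read through its APIs.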

The placement agent judges available capacity by subtracting current 
allocations from raw capacity.  The placement agent maintains in its 
memory a derived thing we call effective state; the allocations in 
effective state are the union of the allocations in target state and the 
allocations in observed state.  Since the orchestration is downstream, 
some of the planned allocations are not in observed state yet.  Since 
other actors can use the underlying cloud, and other weird sh*t happens, 
not all the allocations are in target state.  That's why placement is done 
against the union of the allocations.  This is somewhat conservative, but 
the alternatives are worse.
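
A minimal sketch of that computation, with invented names and demands 
represented as dicts keyed by dimension:

    def effective_allocations(target, observed):
        """target, observed: dicts mapping resource id -> (host, demand).
        A resource that is both planned and already observed counts once;
        the target version wins."""
        merged = dict(observed)
        merged.update(target)
        return merged

    def available_capacity(host, raw_capacity, allocations):
        """Subtract every effective allocation on `host` from its raw
        capacity, dimension by dimension."""
        avail = dict(raw_capacity)
        for _rid, (h, demand) in allocations.items():
            if h == host:
                for dim, amount in demand.items():
                    avail[dim] -= amount
        return avail

Note that keying the union by resource id is what keeps an allocation 
that is both planned and already observed from being counted twice.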

Note that placement is concerned with allocations rather than current 
usage.  Current usage fluctuates much faster than you would want placement 
to.  Placement needs to be done with a long-term perspective.  Of course, 
that perspective can be informed by usage information (as well as other 
sources) --- but it remains a distinct thing.

We consider all our copies of observed state to be soft --- they can be 
lost and reconstructed at any time, because the true source is the 
underlying cloud.  Which is not to say that reconstructing a copy is 
cheap.  We prefer making incremental updates as needed, rather than 
re-reading the whole thing.  One of our local extensions adds a mechanism 
by which a client can register to be notified of changes in the Software 
Defined Compute area.
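
The shape of that cache maintenance, as a sketch; the event format and 
the cloud-facing calls below are invented for illustration, and our local 
extension's actual interface differs:

    class ObservedStateCache:
        """Soft copy of observed state; losing it costs only a resync,
        because the underlying cloud remains the source of truth."""

        def __init__(self, cloud):
            self.cloud = cloud
            self.resources = {}

        def full_resync(self):
            # Expensive: re-read everything from the underlying cloud.
            self.resources = self.cloud.list_all_resources()

        def on_change_event(self, event):
            # Cheap: apply one incremental update from a notification.
            if event["op"] == "deleted":
                self.resources.pop(event["id"], None)
            else:
                self.resources[event["id"]] = event["resource"]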

The target state, on the other hand, is stored authoritatively by the 
placement agent in a database.

We pose placement as a constrained optimization problem, with a non-linear 
objective.  We approximate its solution with a very generic algorithm; it 
is easy to add new kinds of constraints and new contributions to the 
objective.
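
To show what I mean by generic, here is a toy skeleton.  It is greedy 
and places one resource at a time, which our real algorithm does not; 
every name below is invented:

    def place(resources, hosts, constraints, objective_terms, state):
        """Constraints are predicates; objective terms are costs to be
        summed.  Both lists can grow without touching the search itself."""
        placements = {}
        for res in resources:
            feasible = [h for h in hosts
                        if all(c(res, h, state, placements)
                               for c in constraints)]
            if not feasible:
                raise RuntimeError("no feasible host for %s" % res["name"])
            best = min(feasible,
                       key=lambda h: sum(t(res, h, state, placements)
                                         for t in objective_terms))
            placements[res["name"]] = best
        return placements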

The core placement problem is about packing virtual resources into 
physical containers (e.g., VMs into hosts, volumes into Cinder backends). 
A virtual resource has a demand vector, and a corresponding container has 
a capacity vector of the same length.  For a given container, the sum of 
the demand vectors of the virtual resources in that container cannot 
exceed the container's capacity vector in any dimension.  We can add 
dimensions as needed to handle the relevant host/guest characteristics.
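
The packing constraint itself is just an elementwise vector comparison, 
e.g.:

    def fits(demands, capacity):
        """demands: list of demand dicts; capacity: one capacity dict;
        all keyed by the same dimension names."""
        totals = {dim: 0 for dim in capacity}
        for demand in demands:
            for dim, amount in demand.items():
                totals[dim] += amount
        return all(totals[dim] <= capacity[dim] for dim in capacity)

    # fits([{"vcpus": 2, "memory_mb": 4096}] * 3,
    #      {"vcpus": 8, "memory_mb": 16384})  ->  True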

We are just now working through an example where a Cinder volume can be required 
to be the only one hosted on whatever Cinder backend hosts it.  This is 
exactly analogous to requiring that a VM (bare metal or otherwise) be the 
only one hosted by whatever PM hosts it.
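
In the pluggable scheme sketched above, that kind of requirement can be 
expressed as an ordinary predicate constraint.  Again the names are 
invented, and the two state lookups are hypothetical:

    def exclusivity_constraint(res, host, state, placements):
        """Feasible iff an exclusive resource would be alone on `host`,
        and a non-exclusive one would not join an exclusive resident."""
        occupants = [name for name, h in placements.items() if h == host]
        occupants += state.observed_residents(host)   # hypothetical lookup
        if res.get("exclusive"):
            return not occupants
        return not any(state.is_exclusive(o) for o in occupants)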

We favor a fairly expressive language for stating desired policies and 
relationships in VRTs.  We think this is necessary when you move beyond 
simple examples to more realistic ones.  We do not favor chopping the 
cloud up into little pieces due to inexpressiveness in the VRT language.

Regards,
Mike