[openstack-dev] [heat] [scheduler] Bringing things together for Icehouse
Mike Spreitzer
mspreitz at us.ibm.com
Sun Sep 15 07:19:48 UTC 2013
I've read up on recent goings-on in the scheduler subgroup, and have some
thoughts to contribute.
But first I must admit that I am still a newbie to OpenStack and am
missing some important clues. One thing that mystifies me is this: I see
essentially the same thing, which I have generally taken to calling
holistic scheduling, discussed in two mostly separate contexts: (1) the
(nova) scheduler context, and (2) the ambitions for heat. What am I
missing?
I have read the Unified Resource Placement Module document (at
https://docs.google.com/document/d/1cR3Fw9QPDVnqp4pMSusMwqNuB_6t-t_neFqgXA98-Ls/edit?pli=1#
) and the NovaSchedulerPerspective document (at
https://docs.google.com/document/d/1_DRv7it_mwalEZzLy5WO92TJcummpmWL4NWsWf0UWiQ/edit?pli=1#heading=h.6ixj0ctv4rwu
). My group already has running code along these lines, and thoughts for
future improvements, so I'll mention some salient characteristics. I have
read the etherpad at
https://etherpad.openstack.org/IceHouse-Nova-Scheduler-Sessions - and I
hope my remarks will help fit these topics together.
Our current code uses one long-lived process to make placement decisions.
The information it needs to do this job is proactively maintained in its
memory. We are planning to try replacing this one process with a set of
equivalent processes; we are not sure how well that will work out (we are
a research group).
We make a distinction between desired state, target state, and observed
state. The desired state comes in through REST requests, each giving a
full virtual resource topology (VRT). A VRT includes constraints that
affect placement, but does not include actual placement decisions. Those
are made by what we call the placement agent. Yes, it is separate from
orchestration (even in the first architecture figure in the u-rpm document
the orchestration is separate --- the enclosing box does not negate the
essential separateness). In our architecture, orchestration is downstream
from placement (as in u-rpm). The placement agent produces target state,
which is essentially desired state augmented by placement decisions.
Observed state is what comes from the lower layers (Software Defined
Compute, Storage, and Network). We mainly use OpenStack APIs for the
lower layers, and have added a few local extensions to make the whole
story work.
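
To make the state distinction concrete, here is a rough sketch in Python;
the field names are illustrative, not our actual schema:

# Desired state: a virtual resource topology (VRT) with constraints
# that affect placement, but no placement decisions.
desired_vrt = {
    "name": "web-tier",
    "resources": {
        "vm1": {"type": "VM", "flavor": "m1.large"},
        "vm2": {"type": "VM", "flavor": "m1.large"},
        "vol1": {"type": "Volume", "size_gb": 100},
    },
    "constraints": [
        # e.g., keep the two VMs on different hosts
        {"type": "anti-collocation", "between": ["vm1", "vm2"]},
    ],
}

# Target state: the same topology, augmented by the placement agent
# with concrete placement decisions.
target_vrt = dict(desired_vrt, placements={
    "vm1": "host-17",
    "vm2": "host-42",
    "vol1": "cinder-backend-3",
})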
The placement agent judges available capacity by subtracting current
allocations from raw capacity. The placement agent maintains in its
memory a derived thing we call effective state; the allocations in
effective state are the union of the allocations in target state and the
allocations in observed state. Since the orchestration is downstream,
some of the planned allocations are not in observed state yet. Since
other actors can use the underlying cloud, and other weird sh*t happens,
not all the allocations are in target state. That's why placement is done
against the union of the allocations. This is somewhat conservative, but
the alternatives are worse.
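
In pseudo-Python (the record shapes here are hypothetical), the effective
state and the resulting capacity check look roughly like this:

# Each allocation record is (virtual_resource_id, container_id, demand),
# where demand is a tuple of numbers, one per dimension.

def effective_allocations(target_allocs, observed_allocs):
    # Planned allocations not yet realized appear only in target state;
    # allocations made by other actors appear only in observed state.
    # Placing against the union covers both cases.
    return target_allocs | observed_allocs   # sets of records

def available_capacity(container_id, raw_capacity, effective):
    # Available capacity = raw capacity minus every effective
    # allocation on this container, dimension by dimension.
    used = [0] * len(raw_capacity)
    for _vr, cid, demand in effective:
        if cid == container_id:
            used = [u + d for u, d in zip(used, demand)]
    return [cap - u for cap, u in zip(raw_capacity, used)]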
Note that placement is concerned with allocations rather than current
usage. Current usage fluctuates much faster than you would want placement
to. Placement needs to be done with a long-term perspective. Of course,
that perspective can be informed by usage information (as well as other
sources) --- but it remains a distinct thing.
We consider all our copies of observed state to be soft --- they can be
lost and reconstructed at any time, because the true source is the
underlying cloud. Which is not to say that reconstructing a copy is
cheap. We prefer making incremental updates as needed, rather than
re-reading the whole thing. One of our local extensions adds a mechanism
by which a client can register to be notified of changes in the Software
Defined Compute area.
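
Sketched in Python (the notification mechanism is one of our local
extensions, and these names are invented for illustration), a soft copy
looks like this:

class ObservedStateCache(object):
    # A soft copy of observed state: rebuildable from the underlying
    # cloud at any time, but normally maintained by cheap incremental
    # updates driven by change notifications.

    def __init__(self, list_all):
        self._list_all = list_all   # e.g., wraps a full Nova listing
        self._items = {}

    def rebuild(self):
        # Expensive full re-read; correct but to be avoided.
        self._items = dict((i["id"], i) for i in self._list_all())

    def apply_change(self, event):
        # Incremental update from a registered change notification.
        if event["kind"] == "deleted":
            self._items.pop(event["id"], None)
        else:   # "created" or "updated"
            self._items[event["id"]] = event["body"]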
The target state, on the other hand, is stored authoritatively by the
placement agent in a database.
We pose placement as a constrained optimization problem, with a non-linear
objective. We approximate its solution with a very generic algorithm; it
is easy to add new kinds of constraints and new contributions to the
objective.
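
The generality comes from treating constraints and objective
contributions as pluggable callables. A much-simplified greedy
illustration (not our actual solver, which approximates the non-linear
objective more carefully):

def place(resources, containers, constraints, objective_terms, state):
    placement = {}
    for vr in resources:
        feasible = [c for c in containers
                    if all(ok(vr, c, placement, state)
                           for ok in constraints)]
        if not feasible:
            raise Exception("no feasible container for %s" % (vr,))
        # Choose the container that minimizes the sum of the
        # pluggable objective contributions.
        placement[vr] = min(feasible,
                            key=lambda c: sum(t(vr, c, placement, state)
                                              for t in objective_terms))
    return placement

Adding a new kind of constraint, or a new contribution to the objective,
is then just a matter of adding another callable to the relevant list.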
The core placement problem is about packing virtual resources into
physical containers (e.g., VMs into hosts, volumes into Cinder backends).
A virtual resource has a demand vector, and a corresponding container has
a capacity vector of the same length. For a given container, the sum of
the demand vectors of the virtual resources in that container cannot
exceed the container's capacity vector in any dimension. We can add
dimensions as needed to handle the relevant host/guest characteristics.
We are just now working on an example where a Cinder volume can be required
to be the only one hosted on whatever Cinder backend hosts it. This is
exactly analogous to requiring that a VM (bare metal or otherwise) be the
only one hosted by whatever PM hosts it.
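
One way to encode such exclusivity (an illustration, not necessarily our
encoding) is as an extra dimension in the demand and capacity vectors:

def fits(existing_demands, new_demand, capacity):
    # True iff new_demand can be added to the container without the
    # summed demands exceeding capacity in any dimension.
    if existing_demands:
        totals = [sum(dim) for dim in zip(*existing_demands)]
    else:
        totals = [0] * len(capacity)
    return all(t + d <= cap
               for t, d, cap in zip(totals, new_demand, capacity))

# Dimensions: (vcpus, memory_mb, slots). Every virtual resource
# demands at least one slot; an exclusive one demands all of them,
# so nothing else can share its container.
MAX_SLOTS = 1000
host_capacity = (16, 65536, MAX_SLOTS)
ordinary_vm   = (4, 8192, 1)
exclusive_vm  = (4, 8192, MAX_SLOTS)

assert fits([], exclusive_vm, host_capacity)                  # alone: fits
assert not fits([exclusive_vm], ordinary_vm, host_capacity)   # no sharing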
We favor a fairly expressive language for stating desired policies and
relationships in VRTs. We think this is necessary when you move beyond
simple examples to more realistic ones. We do not favor chopping the
cloud up into little pieces due to inexpressiveness in the VRT language.
Regards,
Mike