[openstack-dev] [nova][scheduler] Instance Group Model and APIs - Updated document with an example request payload
Mike Spreitzer
mspreitz at us.ibm.com
Wed Oct 30 04:11:04 UTC 2013
Following is my reaction to the last few hours of discussion.
Russell Bryant wrote "Nova calling heat to orchestrate Nova seems
fundamentally wrong". I am not totally happy about this either, but would
you be OK with Nova orchestrating Nova? To me, that seems worse ---
duplicating functionality we already have in Heat. The way I see it, we
have to decide how to cope with the inescapable fact that orchestration is
downstream from joint decision making. I see no better choices than: (1)
a 1-stage API in which the client presents the whole top-level group and
is done, or (2) a 2-stage API in which the client first presents the whole
top-level group and second proceeds to orchestrate the creations of the
resources in that group. BTW, when we go holistic, (1) will look less
offensive: there will be a holistic infrastructure scheduler doing the
joint decision making first, not one of the individual services, and that
is followed by orchestration of the individual resources. If we took Alex
Glikson's suggestion and started holistic, we would not be so upset on
this issue.
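To make the two choices concrete, here is a rough sketch in Python; the client object and its method names (create_group, boot_member) are made up for illustration, not part of any existing Nova or Heat API:

# A minimal sketch, assuming a hypothetical instance-group client whose
# method names are illustrative only.

def one_stage(client, group_spec):
    """Option (1): the client presents the whole top-level group and is done.

    The service makes the joint placement decision and also orchestrates
    the creation of every resource in the group.
    """
    return client.create_group(group_spec, and_create_resources=True)

def two_stage(client, group_spec):
    """Option (2): present the group first, then orchestrate the members.

    Stage 1 makes and commits the joint placement decision; stage 2
    creates each resource, reusing the allocations committed in stage 1.
    """
    group = client.create_group(group_spec)      # stage 1: decide and commit
    servers = []
    for member in group.members:                 # stage 2: orchestrate
        servers.append(client.boot_member(group.id, member))
    return servers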
Alex also wrote:
``I wonder whether it is possible to find an approach that takes into
account cross-resource placement considerations (VM-to-VM communicating
over the application network, or VM-to-volume communicating over storage
network), but does not require delivering all the intimate details of the
entire environment to a single place -- which probably can not be either
of Nova/Cinder/Neutron/etc.. but can we still use the individual
schedulers in each of them with partial view of the environment to drive a
placement decision which is consistently better than random?''
I think you could create a cross-scheduler protocol that would accomplish
joint placement decision making --- but would not want to. It would
involve a lot of communication, and the subject matter of that
communication would be most of what you need in a centralized placement
solver anyway. You do not need "all the intimate details", just the bits
that are essential to making the placement decision.
Reacting to Andrew Lasky's note, Chris Friesen noted:
``As soon as we start trying to do placement logic outside of Nova it
becomes trickier to deal with race conditions when competing against
other API users trying to acquire resources at the same time.''
I have two reactions. The simpler one is: we can avoid this problem if we
simply route all placement problems (either all placement problems for
Compute, or all placement problems for a larger set of services) through
one thing that decides and commits allocations. My other reaction is: we
will probably want multi-engine. That is, the option to run several
placement solvers concurrently --- with optimistic concurrency control.
That presents essentially the same problem as Chris noted. As Yathi noted
in one of his responses, this can be handled by appropriate implementation
structure. In the spring I worked out a multi-engine design for my
group's old code. The conclusion I reached is that after a placement
engine finds a solution, you want an essentially ACID transaction that (1)
checks that the solution is still valid and, if so, (2) makes the
allocations in that solution.
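To make that concrete, here is a minimal sketch in Python of the check-and-commit step; the in-memory capacity table and the lock stand in for what would really be a database transaction:

# A minimal sketch, assuming an in-memory table of free vCPUs per host;
# the numbers and host names are made up.
import threading

capacity = {"host1": 16, "host2": 16}
_commit_lock = threading.Lock()

def commit(solution):
    """Atomically (1) check the solution is still valid, (2) make its allocations.

    `solution` maps host -> vCPUs the placement engine wants to allocate there.
    Returns True on success; False tells the engine to re-solve on fresher data.
    """
    with _commit_lock:
        # (1) validity check: every host still has the capacity the solver assumed
        if any(capacity.get(host, 0) < vcpus for host, vcpus in solution.items()):
            return False
        # (2) commit: record the allocations
        for host, vcpus in solution.items():
            capacity[host] -= vcpus
        return True

# Several engines can solve concurrently against possibly stale snapshots;
# only this short check-and-commit step is serialized, and a failed commit
# just means "retry against fresher data".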
Yathi wrote that the 2-stage API creates race conditions, but I do not see
that. As we are starting with Nova only, in the first of the two stages
Nova can both decide and commit the allocations in one transaction; the
second stage just picks up and uses the allocations made in the first
stage.
Alex Glikson asked why not go directly to holistic if there is no value in
doing Nova-only. Yathi replied to that concern, and let me add some
notes. I think there *are* scenarios in which doing Nova-only joint
policy-based scheduling is advantageous. For example, if the storage is
in a SAN or NAS, then there is not a strong interaction between scheduling
compute and storage, so you do not need holistic scheduling to get good
availability. I know some organizations build their datacenters that way,
with full cross-sectional bandwidth between the compute and storage,
because (among other things) it makes that simplification. Another thing
that can be done with joint policy-based scheduling is to minimize license
costs for certain IBM software. That software is licensed based on how
many cores the software has access to, so in a situation with
hyperthreading or overcommitment the license cost can depend on how the VM
instances are arranged among hosts.
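To illustrate with made-up numbers: suppose, hypothetically, that the license charges for every physical core of every host that runs at least one instance of the licensed software. Then a placement that packs those instances onto fewer hosts licenses fewer cores:

# A minimal sketch of the license-cost effect; the per-host licensing rule,
# host sizes, and placements below are hypothetical.

CORES_PER_HOST = 16

def licensed_cores(placement):
    """placement maps host -> number of licensed VM instances on that host."""
    return sum(CORES_PER_HOST for host, vms in placement.items() if vms > 0)

spread = {"host1": 1, "host2": 1, "host3": 1, "host4": 1}   # 4 hosts touched
packed = {"host1": 4, "host2": 0, "host3": 0, "host4": 0}   # 1 host touched

print(licensed_cores(spread))   # 64 cores licensed
print(licensed_cores(packed))   # 16 cores licensed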
Yathi replied to Khanh-Toan's remark about edge policies, but I suspect
there was a misunderstanding. I think the critique concerns this part of
the input:
"policies" : [ {
"edge" : "http-app-edge-1",
"policy_uuid" : "some-policy-uuid-2",
"type" : "edge",
"policy_id" : 33333
} ],
"edges" : [ {
"r_member" : "app-server-group-1",
"l_member" : "http-server-group-1",
"name" : "http-app-edge-1"
} ],
That is, the top-level group contains a "policies" section that refers to
one of the edges, while the edges are defined in a different section. I,
and I think Khanh, would find it more natural for the edge definition to
inline its references to policies (yes, we understand these are only
references). In the example, it might look like this:
"edges" : [ {
"r_member" : "app-server-group-1",
"l_member" : "http-server-group-1",
"name" : "http-app-edge-1",
"policies : [ {
"policy_uuid" : "some-policy-uuid-2",
"policy_id" : 33333
} ],
} ],
By writing the policy references right where they apply, the example is
easier to understand and shorter (we no longer need the "type" and "edge"
fields to identify the context).
BTW, do we really want to ask the client to supply both the policy_id and
the policy_uuid in a policy reference? Isn't one of those sufficient?
Regards,
Mike