[openstack-dev] [Ironic] Node groups and multi-node operations
Devananda van der Veen
devananda.vdv at gmail.com
Sun Jan 26 18:27:36 UTC 2014
On Sat, Jan 25, 2014 at 7:11 AM, Clint Byrum <clint at fewbar.com> wrote:
> Excerpts from Robert Collins's message of 2014-01-25 02:47:42 -0800:
> > On 25 January 2014 19:42, Clint Byrum <clint at fewbar.com> wrote:
> > > Excerpts from Robert Collins's message of 2014-01-24 18:48:41 -0800:
> > >> > However, in looking at how Ironic works and interacts with Nova, it
> > >> > doesn't seem like there is any distinction of data per-compute-node
> > >> > inside Ironic. So for this to work, I'd have to run a whole bunch of
> > >> > ironic instances, one per compute node. That seems like something we
> > >> > don't want to do.
> > >>
> > >> Huh?
> > >>
> > >
> > > I can't find anything in Ironic that lets you group nodes by anything
> > > except chassis. It was not a serious discussion of how the problem would
> > > be solved, just a point that without some way to tie ironic nodes to
> > > compute-nodes I'd have to run multiple ironics.
> > I don't understand the point. There is no tie between ironic nodes and
> > compute nodes. Why do you want one?
> Because sans Ironic, compute-nodes still have physical characteristics
> that make grouping on them attractive for things like anti-affinity. I
> don't really want my HA instances "not on the same compute node", I want
> them "not in the same failure domain". It becomes a way for all
> OpenStack workloads to have more granularity than "availability zone".
Yes, and with Ironic, these same characteristics are desirable but are
no longer properties of a nova-compute node; they're properties of the
hardware which Ironic manages.
In principle, the same (hypothetical) failure-domain-aware scheduling
could be done if Ironic exposes the same sort of group awareness, as
long as the nova "ironic" driver passes that information up to the
scheduler in a sane way. In that case, Ironic would need to represent
such information, even if it's not acting on it, which I think is
trivial for us to do.
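To make that concrete, here is a minimal sketch of what a
failure-domain-aware anti-affinity filter could look like. The class
name, the "failure_domain" property, and the host_passes() shape are
all hypothetical, chosen only to illustrate the idea that Ironic just
needs to report a failure-domain value that the driver passes up:

```python
# Hypothetical sketch only -- not the actual Nova filter API.
# Assumes each candidate host carries a 'failure_domain' value,
# e.g. passed up by the nova "ironic" driver from Ironic's node data.

class FailureDomainAntiAffinityFilter(object):
    """Reject hosts whose failure domain is already in use by the group."""

    def host_passes(self, host, spec):
        # Collect the failure domains already occupied by members
        # of the (hypothetical) server group in this request spec.
        used = {h.get('failure_domain') for h in spec.get('group_hosts', [])}
        return host.get('failure_domain') not in used
```

The point is that the filter never needs to know *which* hardware node
backs a host, only the opaque failure-domain label Ironic reports, so
no node-to-compute binding is required.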
> So if we have all of that modeled in compute-nodes, then when adding
> physical hardware to Ironic one just needs to have something to model
> the same relationship for each physical hardware node. We don't have to
> do it by linking hardware nodes to compute-nodes, but that would be
> doable for a first cut without much change to Ironic.
By binding hardware to nova-compute, you're trading away fault
tolerance in your control plane for failure-domain awareness. Ironic
is designed
explicitly to decouple the instances of Ironic (and Nova) within the
control plane from the hardware it's managing. This is one of the main
shortcomings of nova baremetal, and it doesn't seem like a worthy
trade, even for a first approximation.
> > >> The changes to Nova would be massive and invasive as they would be
> > >> redefining the driver api....and all the logic around it.
> > >>
> > >
> > > I'm not sure I follow you at all. I'm suggesting that the scheduler have
> > > a new thing to filter on, and that compute nodes push their unique ID
> > > down into the Ironic driver so that while setting up nodes in Ironic one
> > > can assign them to a compute node. That doesn't sound massive and
> > > invasive.
This is already being done *within* Ironic as nodes are mapped
dynamically to ironic-conductor instances; the coordination for
failover/takeover needs to be improved, but that's incremental at this
point. Moving this mapping outside of Ironic is going to be messy and
complicated, and breaks the abstraction layer. The API change may seem
small, but it will massively overcomplicate Nova by duplicating all
the functionality of ironic-conductor in another layer of the stack.
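The dynamic mapping mentioned above can be sketched with a consistent
hash ring, which is the usual way to spread nodes across conductor
instances so that a conductor joining or leaving re-maps only a
fraction of the nodes. This is a simplified illustration, not Ironic's
actual implementation, which layers replication and takeover logic on
top:

```python
# Sketch of mapping nodes to conductors with a consistent hash ring.
# Simplified for illustration; field names are not Ironic's.
import bisect
import hashlib


def _hash(key):
    # Stable integer hash of an arbitrary string key.
    return int(hashlib.md5(key.encode('utf-8')).hexdigest(), 16)


class HashRing(object):
    def __init__(self, conductors, replicas=32):
        # Place several virtual points per conductor on the ring
        # so the node distribution stays roughly even.
        self.ring = sorted(
            (_hash('%s-%d' % (c, i)), c)
            for c in conductors for i in range(replicas))
        self.keys = [k for k, _ in self.ring]

    def conductor_for(self, node_uuid):
        # Walk clockwise to the first point at or after the node's hash.
        idx = bisect.bisect(self.keys, _hash(node_uuid)) % len(self.ring)
        return self.ring[idx][1]
```

Because the mapping is derived from hashes rather than stored bindings,
there is nothing per-node to migrate when a conductor fails; another
conductor simply becomes the owner of that arc of the ring.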
> > I think we're perhaps talking about different things - in the section
> > you were answering, I thought he was talking about whether the API
> > should offer operations on arbitrary sets of nodes at once, or whether
> > each operation should be a separate API call vs what I now think you
> > were talking about which was whether operations should be able to
> > describe logical relations to other instances/nodes. Perhaps if we use
> > the term 'batch' rather than 'group' to talk about the
> > multiple-things-at-once aspect, and grouping to talk about the
> > primarily scheduler related problems of affinity / anti affinity etc,
> > we can avoid future confusion.
> Yes, that's a good point. I was talking about modeling failure domains
> only. Batching API requests seems like an entirely different thing.
I was conflating these terms: I was referring both to "grouping
actions" (batching) and to "groups of nodes" (groups). That said, there
are really three distinct topics here, so let's break groups down
further: a "logical group" for failure domains, and a "hardware group"
for hardware which is physically interdependent, such that changes to
one node affect other node(s).
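A tiny data-model sketch of that distinction, purely illustrative (the
field names are not a proposed Ironic schema):

```python
# Hypothetical sketch of the two node-group kinds discussed above.
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    name: str
    kind: str          # 'logical' (failure domain) or 'hardware'
    node_uuids: tuple  # member nodes


# Logical group: a scheduler hint only; Ironic represents it
# but need not act on it.
rack_a = NodeGroup('rack-a', 'logical', ('node-1', 'node-2'))

# Hardware group: changes to one member can affect the others,
# e.g. blades sharing a single chassis power controller.
chassis_7 = NodeGroup('chassis-7', 'hardware', ('node-3', 'node-4'))
```

Batching, by contrast, would live purely at the API layer (one request
touching many nodes) and needs no representation in the data model at
all.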