[openstack-dev] [Ironic] [Nova] continuing the "multiple compute host" discussion

James Penick jpenick at gmail.com
Tue Dec 15 00:15:42 UTC 2015


I'm very much against it.

 In my environment we're going to be depending heavily on the nova
scheduler for affinity/anti-affinity across physical datacenter constructs
(top-of-rack switch, power feed, etc.). Like other operators, we also need
host aggregates and availability zones for our baremetal. If these
decisions move out of Nova, we'd have to replicate that entire concept of
topology inside the Ironic scheduler. Why do that?
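
 To make that dependency concrete, here is a rough, self-contained sketch
of the kind of topology-aware check such a filter expresses. The
BaseHostFilter hook is real, but the host_passes() signature varies by
release, and the 'rack' stat and 'different_rack' scheduler hint used here
are invented for illustration; this is not our production filter code.

    # Sketch only: rack-level anti-affinity of the sort a custom nova
    # scheduler filter performs. 'rack' and 'different_rack' are
    # hypothetical names, not real Nova fields.
    from nova.scheduler import filters

    class RackAntiAffinityFilter(filters.BaseHostFilter):
        """Reject hosts/nodes in a rack already used by the server group."""

        def host_passes(self, host_state, filter_properties):
            hints = filter_properties.get('scheduler_hints') or {}
            used_racks = set(hints.get('different_rack', []))
            # Pass if the rack is unknown or not yet used by the group.
            rack = host_state.stats.get('rack')
            return rack is None or rack not in used_racks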

I see there are 3 main problems:

1. Resource tracker sucks for Ironic.
2. We need compute host HA.
3. We need to schedule compute resources in a consistent way.

 We've been exploring options to get rid of RT entirely. However, melwitt
pointed out that by improving RT itself, and changing it from a pull model
to a push model, we skip a lot of these problems. I think that's an
excellent point. If RT moves to a push model, Ironic can dynamically
register nodes as they're added, consumed, claimed, etc., and update their
state in Nova.
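
 To be concrete about what "push" could mean (purely a hypothetical
sketch; neither this payload nor this hook exists in Nova or Ironic
today): instead of nova-compute periodically pulling node inventory
through the driver, Ironic would emit a resource record whenever a node
is enrolled, claimed, or released, and Nova would apply it.

    # Hypothetical push-model update; illustrates the direction of data
    # flow only. The record fields and the 'push' callback are invented.
    def node_resource_record(node):
        """Build the inventory record Ironic would push for one node."""
        props = node.get("properties", {})
        return {
            "node_uuid": node["uuid"],
            "cpus": props.get("cpus", 0),
            "memory_mb": props.get("memory_mb", 0),
            "local_gb": props.get("local_gb", 0),
            "capabilities": props.get("capabilities", ""),
            "reserved": node.get("instance_uuid") is not None,
        }

    def on_node_state_change(node, push):
        """Called when a node is enrolled, claimed, or released."""
        push(node_resource_record(node))

    # Example: a freshly enrolled, unclaimed node is pushed as available.
    on_node_state_change(
        {"uuid": "1be26c0b", "properties": {"cpus": 24,
                                            "memory_mb": 131072,
                                            "local_gb": 930}},
        push=print)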

 Compute host HA is critical for us, too. However, if the compute hosts
are not responsible for any complex scheduling behaviors, it becomes much
simpler to reduce them to nothing more than dumb workers selected at
random.
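
 As a sketch of how little those workers would then need to know
(hypothetical, not the proposed implementation): picking one can be as
dumb as a random or hash-based choice among the compute services currently
reported as up. The hash-based variant below is a crude stand-in for the
hash-ring style mapping that has come up in these discussions.

    # Hypothetical selection of a "dumb" nova-compute worker for a node
    # once no scheduling logic lives on the compute host itself.
    import hashlib
    import random

    def pick_random_worker(alive_workers):
        """Pure random selection among live compute services."""
        return random.choice(alive_workers)

    def pick_consistent_worker(node_uuid, alive_workers):
        """Stable node-to-worker mapping (crude hash-ring stand-in)."""
        digest = hashlib.sha256(node_uuid.encode()).hexdigest()
        return alive_workers[int(digest, 16) % len(alive_workers)]

    workers = ["compute-1", "compute-2", "compute-3"]
    print(pick_consistent_worker("1be26c0b-03f2-4d2e", workers))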

 With this model, the Nova scheduler can still select compute resources in
the way that it expects, and deployers can build a single system to manage
both VM and BM. We get rid of RT race conditions, and we gain compute HA.

-James

On Thu, Dec 10, 2015 at 4:42 PM, Jim Rollenhagen <jim at jimrollenhagen.com>
wrote:

> On Thu, Dec 10, 2015 at 03:57:59PM -0800, Devananda van der Veen wrote:
> > All,
> >
> > I'm going to attempt to summarize a discussion that's been going on for
> > over a year now, and still remains unresolved.
> >
> > TLDR;
> > --------
> >
> > The main touch-point between Nova and Ironic continues to be a pain
> > point, and despite many discussions between the teams over the last
> > year resulting in a solid proposal, we have not been able to get
> > consensus on a solution that meets everyone's needs.
> >
> > Some folks are asking us to implement a non-virtualization-centric
> > scheduler / resource tracker in Nova, or advocating that we wait for the
> > Nova scheduler to be split-out into a separate project. I do not believe
> > the Nova team is interested in the former, I do not want to wait for the
> > latter, and I do not believe that either one will be an adequate solution
> > -- there are other clients (besides Nova) that need to schedule workloads
> > on Ironic.
> >
> > We need to decide on a path of least pain and then proceed. I really want
> > to get this done in Mitaka.
> >
> >
> > Long version:
> > -----------------
> >
> > During Liberty, Jim and I worked with Jay Pipes and others on the Nova
> > team to come up with a plan. That plan was proposed in a Nova spec [1]
> > and approved in October, shortly before the Mitaka summit. It got
> > significant reviews from the Ironic team, since it is predicated on
> > work being done in Ironic to expose a new "reservations" API endpoint.
> > The details of that Ironic change were proposed separately [2] but have
> > deadlocked. Discussions with some operators at and after the Mitaka
> > summit have highlighted a problem with this plan.
> >
> > Actually, more than one, so to better understand the divergent viewpoints
> > that result in the current deadlock, I drew a diagram [3]. If you haven't
> > read both the Nova and Ironic specs already, this diagram probably won't
> > make sense to you. I'll attempt to explain it a bit with more words.
> >
> >
> > [A]
> > The Nova team wants to remove the (Host, Node) tuple from all the
> > places that this exists, and return to scheduling only based on
> > Compute Host. They also don't want to change any existing scheduler
> > filters (especially not compute_capabilities_filter) or the filter
> > scheduler class or plugin mechanisms. And, as far as I understand it,
> > they're not interested in accepting a filter plugin that calls out to
> > external APIs (eg, Ironic) to identify a Node and pass that Node's
> > UUID to the Compute Host.  [[ nova team: please correct me on any
> > point here where I'm wrong, or your collective views have changed over
> > the last year. ]]
> >
> > [B]
> > OpenStack deployers who are using Nova + Ironic rely on a few things:
> > - compute_capabilities_filter to match node.properties['capabilities']
> >   against flavor extra_specs.
> > - other downstream nova scheduler filters that do other sorts of
> >   hardware matching
> > These deployers clearly and rightly do not want us to take away either
> > of these capabilities, so anything we do needs to be backwards
> > compatible with any current Nova scheduler plugins -- even downstream
> > ones.
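
For anyone not familiar with the first item above: Ironic stores
capabilities as a single "k1:v1,k2:v2" string in
node.properties['capabilities'], and flavors carry "capabilities:<key>"
extra_specs; the filter passes a node only if every requested key/value
is present. A minimal, self-contained sketch of that matching follows
(illustrative only, not the actual ComputeCapabilitiesFilter code, which
also handles other properties and comparison operators).

    # Illustrative sketch of the capabilities matching deployers rely on.
    def parse_capabilities(cap_string):
        """Parse Ironic's 'k1:v1,k2:v2' capabilities string into a dict."""
        caps = {}
        for item in (cap_string or "").split(","):
            key, sep, value = item.partition(":")
            if sep:
                caps[key.strip()] = value.strip()
        return caps

    def capabilities_match(node_properties, flavor_extra_specs):
        """True if every 'capabilities:<key>' extra spec matches the node."""
        node_caps = parse_capabilities(node_properties.get("capabilities"))
        for spec, wanted in flavor_extra_specs.items():
            if spec.startswith("capabilities:"):
                key = spec[len("capabilities:"):]
                if node_caps.get(key) != wanted:
                    return False
        return True

    node = {"capabilities": "boot_mode:uefi,rack:r7"}
    flavor = {"capabilities:boot_mode": "uefi"}
    print(capabilities_match(node, flavor))   # True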
> >
> > [C] To meet the compatibility requirements of [B] without requiring
> > the nova-scheduler team to do the work, we would need to forklift some
> > parts of the nova-scheduler code into Ironic. But I think that's
> > terrible, and I don't think any OpenStack developers will like it.
> > Furthermore, operators have already expressed their distaste for this
> > because they want to use the same filters for virtual and baremetal
> > instances but do not want to duplicate the code (because we all know
> > that's a recipe for drift).
> >
> > [D]
> > Whatever solution we devise for scheduling bare metal resources in
> > Ironic needs to perform well at the scale Ironic deployments are
> > aiming for (eg, thousands of Nodes) without the use of Cells. It also
> > must be integrable with other software (eg, it should be exposed in
> > our REST API). And it must allow us to run more than one
> > (active-active) nova-compute process, which we can't do today.
> >
> >
> > OK. That's a lot of words... bear with me, though, as I'm not done yet...
> >
> > This drawing [3] is a Venn diagram, but not everything overlaps. The
> > Nova and Ironic specs [1],[2] meet the needs of the Nova team and the
> > Ironic team, and will provide a more performant, highly-available
> > solution that is easier to use with other schedulers or
> > datacenter-management tools. However, this solution does not meet the
> > needs of some current OpenStack Operators because it will not support
> > Nova Scheduler filter plugins. Thus, in the diagram, [A] and [D]
> > overlap but neither one intersects with [B].
> >
> >
> > Summary
> > --------------
> >
> > We have proposed a solution that fits ironic's HA model into
> > nova-compute's failure domain model, but that's only half of the
> > picture -- in so doing, we assumed that scheduling of bare metal
> > resources was simplistic when, in fact, it needs to be just as rich as
> > the scheduling of virtual resources.
> >
> > So, at this point, I think we need to accept that scheduling
> > virtualized workloads and scheduling bare metal workloads are two
> > different problem domains that are equally complex.
> >
> > Either we:
> > * build a separate scheduler process in Ironic, forking the Nova
> >   scheduler as a starting point so as to be compatible with existing
> >   plugins; or
> > * begin building a direct integration between nova-scheduler and
> >   ironic, and create a non-virtualization-centric resource tracker
> >   within Nova; or
> > * proceed with the plan we previously outlined, accept that this isn't
> >   going to be backwards compatible with nova filter plugins, and
> >   apologize to any operators who rely on using the same scheduler
> >   plugins for baremetal and virtual resources; or
> > * keep punting on this, bringing pain and suffering to all operators
> >   of bare metal clouds, because nova-compute must be run as exactly
> >   one process for all sizes of clouds.
>
> Thanks for summing this up, Deva. The planned solution still gets my
> vote; we build that, deprecate the old single compute host model where
> nova handles all scheduling, and in the meantime figure out the gaps
> that operators need filled and the best way to fill them. Maybe we can
> fill them by the end of the deprecation period (it's going to need to be
> a couple cycles), or maybe operators that care about these things need
> to carry some downstream patches for a bit.
>
> I'd be curious how many ops out there run ironic with custom scheduler
> filters, or rely on the compute capabilities filters. Rackspace has one
> out-of-tree weigher for image caching, but is okay with moving forward
> and doing what it takes to move it.
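
For context, a weigher of that sort plugs into the scheduler's weighing
step rather than its filtering step. A rough sketch of the shape is below;
the BaseHostWeigher hook is real, but the 'image_ref' key and the
cached-image bookkeeping are assumptions for illustration, and this is not
Rackspace's actual weigher.

    # Rough sketch of an image-caching weigher. Both 'image_ref' in
    # weight_properties and 'cached_images' in host_state.stats are
    # invented for illustration, not real Nova fields.
    from nova.scheduler import weights

    class ImageCacheWeigher(weights.BaseHostWeigher):
        """Prefer hosts that already have the requested image cached."""

        def _weigh_object(self, host_state, weight_properties):
            wanted = weight_properties.get('image_ref')
            cached = host_state.stats.get('cached_images', [])
            return 1.0 if wanted in cached else 0.0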
>
> // jim
>
> >
> >
> > Thanks for reading,
> > Devananda
> >
> >
> >
> > [0] Yes, there are some hacks to work around this, but they are bad.
> > Please don't encourage their use.
> >
> > [1] https://review.openstack.org/#/c/194453/
> >
> > [2] https://review.openstack.org/#/c/204641/
> >
> > [3]
> > https://drive.google.com/file/d/0Bz_nyJF_YYGZWnZ2dlAyejgtdVU/view?usp=sharing