[openstack-dev] [Ironic] [Nova] continuing the "multiple compute host" discussion

James Penick jpenick at gmail.com
Wed Dec 16 21:51:47 UTC 2015


>We actually called out this problem in the Ironic midcycle and the Tokyo
>summit - we decided to report Ironic's total capacity from each compute
>host (resulting in over-reporting from Nova), and real capacity (for
>purposes of reporting, monitoring, whatever) should be fetched by
>operators from Ironic (IIRC, you specifically were okay with this
>limitation). This is still wrong, but it's the least wrong of any option
>(yes, all are wrong in some way). See the spec[1] for more details.

I do recall that discussion, but the merged spec says:

"In general, a nova-compute running the Ironic virt driver should expose
(total resources)/(number of compute services). This allows for resources to
be sharded across multiple compute services without over-reporting resources."

I agree that what you said via email is Less Awful than what I read in the
spec. (Did I misread it? Am I full of crazy?)
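To make the difference concrete, here's a quick back-of-the-envelope sketch of what the scheduler would see under each reporting strategy. The numbers are purely illustrative (they mirror the 20-node example later in this thread, not anything from either spec):

```python
# Illustrative comparison of the two capacity-reporting strategies under
# discussion. All numbers are hypothetical.
nodes_disk_tb = [24] * 10 + [1] * 10  # 20 ironic nodes, heterogeneous disk
num_compute_services = 2

total = sum(nodes_disk_tb)  # 250 TB across all nodes

# The merged spec's wording: shard total capacity across compute services.
sharded_per_compute = total / num_compute_services  # 125 TB reported each

# The midcycle/summit decision: each compute service reports ironic's full
# capacity (over-reporting, from Nova's point of view).
over_reported_per_compute = total  # 250 TB reported each

# Averaging across nodes is what makes a 24T request unschedulable: the
# scheduler only "sees" 12.5T per node.
avg_per_node = total / len(nodes_disk_tb)  # 12.5 TB

print(sharded_per_compute, over_reported_per_compute, avg_per_node)
```

Neither number matches any real node, which is why the thread keeps calling every option "wrong in some way"; the difference is only in which direction the error points.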

>We *do* still
>need to figure out how to handle availability zones or host aggregates,
>but I expect we would pass along that data to be matched against. I
>think it would just be metadata on a node. Something like
>node.properties['availability_zone'] = 'rackspace-iad-az3' or what have
>you. Ditto for host aggregates - add the metadata to ironic to match
>what's in the host aggregate. I'm honestly not sure what to do about
>(anti-)affinity filters; we'll need help figuring that out.
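A minimal sketch of the node-metadata matching idea described in the quote above. This is purely illustrative: the property name and the shape of the data are assumptions, not an actual Ironic or Nova API:

```python
# Hypothetical sketch of filtering ironic nodes by an availability-zone
# property, as the quoted text suggests. Property names are assumptions.
def az_filter(nodes, requested_az):
    """Keep only nodes whose metadata matches the requested AZ."""
    return [
        n for n in nodes
        if n.get("properties", {}).get("availability_zone") == requested_az
    ]

nodes = [
    {"uuid": "node-1",
     "properties": {"availability_zone": "rackspace-iad-az3"}},
    {"uuid": "node-2",
     "properties": {"availability_zone": "rackspace-iad-az1"}},
]

# Only node-1 carries the requested AZ metadata.
print(az_filter(nodes, "rackspace-iad-az3"))
```

The same shape would presumably work for host-aggregate metadata; (anti-)affinity is harder because it depends on what's already scheduled, not just static node properties.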
&
>Right, I didn't mean gantt specifically, but rather "splitting out the
>scheduler" like folks keep talking about. That's why I said "actually
>exists". :)

 I think splitting out the scheduler isn't going to be realistic. My
feeling is, if Nova is going to fulfill its destiny of being The Compute
Service, then the scheduler will stay put and the VM pieces will split out
into another service (Which I think should be named "Seamus" so I can refer
to it as "The wee baby Seamus").

(re: ironic maintaining host aggregates)
>Yes, and yes, assuming those things are valuable to our users. The
>former clearly is, the latter will clearly depend on the change but I
>expect we will evolve to continue to fit Nova's model of the world
>(after all, fitting into Nova's model is a huge chunk of what we do, and
>is exactly what we're trying to do with this work).

It's a lot easier to fit into the nova model if we just use what's there
and don't bother trying to replicate it.

>Again, the other solutions I'm seeing that *do* solve more problems are:
>* Rewrite the resource tracker

>Do you have an entire team (yes, it will take a relatively large team,
>especially when you include some cores dedicated to reviewing the code)
>that can dedicate a couple of development cycles to one of these?

 We can certainly help.

>I sure
>don't. If and when we do, we can move forward on that and deprecate this
>model, if we find that to be a useful thing to do at that time. Right
>now, this is the best plan I have, that we can commit to completing in a
>reasonable timeframe.

I respect that you're trying to solve the problem we have right now to make
operators' lives Suck Less. But I think a short-term decision made now would
hurt a lot more later on.

-James

On Wed, Dec 16, 2015 at 8:03 AM, Jim Rollenhagen <jim at jimrollenhagen.com>
wrote:

> On Tue, Dec 15, 2015 at 05:19:19PM -0800, James Penick wrote:
> > > getting rid of the raciness of ClusteredComputeManager in my
> > >current deployment. And I'm willing to help other operators do the same.
> >
> >  You do alleviate the race, but at the cost of complexity and
> > unpredictability. Breaking that down, let's say we go with the current
> > plan and the compute host abstracts hardware specifics from Nova. The
> > compute host will report (sum of resources)/(sum of managed compute). If
> > the hardware beneath that compute host is heterogeneous, then the
> > resources reported up to nova are not correct, and that really does have
> > significant impact on deployers.
> >
> >  As an example: Let's say we have 20 nodes behind a compute process. Half
> > of those nodes have 24T of disk, the other have 1T. An attempt to
> > schedule a node with 24T of disk will fail, because Nova scheduler is
> > only aware of 12.5T of disk.
>
> We actually called out this problem in the Ironic midcycle and the Tokyo
> summit - we decided to report Ironic's total capacity from each compute
> host (resulting in over-reporting from Nova), and real capacity (for
> purposes of reporting, monitoring, whatever) should be fetched by
> operators from Ironic (IIRC, you specifically were okay with this
> limitation). This is still wrong, but it's the least wrong of any option
> (yes, all are wrong in some way). See the spec[1] for more details.
>
> >  Ok, so one could argue that you should just run two compute processes
> > per type of host (N+1 redundancy). If you have different raid levels on
> > two otherwise identical hosts, you'll now need a new compute process for
> > each variant of hardware. What about host aggregates or availability
> > zones? This sounds like an N^2 problem. A mere 2 host flavors spread
> > across 2 availability zones means 8 compute processes.
> >
> > I have hundreds of hardware flavors, across different security, network,
> > and power availability zones.
>
> Nobody is talking about running a compute per flavor or capability. All
> compute hosts will be able to handle all ironic nodes. We *do* still
> need to figure out how to handle availability zones or host aggregates,
> but I expect we would pass along that data to be matched against. I
> think it would just be metadata on a node. Something like
> node.properties['availability_zone'] = 'rackspace-iad-az3' or what have
> you. Ditto for host aggregates - add the metadata to ironic to match
> what's in the host aggregate. I'm honestly not sure what to do about
> (anti-)affinity filters; we'll need help figuring that out.
>
> > >None of this precludes getting to a better world where Gantt actually
> > >exists, or the resource tracker works well with Ironic.
> >
> > It doesn't preclude it, no. But Gantt is dead[1], and I haven't seen any
> > movement to bring it back.
>
> Right, I didn't mean gantt specifically, but rather "splitting out the
> scheduler" like folks keep talking about. That's why I said "actually
> exists". :)
>
> > >It just gets us to an incrementally better model in the meantime.
> >
> >  I strongly disagree. Will Ironic manage its own concept of availability
> > zones and host aggregates?  What if nova changes their model, will Ironic
> > change to mirror it?  If not I now need to model the same topology in two
> > different ways.
>
> Yes, and yes, assuming those things are valuable to our users. The
> former clearly is, the latter will clearly depend on the change but I
> expect we will evolve to continue to fit Nova's model of the world
> (after all, fitting into Nova's model is a huge chunk of what we do, and
> is exactly what we're trying to do with this work).
>
> >  In that context, breaking out scheduling and "hiding" ironic resources
> > behind a compute process is going to create more problems than it will
> > solve, and is not the "Least bad" of the options to me.
>
> Again, the other solutions I'm seeing that *do* solve more problems are:
>
> * Rewrite the resource tracker
> * Break out the scheduler into a separate thing
>
> Do you have an entire team (yes, it will take a relatively large team,
> especially when you include some cores dedicated to reviewing the code)
> that can dedicate a couple of development cycles to one of these? I sure
> don't. If and when we do, we can move forward on that and deprecate this
> model, if we find that to be a useful thing to do at that time. Right
> now, this is the best plan I have, that we can commit to completing in a
> reasonable timeframe.
>
> // jim
>
> >
> > -James
> > [1] http://git.openstack.org/cgit/openstack/gantt/tree/README.rst
> >
> > On Mon, Dec 14, 2015 at 5:28 PM, Jim Rollenhagen <jim at jimrollenhagen.com>
> > wrote:
> >
> > > On Mon, Dec 14, 2015 at 04:15:42PM -0800, James Penick wrote:
> > > > I'm very much against it.
> > > >
> > > >  In my environment we're going to be depending heavily on the nova
> > > > scheduler for affinity/anti-affinity of physical datacenter
> > > > constructs, TOR, Power, etc. Like other operators we need to also
> > > > have a concept of host aggregates and availability zones for our
> > > > baremetal as well. If these decisions move out of Nova, we'd have to
> > > > replicate that entire concept of topology inside of the Ironic
> > > > scheduler. Why do that?
> > > >
> > > > I see there are 3 main problems:
> > > >
> > > > 1. Resource tracker sucks for Ironic.
> > > > 2. We need compute host HA
> > > > 3. We need to schedule compute resources in a consistent way.
> > > >
> > > >  We've been exploring options to get rid of RT entirely. However,
> > > > melwitt suggested that by improving RT itself, and changing it from a
> > > > pull model to a push, we skip a lot of these problems. I think it's
> > > > an excellent point. If RT moves to a push model, Ironic can
> > > > dynamically register nodes as they're added, consumed, claimed, etc
> > > > and update their state in Nova.
> > > >
> > > >  Compute host HA is critical for us, too. However, if the compute
> > > > hosts are not responsible for any complex scheduling behaviors, it
> > > > becomes much simpler to move the compute hosts to being nothing more
> > > > than dumb workers selected at random.
> > > >
> > > >  With this model, the Nova scheduler can still select compute
> > > > resources in the way that it expects, and deployers can expect to
> > > > build one system to manage VM and BM. We get rid of RT race
> > > > conditions, and gain compute HA.
> > >
> > > Right, so Deva mentioned this here. Copied from below:
> > >
> > > > > > Some folks are asking us to implement a non-virtualization-centric
> > > > > > scheduler / resource tracker in Nova, or advocating that we wait
> > > > > > for the Nova scheduler to be split-out into a separate project. I
> > > > > > do not believe the Nova team is interested in the former, I do
> > > > > > not want to wait for the latter, and I do not believe that either
> > > > > > one will be an adequate solution -- there are other clients
> > > > > > (besides Nova) that need to schedule workloads on Ironic.
> > >
> > > And I totally agree with him. We can rewrite the resource tracker, or
> > > we can break out the scheduler. That will take years - what do you, as
> > > an operator, plan to do in the meantime? As an operator of ironic
> > > myself, I'm willing to eat the pain of figuring out what to do with my
> > > out-of-tree filters (and cells!), in favor of getting rid of the
> > > raciness of ClusteredComputeManager in my current deployment. And I'm
> > > willing to help other operators do the same.
> > >
> > > We've been talking about this for close to a year already - we need
> > > to actually do something. I don't believe we can do this in a
> > > reasonable timeline *and* make everybody (ironic devs, nova devs, and
> > > operators) happy. However, as we said elsewhere in the thread, the old
> > > model will go through a deprecation process, and we can wait to remove
> > > it until we do figure out the path forward for operators like yourself.
> > > Then operators that need out-of-tree filters and the like can keep
> > > doing what they're doing, while they help us (or just wait) to build
> > > something that meets everyone's needs.
> > >
> > > None of this precludes getting to a better world where Gantt actually
> > > exists, or the resource tracker works well with Ironic. It just gets us
> > > to an incrementally better model in the meantime.
> > >
> > > If someone has a *concrete* proposal (preferably in code) for an
> > > alternative that can be done relatively quickly and also keep everyone
> > > happy here, I'm all ears. But I don't believe one exists at this time,
> > > and I'm inclined to keep rolling forward with what we've got here.
> > >
> > > // jim
> > >
> > > >
> > > > -James
> > > >
> > > > On Thu, Dec 10, 2015 at 4:42 PM, Jim Rollenhagen <jim at jimrollenhagen.com>
> > > > wrote:
> > > >
> > > > > On Thu, Dec 10, 2015 at 03:57:59PM -0800, Devananda van der Veen wrote:
> > > > > > All,
> > > > > >
> > > > > > I'm going to attempt to summarize a discussion that's been going
> > > > > > on for over a year now, and still remains unresolved.
> > > > > >
> > > > > > TLDR;
> > > > > > --------
> > > > > >
> > > > > > The main touch-point between Nova and Ironic continues to be a
> > > > > > pain point, and despite many discussions between the teams over
> > > > > > the last year resulting in a solid proposal, we have not been
> > > > > > able to get consensus on a solution that meets everyone's needs.
> > > > > >
> > > > > > Some folks are asking us to implement a non-virtualization-centric
> > > > > > scheduler / resource tracker in Nova, or advocating that we wait
> > > > > > for the Nova scheduler to be split-out into a separate project. I
> > > > > > do not believe the Nova team is interested in the former, I do
> > > > > > not want to wait for the latter, and I do not believe that either
> > > > > > one will be an adequate solution -- there are other clients
> > > > > > (besides Nova) that need to schedule workloads on Ironic.
> > > > > >
> > > > > > We need to decide on a path of least pain and then proceed. I
> > > > > > really want to get this done in Mitaka.
> > > > > >
> > > > > >
> > > > > > Long version:
> > > > > > -----------------
> > > > > >
> > > > > > During Liberty, Jim and I worked with Jay Pipes and others on the
> > > > > > Nova team to come up with a plan. That plan was proposed in a
> > > > > > Nova spec [1] and approved in October, shortly before the Mitaka
> > > > > > summit. It got significant reviews from the Ironic team, since it
> > > > > > is predicated on work being done in Ironic to expose a new
> > > > > > "reservations" API endpoint. The details of that Ironic change
> > > > > > were proposed separately [2] but have deadlocked. Discussions
> > > > > > with some operators at and after the Mitaka summit have
> > > > > > highlighted a problem with this plan.
> > > > > >
> > > > > > Actually, more than one, so to better understand the divergent
> > > > > > viewpoints that result in the current deadlock, I drew a diagram
> > > > > > [3]. If you haven't read both the Nova and Ironic specs already,
> > > > > > this diagram probably won't make sense to you. I'll attempt to
> > > > > > explain it a bit with more words.
> > > > > >
> > > > > >
> > > > > > [A]
> > > > > > The Nova team wants to remove the (Host, Node) tuple from all the
> > > > > > places that this exists, and return to scheduling only based on
> > > > > > Compute Host. They also don't want to change any existing
> > > > > > scheduler filters (especially not compute_capabilities_filter) or
> > > > > > the filter scheduler class or plugin mechanisms. And, as far as I
> > > > > > understand it, they're not interested in accepting a filter
> > > > > > plugin that calls out to external APIs (eg, Ironic) to identify a
> > > > > > Node and pass that Node's UUID to the Compute Host. [[ nova team:
> > > > > > please correct me on any point here where I'm wrong, or your
> > > > > > collective views have changed over the last year. ]]
> > > > > >
> > > > > > [B]
> > > > > > OpenStack deployers who are using Nova + Ironic rely on a few things:
> > > > > > - compute_capabilities_filter to match
> > > > > > node.properties['capabilities'] against flavor extra_specs.
> > > > > > - other downstream nova scheduler filters that do other sorts of
> > > > > > hardware matching
> > > > > > These deployers clearly and rightly do not want us to take away
> > > > > > either of these capabilities, so anything we do needs to be
> > > > > > backwards compatible with any current Nova scheduler plugins --
> > > > > > even downstream ones.
> > > > > >
> > > > > > [C] To meet the compatibility requirements of [B] without
> > > > > > requiring the nova-scheduler team to do the work, we would need
> > > > > > to forklift some parts of the nova-scheduler code into Ironic.
> > > > > > But I think that's terrible, and I don't think any OpenStack
> > > > > > developers will like it. Furthermore, operators have already
> > > > > > expressed their distaste for this because they want to use the
> > > > > > same filters for virtual and baremetal instances but do not want
> > > > > > to duplicate the code (because we all know that's a recipe for
> > > > > > drift).
> > > > > >
> > > > > > [D]
> > > > > > Whatever solution we devise for scheduling bare metal resources
> > > > > > in Ironic needs to perform well at the scale Ironic deployments
> > > > > > are aiming for (eg, thousands of Nodes) without the use of Cells.
> > > > > > It also must be integrable with other software (eg, it should be
> > > > > > exposed in our REST API). And it must allow us to run more than
> > > > > > one (active-active) nova-compute process, which we can't today.
> > > > > >
> > > > > >
> > > > > > OK. That's a lot of words... bear with me, though, as I'm not
> > > > > > done yet...
> > > > > >
> > > > > > This drawing [3] is a Venn diagram, but not everything overlaps.
> > > > > > The Nova and Ironic specs [0],[1] meet the needs of the Nova team
> > > > > > and the Ironic team, and will provide a more performant,
> > > > > > highly-available solution that is easier to use with other
> > > > > > schedulers or datacenter-management tools. However, this solution
> > > > > > does not meet the needs of some current OpenStack Operators
> > > > > > because it will not support Nova Scheduler filter plugins. Thus,
> > > > > > in the diagram, [A] and [D] overlap but neither one intersects
> > > > > > with [B].
> > > > > >
> > > > > >
> > > > > > Summary
> > > > > > --------------
> > > > > >
> > > > > > We have proposed a solution that fits ironic's HA model into
> > > > > > nova-compute's failure domain model, but that's only half of the
> > > > > > picture -- in so doing, we assumed that scheduling of bare metal
> > > > > > resources was simplistic when, in fact, it needs to be just as
> > > > > > rich as the scheduling of virtual resources.
> > > > > >
> > > > > > So, at this point, I think we need to accept that the scheduling
> > > > > > of virtualized and bare metal workloads are two different problem
> > > > > > domains that are equally complex.
> > > > > >
> > > > > > Either, we:
> > > > > > * build a separate scheduler process in Ironic, forking the Nova
> > > > > > scheduler as a starting point so as to be compatible with
> > > > > > existing plugins; or
> > > > > > * begin building a direct integration between nova-scheduler and
> > > > > > ironic, and create a non-virtualization-centric resource tracker
> > > > > > within Nova; or
> > > > > > * proceed with the plan we previously outlined, accept that this
> > > > > > isn't going to be backwards compatible with nova filter plugins,
> > > > > > and apologize to any operators who rely on using the same
> > > > > > scheduler plugins for baremetal and virtual resources; or
> > > > > > * keep punting on this, bringing pain and suffering to all
> > > > > > operators of bare metal clouds, because nova-compute must be run
> > > > > > as exactly one process for all sizes of clouds.
> > > > >
> > > > > Thanks for summing this up, Deva. The planned solution still gets
> > > > > my vote; we build that, deprecate the old single compute host model
> > > > > where nova handles all scheduling, and in the meantime figure out
> > > > > the gaps that operators need filled and the best way to fill them.
> > > > > Maybe we can fill them by the end of the deprecation period (it's
> > > > > going to need to be a couple cycles), or maybe operators that care
> > > > > about these things need to carry some downstream patches for a bit.
> > > > >
> > > > > I'd be curious how many ops out there run ironic with custom
> > > > > scheduler filters, or rely on the compute capabilities filters.
> > > > > Rackspace has one out-of-tree weigher for image caching, but is
> > > > > okay with moving forward and doing what it takes to move that.
> > > > >
> > > > > // jim
> > > > >
> > > > > >
> > > > > >
> > > > > > Thanks for reading,
> > > > > > Devananda
> > > > > >
> > > > > >
> > > > > >
> > > > > > [0] Yes, there are some hacks to work around this, but they are
> > > > > > bad. Please don't encourage their use.
> > > > > >
> > > > > > [1] https://review.openstack.org/#/c/194453/
> > > > > >
> > > > > > [2] https://review.openstack.org/#/c/204641/
> > > > > >
> > > > > > [3] https://drive.google.com/file/d/0Bz_nyJF_YYGZWnZ2dlAyejgtdVU/view?usp=sharing
> > > > > >
> > > > > > __________________________________________________________________________
> > > > > > OpenStack Development Mailing List (not for usage questions)
> > > > > > Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> > > > > > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev