[openstack-dev] [TripleO] Kicking TripleO up a notch

James Slagle james.slagle at gmail.com
Thu Oct 3 22:03:28 UTC 2013


On Tue, Oct 1, 2013 at 4:37 AM, Robert Collins
<robertc at robertcollins.net> wrote:
> Now for a less condensed, and hopefully more useful version :)
> Our goal is to deliver a continuously deployed version of OpenStack.
> Right now, we're working on plumbing to build a /good/ version of
> that. Note the difference: 'deliver an X', 'building stuff to let us
> deliver a good X'.

Overall, I agree with the approach.  But I think it really helps that a lot of
the low-level tooling already exists.  I think a CD environment is definitely
going to help us find issues a lot quicker than we're finding them now.

>
> This is key: we've managed to end up focusing on bottom-up work,
> rather than optimising our ability to deliver the thing, and
> iteratively improving it. The former is necessary but not sufficient.
> Tuskar has been working top down, and (as usual) this results in very
> fast progress; the rest of TripleO has provided a really solid
> foundation, but with many gaps and super rough spots...
>
> So, I'd like to invert our priorities and start with the deliverable,
> however slipshod, and then iterate to make it better and better along
> the design paths we've already thought about. This could extend all
> the way to Tuskar, or we could start with the closest thing within
> reach, which is the existing 'no-state-for-tripleo' style CLI + API
> based tooling.
>
> In the call we had, we agreed that this approach makes a lot of sense,
> and spent a bunch of time talking through the ramifications on TripleO
> and Ironic, and sketched out one way to slice and dice things;
> https://docs.google.com/drawings/d/1kgBlHvkW8Kj_ynCA5oCILg4sPqCUvmlytY5p1p9AjW0/edit?usp=sharing
> is the diagram we came up with.

Phase 0...makes sense.

A couple of questions about the other phases:

What is "Persistent Overcloud with CD" in TripleO Phase 1?  Is that where
the overcloud gets upgraded on each commit, rather than torn down and
redeployed?  I'd take it this is the image-based upgrade approach, where
we'd need the read-only / and somewhere to store the persistent data, as
has been previously discussed?

If another goal of the Phase 1 MVP is to stop causing API downtime during
upgrades, then that implies an HA Overcloud?  I believe that also implies
we'd need support across the upstream OpenStack projects for different
versions of the same service being compatible (to an extent).  Meaning, if
we have an HA Overcloud with 2 control nodes, and we bring one of the
nodes down, upgrade Nova to a newer version, and start the upgraded node
again, the two running Novas need to be interoperable.  AIUI, this type of
support is still not ready in most projects.  But I guess that's why this
is Phase 1 and not 0 :).
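To make that interoperability requirement concrete, here's a small
illustrative sketch (this is not Nova's actual code, and `cap_version` is
a made-up name): during the upgrade window, the newer node caps the
message versions it emits at the lowest version still running in the
cluster, so the not-yet-upgraded node can decode everything it receives.

```python
# Illustrative only -- not Nova's implementation.  Versions are modeled as
# (major, minor) tuples; the newer service pins what it sends to the
# oldest version still deployed anywhere in the cluster.

def cap_version(local_version, deployed_versions):
    """Return the highest message version every running node understands."""
    return min([local_version] + list(deployed_versions))

# Mid-upgrade: one control node runs 2.5, the other still runs 2.3.
assert cap_version((2, 5), [(2, 3)]) == (2, 3)  # speak the old dialect
# Once both nodes are upgraded, the cap lifts.
assert cap_version((2, 5), [(2, 5)]) == (2, 5)
```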

In Phase 2, does Undercloud CD also imply a persistent Undercloud?  I'm
guessing yes, since the Overcloud couldn't stay persistent if its
undercloud were destroyed.


>
> The basic approach is to actually deliver the thing we want to deliver
> - a live working CD overcloud *ourselves* and iterate on that to make
> upgrades of that preserve state,

Ok, I think this roughly answers my question about what "persistent" meant.

> then start tackling CD of its
> infrastructure, then remove the seed.

Removing the seed and starting with the undercloud is one of the areas
I've looked at, with the goal of making it easier to bootstrap an
undercloud for the folks working on Tuskar.  I know I've pointed out these
things before, but I wanted to do so again here.  I'm not sure whether
these efforts align with the long-term vision of "removing the seed", or
what exactly the plan is around that.  I just want to make folks aware of
them, so as to avoid duplication if similar paths are chosen.

First, there's the undercloud-live effort to build a live USB image of
an undercloud that people can boot, and install if they choose to.
https://github.com/agroup/undercloud-live

Second, undercloud-live makes use of some other Python code I worked on
to apply d-i-b elements to the current system, as opposed to a chroot.
This is the work I mentioned in Seattle (still working on a patch for
d-i-b proper for this code, btw).  For now, it's at:
https://github.com/agroup/python-dib-elements/
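For context on what "applying elements to the current system" means: a
d-i-b element is essentially a directory of hook scripts grouped by phase
and run in lexical order, and the in-place approach runs those hooks
against / instead of inside a chroot.  A minimal sketch of that
hook-runner model (`run_phase` and the layout below are illustrative; this
is not the actual API of diskimage-builder or python-dib-elements):

```python
# Illustrative sketch of the d-i-b element model: hooks live in phase
# subdirectories (install.d/, pre-install.d/, ...) and carry numeric
# prefixes like 10-foo so they run in a predictable lexical order.
import os
import subprocess

def run_phase(element_dir, phase):
    """Run every executable hook for one phase of an element, in order."""
    phase_dir = os.path.join(element_dir, phase)  # e.g. .../install.d
    if not os.path.isdir(phase_dir):
        return  # element doesn't participate in this phase
    for hook in sorted(os.listdir(phase_dir)):
        path = os.path.join(phase_dir, hook)
        if os.access(path, os.X_OK):
            subprocess.check_call([path])
```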

undercloud-live is Fedora-based at the moment, because we wanted to
integrate it easily with the Fedora build toolchain.


>
> Ramifications:
>  - long term a much better project health and responsiveness to
> changing user needs.
>  - may cause disruption in the short term as we do what's needed to get
> /something/ working.

I *think* this is a fair trade-off.  Though, I'm not sure I understand the
short-term disruption.  Do you just mean there won't be as many people
focusing on devtest and the low-level tooling because they're focused on
the CD environment instead?

>  - will need community buy-in and support to make it work : two of the
> key things about working Lean are keeping WIP - inventory - low and
> ensuring that bottlenecks are not used for anything other than
> bottleneck tasks. Both of these things impact what can be done at any
> point in time within the project: we may need to say 'no' to proposed
> work to permit driving more momentum as a whole... at least in the
> short term.

Can you give some examples of something that might be said no to?

In my head, I read that as "refactoring or new functionality that is likely to
break stuff that works now".

Some of the high-level things that are important to me now are:

- getting fixes committed to any of the repos to correct issues that are
  causing things to not work as intended
- new d-i-b elements for new functionality
- minor changes to existing d-i-b elements: making something more
  configurable if needed, fixing installation issues, etc.
- perhaps new heat templates for additional deployment scenarios (as
  opposed to changes to existing heat templates)

Do you see anything like that suffering?

Like you say below, it's open source, so people can still work on what
they want to :).  And in that regard, the things I mentioned above might
only really suffer if there is suddenly a much longer turnaround time on
reviews, upstream feedback, etc.

> Basic principles:
>  - unblock bottlenecks first, then unblock everyone else.
>  - folk are still self directed - it's open source - but clear
> visibility of work needed and its relevance to the product *right
> now* is crucial info for people to make good choices. (and similar
> Mark McLoughlin was asking about what bugs to work on which is a
> symptom of me/us failing to provide clear visibility)
>  - clear communication about TripleO and plans / strategy and priority
> (Datacentre ops? Continuous deployment story?)

+1, that stuff makes sense.

> Implementing this:
> For TripleO we've broken down the long term vision in a few phases
> that *start* with an end user deliverable and then backfill to add
> sophistication and polish.
>
> We're suggesting that at any point in time the following should be the
> heuristics for TripleO contributors for what to work on:
> 1) Firedrill ‘something we've delivered broke’: aim to avoid this, but
> if it happens it takes priority.
> 2) Things to make things we've delivered and are maintaining more
> reliable / less likely to break: Things that reduce category 1 work.
> 3) Things to make the things we've delivered better *or* things to
> make something new exist/get delivered.
>
> Our long term steady state should be a small amount of category 2 work
> and a lot of category 3 with no category 1; but to get there we have
> to go through a crucible where it will be all category 1 and category
> 2: we should expect all forward momentum to stop while we get our

I'd classify recent forward momentum as polishing the devtest story, and
working on the tooling to do so.  So maybe that is set aside for a moment
while the CD environment is brought up.

However, I think that having a working devtest is important.  devtest can
be quite daunting to a newcomer, but a nice thing about it is that it
gives people not familiar with TripleO, and new contributors, a place to
get started.  And I think that's important for the community.

> stuff lined up and live. After that though we'll have a small stable
> *end product* base, and we can expand that out featurewise and depth
> (reliability/performance/reduce firedrills..)wise.
>
> To surface WIP + current planned work, I find Kanban works super well.
> So I am proposing the following structure:
>  - Current work the team is focused on will be represented as Kanban cards
>  - Those cards can be standalone, or link to an etherpad, or a bug, or
> a blueprint as appropriate
>    - standalone cards should be those that don't fit as bugs or
> blueprints; we shouldn't replace those other tracking systems
>  - As a team we all commit to picking up work based on the heuristics above
>  - The kanban exposes the category of work directly, making it easy to choose
>  - if there is someone working on a higher category of work than us,
> we should bias to *helping them* rather than continuing on our own way
> or picking up a new lower category card: it's better to unblock the
> system as a whole than push forward something we can't use yet.

I'll say that I really like tracking stuff in Trello.  I think the
reality is that there are going to be some well-defined project goals
(like you're doing here), and probably other people (or groups of people)
within the community may have slightly different goals.

Not saying that those are necessarily going to conflict.  Just that there
may be other stuff that folks are trying to accomplish.  The more stuff
like that that can be shared in a public Trello for TripleO, the better
for everyone.



-- 
-- James Slagle
--
