[openstack-dev] [TripleO] Kicking TripleO up a notch

Robert Collins robertc at robertcollins.net
Tue Oct 1 08:37:16 UTC 2013


Warning, this is a little long, but it's the distillation of a
2.mumble hour call I had late last week with Devananda and Clint. It's
a proposal: please do comment and critique it.

The tl;dr read is:
 - we've been doing good work
 - but most of us are currently focused on the tech rather than the
customer stories
 - let's fix that
 - Start with customer story and work back to minimum work needed
   - and we can actually be *delivering* that story [we have hardware
for it, thanks to HP]
 - Focus on efficiency and reducing firefighting before new features
 - https://trello.com/tripleo as an experimental kanban for this [1]

Now for a less condensed, and hopefully more useful version :)

The night before the call I finished reading
http://www.amazon.com/The-Phoenix-Project-Business-ebook/dp/B00AZRBLHO/ref=sr_1_9?s=digital-text&ie=UTF8&qid=1380182909&sr=1-9&keywords=the+goal
which is a devops casting of 'The Goal', a seminal work in the LEAN
manufacturing space. (It's terrible writing in a lot of ways, but it
also does do a pretty good job IMO of highlighting the
systems-thinking aspects of CD... but it doesn't drill into the
detailed analysis of each aspect, so some followup reading is required
to get chapter and verse on e.g. 'single item flow is ideal').

It reminded me very strongly of things I used to hold as very
important, but I've been sidetracked into playing with the tech -
which I love - and not focusing on ... 'The goal'. I grabbed Clint,
and Deva, and tried to grab Joe - to get a cross section of focus
areas : Heat, Ironic/NovaBM/Nova - to sanity check what was in my head
:).

Our goal is to deliver a continuously deployed version of OpenStack.
Right now, we're working on plumbing to build a /good/ version of
that. Note the difference: 'deliver an X', 'building stuff to let us
deliver a good X'.

This is key: we've managed to end up focusing on bottom-up work,
rather than optimising our ability to deliver the thing, and
iteratively improving it. The former is necessary but not sufficient.
Tuskar has been working top down, and (as usual) this results in very
fast progress; the rest of TripleO has provided a really solid
foundation, but with many gaps and super rough spots...

So, I'd like to invert our priorities and start with the deliverable,
however slipshod, and then iterate to make it better and better along
the design paths we've already thought about. This could extend all
the way to Tuskar, or we could start with the closest thing within
reach, which is the existing 'no-state-for-tripleo' style CLI + API
based tooling.

In the call we had, we agreed that this approach makes a lot of sense,
and spent a bunch of time talking through the ramifications on TripleO
and Ironic, and sketched out one way to slice and dice things;
https://docs.google.com/drawings/d/1kgBlHvkW8Kj_ynCA5oCILg4sPqCUvmlytY5p1p9AjW0/edit?usp=sharing
is the diagram we came up with.

The basic approach is to actually deliver the thing we want to deliver
- a live working CD overcloud - *ourselves*, and iterate on that to
make upgrades of it preserve state, then start tackling CD of its
infrastructure, then remove the seed.

Ramifications:
 - long term, much better project health and responsiveness to
changing user needs.
 - may cause disruption in the short term as we do what's needed to get
/something/ working.
 - will need community buy-in and support to make it work : two of the
key things about working Lean are keeping WIP - inventory - low and
ensuring that bottlenecks are not used for anything other than
bottleneck tasks. Both of these things impact what can be done at any
point in time within the project: we may need to say 'no' to proposed
work to permit driving more momentum as a whole... at least in the
short term.
 - highlights that we'll need much better communication about what
work is suitable to tackle now vs what work is a distraction at this
point
   - Which implies much more work from someone in the group on
surfacing work to do and where it's blocked. {I'll happily take this
bullet, for now, for TripleO}
 - May need more hardware :)
 - We'll need to change how we pick work to work on, how we decide
whether to accept new work or not, and how we prioritise things.

Basic principles:
 - unblock bottlenecks first, then unblock everyone else.
 - folk are still self directed - it's open source - but clear
visibility of work needed and its relevance to the product *right
now* is crucial info for people to make good choices. (Similarly,
Mark McLoughlin was asking about what bugs to work on, which is a
symptom of me/us failing to provide clear visibility.)
 - clear communication about TripleO and plans / strategy and priority
(Datacentre ops? Continuous deployment story?)

Earlier this year, within HP, we set up a rack using the TripleO of the
time, for a customer, and the experience was fantastic: we made much
more forward progress towards what's needed than we had in the month or
two leading up to it... but then we went back to business as usual,
and things went back to the prior pace.

Implementing this:
For TripleO we've broken down the long term vision into a few phases
that *start* with an end user deliverable and then backfill to add
sophistication and polish.

We're suggesting that at any point in time the following should be the
heuristics for TripleO contributors for what to work on:
1) Firedrill ‘something we've delivered broke’: Aim to avoid this, but
if it happens it takes priority.
2) Things to make things we've delivered and are maintaining more
reliable / less likely to break: Things that reduce category 1 work.
3) Things to make the things we've delivered better *or* things to
make something new exist/get delivered.

Our long term steady state should be a small amount of category 2 work
and a lot of category 3 with no category 1; but to get there we have
to go through a crucible where it will be all category 1 and category
2: we should expect all forward momentum to stop while we get our
stuff lined up and live. After that though we'll have a small stable
*end product* base, and we can expand that out both in features and in
depth (reliability / performance / fewer firedrills).

To surface WIP + current planned work, I find Kanban works super well.
So I am proposing the following structure:
 - Current work the team is focused on will be represented as Kanban cards
 - Those cards can be standalone, or link to an etherpad, or a bug, or
a blueprint as appropriate
   - standalone cards should be those that don't fit as bugs or
blueprints; we shouldn't replace those other tracking systems
 - As a team we all commit to picking up work based on the heuristics above
 - The kanban exposes the category of work directly, making it easy to choose
 - if there is someone working on a higher category of work than us,
we should bias to *helping them* rather than continuing on our own way
or picking up a new lower category card: it's better to unblock the
system as a whole than push forward something we can't use yet.
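As a rough illustration of the picking rule above (not anything Trello
or the team actually runs), here is a minimal Python sketch. The Card
class, the pick_work function and the sample card titles are all
hypothetical names made up for this example; category numbers follow
the 1/2/3 heuristics listed earlier, with 1 being the highest.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Card:
    """Hypothetical kanban card; category 1 is the most urgent."""
    title: str
    category: int  # 1 = firedrill, 2 = reliability, 3 = better / new
    workers: List[str] = field(default_factory=list)


def pick_work(cards: List[Card], me: str,
              current: Optional[Card] = None) -> Optional[Card]:
    """Suggest the card `me` should be working on.

    Our own candidate is the card we already hold, or else the most
    urgent unclaimed card. If someone else is on a strictly higher
    category card, we bias to helping them instead.
    """
    own = current
    if own is None:
        unclaimed = [c for c in cards if not c.workers]
        own = min(unclaimed, key=lambda c: c.category, default=None)

    # Cards that somebody other than us is actively working on.
    in_progress = [c for c in cards
                   if c is not current and any(w != me for w in c.workers)]
    most_urgent = min(in_progress, key=lambda c: c.category, default=None)

    if most_urgent is not None and (
            own is None or most_urgent.category < own.category):
        return most_urgent
    return own


if __name__ == "__main__":
    board = [
        Card("overcloud deploy is broken", 1, ["clint"]),
        Card("add retries to image build", 2),
        Card("new feature spike", 3, ["rob"]),
    ]
    # rob holds a category 3 card while clint fights a firedrill,
    # so the suggestion is to go and help clint.
    print(pick_work(board, "rob", current=board[2]).title)
```

The sketch deliberately only *suggests* a card: folk are still self
directed, so it models the bias, not a rule anyone is forced to follow.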

Clint and I have set up a draft Kanban, so we can concretely discuss
how this looks and feels.

Seeking-yr-thoughts-ly,
Rob

Notes:
1 - Trello is not a super good or bad Kanban, but it has the significant
advantages for an experiment that it's free and already operational.
Should we decide this works, we'd want to work with -infra to get a
Kanban suitable for dealing with much or all of OpenStack lined up
sooner rather than later. In particular, with the large number of
developers OpenStack has, any outages or defects in a system can have
a huge negative multiplier when they crop up.

-- 
Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Converged Cloud


