Open Stack

Wed Aug 27 19:12:19 UTC 2014

On Wed, Aug 27, 2014 at 10:59:36AM -0400, Doug Hellmann wrote:
> 
> On Aug 27, 2014, at 8:47 AM, Sean Dague <sean at dague.net> wrote:

[. . .]

> > So I think we all want the future where OpenStack is a really nice set
> > of composable services that let you easily create the cloud you want.
> > They are all stable, upgradeable, and easy to understand.
> > 
> > I think the challenge is, personally, I don't see how we get there on
> > current course and speed.
> > 
> > I just got back from effectively 2 weeks out (1 on vacation, 1 at Linux
> > con). And basically the moment I was seen active on IRC I got slammed
> > with pings of 'this kind of job now seems wedged, help!' 'this thing
> > between two projects is blocking and we can't figure out how to land the
> > code, help!'
> > 
> > As someone who spends a ton of time unwinding the completely crazy ways
> > OpenStack fails to work with OpenStack given the complexity of
> > interaction, it's exhausting. And, honestly, I'm feeling pretty strongly
> > that it's time for me to step away from the gate. Because I actually
> > think that the fact that I built this mental model and have been able to
> > smooth over so many things is actually making the situation worse,
> > because people largely don't realize how *incredibly* manual the
> > assembly of these parts are, and that it's not sustainable.
> 
> You do an amazing job with gate debugging.

This cannot be over-emphasized! Sean helped me debug a few Gate issues
and introduced me to the relevant infrastructure tools (Elastic Recheck,
LogStash).

> There are a lot of other
> people, myself included, who would like to be able to help but do not
> yet have the expertise to do it the way you do. What can we do to
> shorten the “training” period? Is there a way to bootstrap us?

FWIW, from my limited observation of debugging a few Gate issues, I
think it's the "frustrating phase" (as you put it in your other email)
where the time-consuming human intervention is needed.

I'm sure you're familiar with it; I'm just spelling out what I do as I
learn:

  - A patch fails due to a random Jenkins job that's blocking the
    Gate
  - Check the logs for that specific test job, grep for errors/failures
  - After some initial investigation, you stumble on a specific error
    from a specific test
     - Check if there's an existing bug for that
        - Yes? Participate in it, spend time doing root cause analysis
          if you're familiar with the area or notify relevant
          developers. Also, check if someone already wrote an Elastic
          Recheck signature for that bug
        - No? File a bug, write an Elastic Recheck query for it and
          submit for review. 

  - If the bug is blocking the Gate, we usually see the fire
    extinguishers in action before we even realize it: root cause
    identified, fix submitted to unblock the Gate, etc.
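
As a rough illustration of the "grep for errors/failures" step above,
here is a minimal Python sketch. The keyword pattern and the sample log
lines are made up for illustration; real job logs will likely need
different keywords, so treat this as a starting point only.

```python
import re

# Hypothetical keyword pattern; adjust it to the job's actual log format.
ERROR_RE = re.compile(r"\b(ERROR|FAIL(?:ED|URE)?|Traceback)\b")


def grep_errors(lines):
    """Return (line_number, text) pairs for lines that look like errors."""
    return [
        (n, line.rstrip("\n"))
        for n, line in enumerate(lines, 1)
        if ERROR_RE.search(line)
    ]


# Made-up sample log lines, not taken from a real job.
sample_log = [
    "2014-08-27 10:00:01 INFO starting test run\n",
    "2014-08-27 10:00:05 ERROR connection refused\n",
    "2014-08-27 10:00:06 DEBUG retrying\n",
    "Traceback (most recent call last):\n",
]

hits = grep_errors(sample_log)
for n, text in hits:
    print(f"{n}: {text}")
```

With a real log you would pass an open file object instead of the
sample list; the line numbers make it easier to jump to the first hit
and read the surrounding context.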

[Real Life: 01:00 AM? Think of getting some sleep :-) ]

That's just a very rough approximation. I'm surely missing a few
intricate steps, but I'm writing it off the top of my head as I try to
learn Gate debugging.

PS: If it's a rarely reproducible bug, then multiply all the above (and
more) by a factor of N to feel the untold pain.

> Having an example with some logs and then even stream of consciousness
> notes like “I noticed the out of memory error, and then I found the
> first instance of that and looked at the oom-killer report in syslog
> to see which process was killed and it was X which might mean Y” would
> help.

Absolutely. 

And I find it really helpful to have the failure logs prompted with
useful information/URLs, so that anyone can get started with a
preliminary investigation.
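
To make the oom-killer example quoted above a bit more concrete, here is
a minimal Python sketch that finds the first process killed by the
oom-killer in a syslog. It assumes the kernel logs a line containing
"Killed process <pid> (<name>)"; the exact wording varies between
kernel versions, and the sample lines below are made up for
illustration.

```python
import re

# Assumed message shape: "Killed process <pid> (<name>)".
# Check this against the wording in your own syslog before relying on it.
KILLED_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def first_oom_victim(lines):
    """Return (line_number, pid, name) for the first killed process, or None."""
    for n, line in enumerate(lines, 1):
        match = KILLED_RE.search(line)
        if match:
            return n, int(match.group(1)), match.group(2)
    return None


# Made-up sample syslog lines, not taken from a real failure.
sample_syslog = [
    "Aug 27 10:00:01 host kernel: python invoked oom-killer\n",
    "Aug 27 10:00:01 host kernel: Out of memory: Kill process 4242 (mysqld)\n",
    "Aug 27 10:00:01 host kernel: Killed process 4242 (mysqld)\n",
    "Aug 27 10:05:00 host kernel: Killed process 5151 (python)\n",
]

victim = first_oom_victim(sample_syslog)
print(victim)
```

Knowing which process died first is only the "it was X" half; working
out "which might mean Y" still needs a human who knows the services
involved.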

-- 
/kashyap

[openstack-dev] [all] The future of the integrated release
