[openstack-dev] [all] gate debugging
Sean Dague
sean at dague.net
Thu Aug 28 18:16:31 UTC 2014
On 08/28/2014 02:07 PM, Joe Gordon wrote:
>
>
>
> On Thu, Aug 28, 2014 at 10:17 AM, Sean Dague <sean at dague.net> wrote:
>
> On 08/28/2014 12:48 PM, Doug Hellmann wrote:
> >
> > On Aug 27, 2014, at 5:56 PM, Sean Dague <sean at dague.net> wrote:
> >
> >> On 08/27/2014 05:27 PM, Doug Hellmann wrote:
> >>>
> >>> On Aug 27, 2014, at 2:54 PM, Sean Dague <sean at dague.net> wrote:
> >>>
> >>>> Note: thread intentionally broken, this is really a different topic.
> >>>>
> >>>> On 08/27/2014 02:30 PM, Doug Hellmann wrote:
> >>>>> On Aug 27, 2014, at 1:30 PM, Chris Dent <chdent at redhat.com> wrote:
> >>>>>
> >>>>>> On Wed, 27 Aug 2014, Doug Hellmann wrote:
> >>>>>>
> >>>>>>> I have found it immensely helpful, for example, to have a written set
> >>>>>>> of the steps involved in creating a new library, from importing the
> >>>>>>> git repo all the way through to making it available to other projects.
> >>>>>>> Without those instructions, it would have been much harder to split up
> >>>>>>> the work. The team would have had to train each other by word of
> >>>>>>> mouth, and we would have had constant issues with inconsistent
> >>>>>>> approaches triggering different failures. The time we spent building
> >>>>>>> and verifying the instructions has paid off to the extent that we even
> >>>>>>> had one developer not on the core team handle a graduation for us.
> >>>>>>
> >>>>>> +many more for the relatively simple act of just writing stuff down
> >>>>>
> >>>>> "Write it down." is my theme for Kilo.
> >>>>
> >>>> I definitely get the sentiment. "Write it down" is also hard when you
> >>>> are talking about things that do change around quite a bit. OpenStack as
> >>>> a whole sees 250 - 500 changes a week, so the interaction pattern moves
> >>>> around enough that it's really easy to have *very* stale information
> >>>> written down. Stale information is even more dangerous than no
> >>>> information sometimes, as it takes people down very wrong paths.
> >>>>
> >>>> I think we break down on communication when we get into a conversation
> >>>> of "I want to learn gate debugging" because I don't quite know what that
> >>>> means, or where the starting point of understanding is. So those
> >>>> intentions are well meaning, but tend to stall. The reality was there
> >>>> was no road map for those of us that dive in, it's just understanding
> >>>> how OpenStack holds together as a whole and where some of the high risk
> >>>> parts are. And a lot of that comes with days staring at code and logs
> >>>> until patterns emerge.
> >>>>
> >>>> Maybe if we can get smaller more targeted questions, we can help folks
> >>>> better? I'm personally a big fan of answering the targeted questions
> >>>> because then I also know that the time spent exposing that information
> >>>> was directly useful.
> >>>>
> >>>> I'm more than happy to mentor folks. But I just end up finding the "I
> >>>> want to learn" at the generic level something that's hard to grasp onto
> >>>> or figure out how we turn it into action. I'd love to hear more ideas
> >>>> from folks about ways we might do that better.
> >>>
> >>> You and a few others have developed an expertise in this important
> >>> skill. I am so far away from that level of expertise that I don’t know
> >>> the questions to ask. More often than not I start with the console log,
> >>> find something that looks significant, spend an hour or so tracking it
> >>> down, and then have someone tell me that it is a red herring and the
> >>> issue is really some other thing that they figured out very quickly by
> >>> looking at a file I never got to.
> >>>
> >>> I guess what I’m looking for is some help with the patterns. What made
> >>> you think to look in one log file versus another? Some of these jobs
> >>> save a zillion little files, which ones are actually useful? What tools
> >>> are you using to correlate log entries across all of those files? Are
> >>> you doing it by hand? Is logstash useful for that, or is that more
> >>> useful for finding multiple occurrences of the same issue?
> >>>
> >>> I realize there’s not a way to write a how-to that will live forever.
> >>> Maybe one way to deal with that is to write up the research done on
> >>> bugs soon after they are solved, and publish that to the mailing list.
> >>> Even the retrospective view is useful because we can all learn from it
> >>> without having to live through it. The mailing list is a fairly
> >>> ephemeral medium, and something very old in the archives is understood
> >>> to have a good chance of being out of date so we don’t have to keep
> >>> adding disclaimers.
> >>
> >> Sure. Matt's actually working up a blog post describing the thing he
> >> nailed earlier in the week.
> >
> > Yes, I appreciate that both of you are responding to my questions. :-)
> >
> > I have some more specific questions/comments below. Please take all of
> > this in the spirit of trying to make this process easier by pointing out
> > where I’ve found it hard, and not just me complaining. I’d like to work
> > on fixing any of these things that can be fixed, by writing or reviewing
> > patches for early in Kilo.
> >
> >>
> >> Here is my off the cuff set of guidelines:
> >>
> >> #1 - is it a test failure or a setup failure?
> >>
> >> This should be pretty easy to figure out. Test failures come at the end
> >> of the console log and say that tests failed (after you see a bunch of
> >> passing tempest tests).
> >>
> >> Always start at *the end* of files and work backwards.
> >
> > That’s interesting because in my case I saw a lot of failures after the
> > initial “real” problem. So I usually read the logs like C compiler
> > output: Assume the first error is real, and the others might have been
> > caused by that one. Do you work from the bottom up to a point where you
> > don’t see any more errors instead of reading top down?
>
> Bottom up to get to problems, then figure out if it's in a subprocess so
> the problems could exist for a while. That being said, not all tools do
> useful things like actually error when they fail (I'm looking at you
> yum....) so there are always edge cases here.
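
For what it's worth, the bottom-up read can be roughed out in a few lines of
Python. This is only a sketch: ERROR_RE and last_errors are made-up names, not
anything in the infra tooling, and the pattern is a guess that will miss tools
that fail without saying so.

```python
import re

# Hypothetical pattern; real jobs may signal failure in other ways.
ERROR_RE = re.compile(r"\b(ERROR|FAIL(ED)?|Traceback)\b")


def last_errors(lines, limit=5):
    """Scan from the end of the log backwards.

    Returns up to `limit` (line_number, text) pairs, newest first.
    """
    hits = []
    for idx in range(len(lines) - 1, -1, -1):
        if ERROR_RE.search(lines[idx]):
            hits.append((idx + 1, lines[idx].rstrip("\n")))
            if len(hits) == limit:
                break
    return hits
```

Feed it the lines of a console log and read the hits in the order returned;
the earliest hit may still turn out to be the root cause if the failure was in
a subprocess.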
>
> >>
> >> #2 - if it's a test failure, what API call was unsuccessful?
> >>
> >> Start with looking at the API logs for the service at the top level, and
> >> see if there is a simple traceback at the right timestamp. If not,
> >> figure out what that API call was calling out to, again look at the
> >> simple cases assuming failures will create ERRORS or TRACES (though they
> >> often don't).
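
A rough sketch of the "traceback at the right timestamp" check, in Python. It
assumes each service log line starts with a 'YYYY-MM-DD HH:MM:SS' timestamp
and carries the level as a space-delimited ERROR or TRACE word; parse_ts and
errors_near are hypothetical helpers, and the 5 second window is arbitrary.

```python
from datetime import datetime, timedelta


def parse_ts(line):
    """Parse a leading 'YYYY-MM-DD HH:MM:SS' timestamp, or return None."""
    try:
        return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None


def errors_near(lines, when, window_seconds=5):
    """Return ERROR/TRACE lines logged within the window around `when`."""
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip lines without a leading timestamp
        if abs(ts - when) <= window and (" ERROR " in line or " TRACE " in line):
            hits.append(line.rstrip("\n"))
    return hits
```

Take the time of the failed call from the test output, run this over the top
level API log first, then over whatever that service calls out to. An empty
result does not clear a service, since failures often don't log at ERROR.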
> >
> > In my case, a neutron call failed. Most of the other services seem to
> > have a *-api.log file, but neutron doesn’t. It took a little while to
> > find the API-related messages in screen-q-svc.txt (I’m glad I’ve been
> > around long enough to know it used to be called “quantum”). I get that
> > screen-n-*.txt would collide with nova. Is it necessary to abbreviate
> > those filenames at all?
>
> Yeh... service naming could definitely be better, especially with
> neutron. There are implications for long names in screen, but maybe we
> just get over it as we already have too many tabs to be in one page in
> the console anymore anyway.
>
> >> Hints on the service log order you should go after are in the footer
> >> of every log page -
> >> http://logs.openstack.org/76/79776/15/gate/gate-tempest-dsvm-full/700ee7e/logs/
> >> (it's included as an Apache footer) for some services. It's been there
> >> for about 18 months, I think people are fully blind to it at this point.
> >
> > Where would I go to edit that footer to add information about the neutron
> > log files? Is that Apache footer defined in an infra repo?
>
> Note the following at the end of the footer output:
>
> About this Help
>
> This help file is part of the openstack-infra/config project, and can be
> found at modules/openstack_project/files/logs/help/tempest_logs.html .
> The file can be updated via the standard OpenStack Gerrit Review
> process.
>
>
> I took a first whack at trying to add some more information to the
> footer here: https://review.openstack.org/#/c/117390/
\o/ - you rock Joe!
-Sean
--
Sean Dague
http://dague.net