[openstack-dev] [all] gate debugging

Sean Dague sean at dague.net
Thu Aug 28 18:16:31 UTC 2014


On 08/28/2014 02:07 PM, Joe Gordon wrote:
> 
> 
> 
> On Thu, Aug 28, 2014 at 10:17 AM, Sean Dague <sean at dague.net> wrote:
> 
>     On 08/28/2014 12:48 PM, Doug Hellmann wrote:
>     >
>     > On Aug 27, 2014, at 5:56 PM, Sean Dague <sean at dague.net> wrote:
>     >
>     >> On 08/27/2014 05:27 PM, Doug Hellmann wrote:
>     >>>
>     >>> On Aug 27, 2014, at 2:54 PM, Sean Dague <sean at dague.net> wrote:
>     >>>
>     >>>> Note: thread intentionally broken, this is really a different topic.
>     >>>>
>     >>>> On 08/27/2014 02:30 PM, Doug Hellmann wrote:
>     >>>>> On Aug 27, 2014, at 1:30 PM, Chris Dent <chdent at redhat.com> wrote:
>     >>>>>
>     >>>>>> On Wed, 27 Aug 2014, Doug Hellmann wrote:
>     >>>>>>
>     >>>>>>> I have found it immensely helpful, for example, to have a
>     >>>>>>> written set of the steps involved in creating a new library,
>     >>>>>>> from importing the git repo all the way through to making it
>     >>>>>>> available to other projects. Without those instructions, it
>     >>>>>>> would have been much harder to split up the work. The team
>     >>>>>>> would have had to train each other by word of mouth, and we
>     >>>>>>> would have had constant issues with inconsistent approaches
>     >>>>>>> triggering different failures. The time we spent building and
>     >>>>>>> verifying the instructions has paid off to the extent that we
>     >>>>>>> even had one developer not on the core team handle a
>     >>>>>>> graduation for us.
>     >>>>>>
>     >>>>>> +many more for the relatively simple act of just writing stuff down
>     >>>>>
>     >>>>> “Write it down.” is my theme for Kilo.
>     >>>>
>     >>>> I definitely get the sentiment. "Write it down" is also hard
>     >>>> when you are talking about things that do change around quite a
>     >>>> bit. OpenStack as a whole sees 250-500 changes a week, so the
>     >>>> interaction pattern moves around enough that it's really easy to
>     >>>> have *very* stale information written down. Stale information is
>     >>>> sometimes even more dangerous than no information, as it takes
>     >>>> people down very wrong paths.
>     >>>>
>     >>>> I think we break down on communication when we get into a
>     >>>> conversation of "I want to learn gate debugging", because I
>     >>>> don't quite know what that means, or where the starting point of
>     >>>> understanding is. So those intentions are well meaning, but tend
>     >>>> to stall. The reality is there was no road map for those of us
>     >>>> who dove in; it's just understanding how OpenStack holds
>     >>>> together as a whole and where some of the high-risk parts are.
>     >>>> And a lot of that comes with days staring at code and logs until
>     >>>> patterns emerge.
>     >>>>
>     >>>> Maybe if we can get smaller, more targeted questions, we can
>     >>>> help folks better? I'm personally a big fan of answering the
>     >>>> targeted questions, because then I also know that the time spent
>     >>>> exposing that information was directly useful.
>     >>>>
>     >>>> I'm more than happy to mentor folks. But I just end up finding
>     >>>> "I want to learn" at the generic level hard to grasp onto, or to
>     >>>> turn into action. I'd love to hear more ideas from folks about
>     >>>> ways we might do that better.
>     >>>
>     >>> You and a few others have developed an expertise in this
>     >>> important skill. I am so far away from that level of expertise
>     >>> that I don’t know the questions to ask. More often than not I
>     >>> start with the console log, find something that looks
>     >>> significant, spend an hour or so tracking it down, and then have
>     >>> someone tell me that it is a red herring and the issue is really
>     >>> some other thing that they figured out very quickly by looking at
>     >>> a file I never got to.
>     >>>
>     >>> I guess what I’m looking for is some help with the patterns.
>     >>> What made you think to look in one log file versus another? Some
>     >>> of these jobs save a zillion little files; which ones are
>     >>> actually useful? What tools are you using to correlate log
>     >>> entries across all of those files? Are you doing it by hand? Is
>     >>> logstash useful for that, or is that more useful for finding
>     >>> multiple occurrences of the same issue?
>     >>>
>     >>> I realize there’s not a way to write a how-to that will live
>     >>> forever. Maybe one way to deal with that is to write up the
>     >>> research done on bugs soon after they are solved, and publish
>     >>> that to the mailing list. Even the retrospective view is useful,
>     >>> because we can all learn from it without having to live through
>     >>> it. The mailing list is a fairly ephemeral medium, and something
>     >>> very old in the archives is understood to have a good chance of
>     >>> being out of date, so we don’t have to keep adding disclaimers.
>     >>
>     >> Sure. Matt's actually working up a blog post describing the thing he
>     >> nailed earlier in the week.
>     >
>     > Yes, I appreciate that both of you are responding to my questions. :-)
>     >
>     > I have some more specific questions/comments below. Please take
>     > all of this in the spirit of trying to make this process easier by
>     > pointing out where I’ve found it hard, and not just me complaining.
>     > I’d like to work on fixing any of these things that can be fixed,
>     > by writing or reviewing patches early in Kilo.
>     >
>     >>
>     >> Here is my off-the-cuff set of guidelines:
>     >>
>     >> #1 - is it a test failure or a setup failure?
>     >>
>     >> This should be pretty easy to figure out. Test failures come at
>     >> the end of the console log and say that tests failed (after you
>     >> see a bunch of passing tempest tests).
>     >>
>     >> Always start at *the end* of files and work backwards.
>     >
>     > That’s interesting, because in my case I saw a lot of failures
>     > after the initial “real” problem. So I usually read the logs like
>     > C compiler output: assume the first error is real, and the others
>     > might have been caused by that one. Do you work from the bottom up
>     > to a point where you don’t see any more errors, instead of reading
>     > top down?
> 
>     Bottom up to get to the problems, then figure out if it's in a
>     subprocess, in which case the problem could have existed for a while
>     before surfacing. That being said, not all tools do useful things
>     like actually error when they fail (I'm looking at you, yum...) so
>     there are always edge cases here.
> 
>     >>
>     >> #2 - if it's a test failure, what API call was unsuccessful?
>     >>
>     >> Start by looking at the API logs for the service at the top
>     >> level, and see if there is a simple traceback at the right
>     >> timestamp. If not, figure out what that API call was calling out
>     >> to, and again look at the simple cases, assuming failures will
>     >> create ERRORS or TRACES (though they often don't).
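
A sketch of that timestamp hunt, with the caveat that the timestamp
format and the 23-character prefix are assumptions about how the oslo
log lines happen to look today, not a stable interface:

    from datetime import datetime, timedelta

    # Assumed line prefix: "2014-08-28 18:16:31.123 ..."
    TS_FORMAT = '%Y-%m-%d %H:%M:%S.%f'

    def errors_near(logfile, failure_time, window=5):
        """Pull ERROR/TRACE lines within +/- window seconds of the
        failed API call; that's usually enough to land on the
        traceback if there is one."""
        hits = []
        for line in open(logfile):
            try:
                ts = datetime.strptime(line[:23], TS_FORMAT)
            except ValueError:
                continue  # traceback continuation lines, etc.
            if abs(ts - failure_time) <= timedelta(seconds=window):
                if ' ERROR ' in line or ' TRACE ' in line:
                    hits.append(line)
        return hits

If that comes up empty it doesn't mean the service is innocent, it just
means it failed without saying so, which is its own bug worth filing.
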
>     >
>     > In my case, a neutron call failed. Most of the other services
>     > seem to have a *-api.log file, but neutron doesn’t. It took a
>     > little while to find the API-related messages in screen-q-svc.txt
>     > (I’m glad I’ve been around long enough to know it used to be
>     > called “quantum”). I get that screen-n-*.txt would collide with
>     > nova. Is it necessary to abbreviate those filenames at all?
> 
>     Yeah... service naming could definitely be better, especially with
>     neutron. There are implications for long names in screen, but maybe
>     we just get over it, as we already have too many tabs to fit on one
>     page in the console anyway.
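
For anyone else decoding the abbreviations, the cheat sheet in my head
looks roughly like this (partial, from memory -- trust an actual logs/
directory listing over this):

    # screen session log name -> service, for the common ones
    SCREEN_LOG_SERVICES = {
        'screen-n-api.txt': 'nova-api',
        'screen-n-cpu.txt': 'nova-compute',
        'screen-n-sch.txt': 'nova-scheduler',
        'screen-q-svc.txt': 'neutron-server (the API-ish log)',
        'screen-q-agt.txt': 'neutron L2 agent',
        'screen-g-api.txt': 'glance-api',
        'screen-c-api.txt': 'cinder-api',
        'screen-key.txt': 'keystone',
    }
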
> 
>     >> Hints on the service log order you should go after are in the
>     >> footer of every log page -
>     >> http://logs.openstack.org/76/79776/15/gate/gate-tempest-dsvm-full/700ee7e/logs/
>     >> (it's included as an Apache footer) for some services. It's been
>     >> there for about 18 months; I think people are fully blind to it
>     >> at this point.
>     >
>     > Where would I go to edit that footer to add information about the
>     > neutron log files? Is that Apache footer defined in an infra repo?
> 
>     Note the following at the end of the footer output:
> 
>     About this Help
> 
>     This help file is part of the openstack-infra/config project, and can
>     be found at modules/openstack_project/files/logs/help/tempest_logs.html.
>     The file can be updated via the standard OpenStack Gerrit Review
>     process.
> 
> 
> I took a first whack at trying to add some more information to the
> footer here: https://review.openstack.org/#/c/117390/

\o/ - you rock, Joe!

	-Sean

-- 
Sean Dague
http://dague.net


