[openstack-dev] [all] gate debugging

Doug Hellmann doug at doughellmann.com
Thu Aug 28 19:41:06 UTC 2014


On Aug 28, 2014, at 2:16 PM, Sean Dague <sean at dague.net> wrote:

> On 08/28/2014 02:07 PM, Joe Gordon wrote:
>> 
>> 
>> 
>> On Thu, Aug 28, 2014 at 10:17 AM, Sean Dague <sean at dague.net> wrote:
>> 
>>    On 08/28/2014 12:48 PM, Doug Hellmann wrote:
>>> 
>>> On Aug 27, 2014, at 5:56 PM, Sean Dague <sean at dague.net> wrote:
>>> 
>>>> On 08/27/2014 05:27 PM, Doug Hellmann wrote:
>>>>> 
>>>>> On Aug 27, 2014, at 2:54 PM, Sean Dague <sean at dague.net> wrote:
>>>>> 
>>>>>> Note: thread intentionally broken, this is really a different
>>    topic.
>>>>>> 
>>>>>> On 08/27/2014 02:30 PM, Doug Hellmann wrote:
>>>>>>> On Aug 27, 2014, at 1:30 PM, Chris Dent <chdent at redhat.com> wrote:
>>>>>>> 
>>>>>>>> On Wed, 27 Aug 2014, Doug Hellmann wrote:
>>>>>>>> 
>>>>>>>>> I have found it immensely helpful, for example, to have a
>>    written set
>>>>>>>>> of the steps involved in creating a new library, from
>>    importing the
>>>>>>>>> git repo all the way through to making it available to other
>>    projects.
>>>>>>>>> Without those instructions, it would have been much harder
>>    to split up
>>>>>>>>> the work. The team would have had to train each other by word of
>>>>>>>>> mouth, and we would have had constant issues with inconsistent
>>>>>>>>> approaches triggering different failures. The time we spent
>>    building
>>>>>>>>> and verifying the instructions has paid off to the extent
>>    that we even
>>>>>>>>> had one developer not on the core team handle a graduation
>>    for us.
>>>>>>>> 
>>>>>>>> +many more for the relatively simple act of just writing
>>    stuff down
>>>>>>> 
>>>>>>> "Write it down.” is my theme for Kilo.
>>>>>> 
>>>>>> I definitely get the sentiment. "Write it down" is also hard
>>    when you
>>>>>> are talking about things that do change around quite a bit.
>>    OpenStack as
>>>>>> a whole sees 250 - 500 changes a week, so the interaction
>>    pattern moves
>>>>>> around enough that it's really easy to have *very* stale
>>    information
>>>>>> written down. Stale information is even more dangerous than no
>>>>>> information sometimes, as it takes people down very wrong paths.
>>>>>> 
>>>>>> I think we break down on communication when we get into a
>>    conversation
>>>>>> of "I want to learn gate debugging" because I don't quite know
>>    what that
>>>>>> means, or where the starting point of understanding is. So those
>>>>>> intentions are well meaning, but tend to stall. The reality was
>>    there
>>>>>> was no road map for those of us who dive in; it's just
>>    understanding
>>>>>> how OpenStack holds together as a whole and where some of the
>>    high risk
>>>>>> parts are. And a lot of that comes with days staring at code
>>    and logs
>>>>>> until patterns emerge.
>>>>>> 
>>>>>> Maybe if we can get smaller, more targeted questions, we can
>>    help folks
>>>>>> better? I'm personally a big fan of answering the targeted
>>    questions
>>>>>> because then I also know that the time spent exposing that
>>    information
>>>>>> was directly useful.
>>>>>> 
>>>>>> I'm more than happy to mentor folks. But I just end up finding
>>    the "I
>>>>>> want to learn" at the generic level something that's hard to
>>    grasp onto
>>>>>> or figure out how we turn it into action. I'd love to hear more
>>    ideas
>>>>>> from folks about ways we might do that better.
>>>>> 
>>>>> You and a few others have developed an expertise in this
>>    important skill. I am so far away from that level of expertise that
>>    I don’t know the questions to ask. More often than not I start with
>>    the console log, find something that looks significant, spend an
>>    hour or so tracking it down, and then have someone tell me that it
>>    is a red herring and the issue is really some other thing that they
>>    figured out very quickly by looking at a file I never got to.
>>>>> 
>>>>> I guess what I’m looking for is some help with the patterns.
>>    What made you think to look in one log file versus another? Some of
>>    these jobs save a zillion little files; which ones are actually
>>    useful? What tools are you using to correlate log entries across all
>>    of those files? Are you doing it by hand? Is logstash useful for
>>    that, or is that more useful for finding multiple occurrences of the
>>    same issue?
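
(For what it's worth, the crude "by hand" correlation I had in mind is
something like the sketch below: interleave lines from several service logs
by their leading oslo-style timestamp. The filenames and the timestamp
pattern are assumptions about the usual devstack log layout, and lines
without a timestamp, like traceback continuations, are simply skipped.)

    import re
    from heapq import merge

    # Oslo-style leading timestamp, e.g. "2014-08-28 19:41:06.123"
    TS_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+")

    def stamped(path):
        # Yield (timestamp, filename, line) for lines carrying a timestamp;
        # continuation lines (e.g. tracebacks) are skipped in this sketch.
        with open(path, errors="replace") as f:
            for line in f:
                m = TS_RE.match(line)
                if m:
                    yield m.group(0), path, line.rstrip()

    def interleave(paths):
        # Each log is already roughly time-ordered, so heapq.merge gives a
        # single stream sorted by timestamp across all of the files.
        for ts, path, line in merge(*(stamped(p) for p in paths)):
            print(path + ": " + line)

    # e.g. interleave(["logs/screen-q-svc.txt", "logs/screen-n-cpu.txt"])
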
>>>>> 
>>>>> I realize there’s not a way to write a how-to that will live
>>    forever. Maybe one way to deal with that is to write up the research
>>    done on bugs soon after they are solved, and publish that to the
>>    mailing list. Even the retrospective view is useful because we can
>>    all learn from it without having to live through it. The mailing
>>    list is a fairly ephemeral medium, and something very old in the
>>    archives is understood to have a good chance of being out of date so
>>    we don’t have to keep adding disclaimers.
>>>> 
>>>> Sure. Matt's actually working up a blog post describing the thing he
>>>> nailed earlier in the week.
>>> 
>>> Yes, I appreciate that both of you are responding to my questions. :-)
>>> 
>>> I have some more specific questions/comments below. Please take
>>    all of this in the spirit of trying to make this process easier by
>>    pointing out where I’ve found it hard, and not just me complaining.
>>    I’d like to work on fixing any of these things that can be fixed, by
>>    writing or reviewing patches early in Kilo.
>>> 
>>>> 
>>>> Here is my off-the-cuff set of guidelines:
>>>> 
>>>> #1 - is it a test failure or a setup failure
>>>> 
>>>> This should be pretty easy to figure out. Test failures come at
>>    the end
>>>> of the console log and say that tests failed (after you see a bunch of
>>>> passing tempest tests).
>>>> 
>>>> Always start at *the end* of files and work backwards.
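
(Just to check that I follow #1, here is a rough sketch of that first look,
assuming a downloaded plain-text console log; the marker strings are
illustrative guesses rather than exact matches for any particular job.)

    def classify_console_log(path, tail_lines=200):
        # Guess "test failure" vs "setup failure" from the tail of the log.
        with open(path, errors="replace") as f:
            tail = "".join(f.readlines()[-tail_lines:])
        # Illustrative markers only: a tempest result summary near the end
        # means the tests actually ran; no summary usually means the job
        # died somewhere in setup instead.
        if "tempest" in tail.lower() and "FAILED" in tail:
            return "test failure"
        return "setup failure"
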
>>> 
>>> That’s interesting because in my case I saw a lot of failures
>>    after the initial “real” problem. So I usually read the logs like C
>>    compiler output: Assume the first error is real, and the others
>>    might have been caused by that one. Do you work from the bottom up
>>    to a point where you don’t see any more errors instead of reading
>>    top down?
>> 
>>    Bottom up to get to the problems, then figure out whether the failure
>>    is in a subprocess, in which case the problem could have started a
>>    while earlier. That being said, not all tools do useful things like
>>    actually error out when they fail (I'm looking at you, yum...), so
>>    there are always edge cases here.
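
(So the bottom-up pass is roughly: walk the file backwards, stop at the last
block of error-ish lines, and read upward from there. A sketch of that,
with an illustrative pattern that, as you say, will miss tools that fail
without actually erroring:)

    import re

    ERROR_RE = re.compile(r"\b(ERROR|TRACE|CRITICAL)\b")  # illustrative only

    def last_errors(path, limit=10):
        # Scan the file from the end and return the last error-ish lines,
        # oldest first, which is usually where the interesting failure is.
        hits = []
        with open(path, errors="replace") as f:
            for line in reversed(f.readlines()):
                if ERROR_RE.search(line):
                    hits.append(line.rstrip())
                    if len(hits) >= limit:
                        break
        return hits[::-1]
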
>> 
>>>> 
>>>> #2 - if it's a test failure, what API call was unsuccessful.
>>>> 
>>>> Start by looking at the API logs for the service at the top
>>    level, and
>>>> see if there is a simple traceback at the right timestamp. If not,
>>>> figure out what that API call was calling out to; again, look at the
>>>> simple cases assuming failures will create ERRORS or TRACES
>>    (though they
>>>> often don't).
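
(Sketching #2 out to be sure I have it: take the timestamp of the failed
API call from the tempest log, then pull ERROR/TRACE lines from the
service's log within a small window around it. The timestamp format below
is the usual oslo one, and the window size is an arbitrary guess.)

    import re
    from datetime import datetime, timedelta

    TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

    def errors_near(path, when, window=timedelta(seconds=30)):
        # Return ERROR/TRACE lines whose timestamp falls within +/- window
        # of 'when', e.g. errors_near("screen-q-svc.txt", "2014-08-28 19:41:06")
        target = datetime.strptime(when, "%Y-%m-%d %H:%M:%S")
        hits = []
        with open(path, errors="replace") as f:
            for line in f:
                m = TS_RE.match(line)
                if not m or ("ERROR" not in line and " TRACE " not in line):
                    continue
                stamp = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
                if abs(stamp - target) <= window:
                    hits.append(line.rstrip())
        return hits
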
>>> 
>>> In my case, a neutron call failed. Most of the other services seem
>>    to have a *-api.log file, but neutron doesn’t. It took a little
>>    while to find the API-related messages in screen-q-svc.txt (I’m glad
>>    I’ve been around long enough to know it used to be called
>>    “quantum”). I get that screen-n-*.txt would collide with nova. Is it
>>    necessary to abbreviate those filenames at all?
>> 
>>    Yeah... service naming could definitely be better, especially with
>>    neutron. There are implications for long names in screen, but maybe we
>>    just get over it, as we already have too many tabs to fit on one page
>>    of the console anymore anyway.
>> 
>>>> Hints on the service log order you should go after are in the footer
>>>> of every log page -
>>>> 
>>    http://logs.openstack.org/76/79776/15/gate/gate-tempest-dsvm-full/700ee7e/logs/
>>>> (it's included as an Apache footer) for some services. It's been
>>    there
>>>> for about 18 months; I think people are fully blind to it at this
>>    point.
>>> 
>>> Where would I go to edit that footer to add information about the
>>    neutron log files? Is that Apache footer defined in an infra repo?
>> 
>>    Note the following at the end of the footer output:
>> 
>>    About this Help
>> 
>>    This help file is part of the openstack-infra/config project, and can be
>>    found at modules/openstack_project/files/logs/help/tempest_logs.html .
>>    The file can be updated via the standard OpenStack Gerrit Review
>>    process.
>> 
>> 
>> I took a first whack at trying to add some more information to the
>> footer here: https://review.openstack.org/#/c/117390/
> 
> \o/ - you rock, Joe!

+1!!

Doug



