[openstack-dev] [all] gate debugging
Doug Hellmann
doug at doughellmann.com
Thu Aug 28 19:41:06 UTC 2014
On Aug 28, 2014, at 2:16 PM, Sean Dague <sean at dague.net> wrote:
> On 08/28/2014 02:07 PM, Joe Gordon wrote:
>>
>> On Thu, Aug 28, 2014 at 10:17 AM, Sean Dague <sean at dague.net> wrote:
>>
>> On 08/28/2014 12:48 PM, Doug Hellmann wrote:
>>>
>>> On Aug 27, 2014, at 5:56 PM, Sean Dague <sean at dague.net> wrote:
>>>
>>>> On 08/27/2014 05:27 PM, Doug Hellmann wrote:
>>>>>
>>>>> On Aug 27, 2014, at 2:54 PM, Sean Dague <sean at dague.net> wrote:
>>>>>
>>>>>> Note: thread intentionally broken, this is really a different
>> topic.
>>>>>>
>>>>>> On 08/27/2014 02:30 PM, Doug Hellmann wrote:
>>>>>>> On Aug 27, 2014, at 1:30 PM, Chris Dent <chdent at redhat.com> wrote:
>>>>>>>
>>>>>>>> On Wed, 27 Aug 2014, Doug Hellmann wrote:
>>>>>>>>
>>>>>>>>> I have found it immensely helpful, for example, to have a
>>>>>>>>> written set of the steps involved in creating a new library,
>>>>>>>>> from importing the git repo all the way through to making it
>>>>>>>>> available to other projects. Without those instructions, it
>>>>>>>>> would have been much harder to split up the work. The team
>>>>>>>>> would have had to train each other by word of mouth, and we
>>>>>>>>> would have had constant issues with inconsistent approaches
>>>>>>>>> triggering different failures. The time we spent building and
>>>>>>>>> verifying the instructions has paid off to the extent that we
>>>>>>>>> even had one developer not on the core team handle a
>>>>>>>>> graduation for us.
>>>>>>>>
>>>>>>>> +many more for the relatively simple act of just writing
>>>>>>>> stuff down
>>>>>>>
>>>>>>> "Write it down.” is my theme for Kilo.
>>>>>>
>>>>>> I definitely get the sentiment. "Write it down" is also hard
>>>>>> when you are talking about things that do change around quite a
>>>>>> bit. OpenStack as a whole sees 250-500 changes a week, so the
>>>>>> interaction pattern moves around enough that it's really easy to
>>>>>> have *very* stale information written down. Stale information is
>>>>>> sometimes even more dangerous than no information, as it takes
>>>>>> people down very wrong paths.
>>>>>>
>>>>>> I think we break down on communication when we get into a
>>>>>> conversation of "I want to learn gate debugging" because I don't
>>>>>> quite know what that means, or where the starting point of
>>>>>> understanding is. So those intentions are well-meaning, but tend
>>>>>> to stall. The reality was there was no road map for those of us
>>>>>> who dove in; it's just understanding how OpenStack holds
>>>>>> together as a whole and where some of the high-risk parts are.
>>>>>> And a lot of that comes from days staring at code and logs until
>>>>>> patterns emerge.
>>>>>>
>>>>>> Maybe if we can get smaller, more targeted questions, we can
>>>>>> help folks better? I'm personally a big fan of answering
>>>>>> targeted questions, because then I also know that the time spent
>>>>>> exposing that information was directly useful.
>>>>>>
>>>>>> I'm more than happy to mentor folks. But I find that "I want to
>>>>>> learn" at the generic level is hard to grasp onto or turn into
>>>>>> action. I'd love to hear more ideas from folks about ways we
>>>>>> might do that better.
>>>>>
>>>>> You and a few others have developed an expertise in this
>>>>> important skill. I am so far away from that level of expertise
>>>>> that I don’t know the questions to ask. More often than not I
>>>>> start with the console log, find something that looks
>>>>> significant, spend an hour or so tracking it down, and then have
>>>>> someone tell me that it is a red herring; the real issue is some
>>>>> other thing that they figured out very quickly by looking at a
>>>>> file I never got to.
>>>>>
>>>>> I guess what I’m looking for is some help with the patterns.
>>>>> What made you think to look in one log file versus another? Some
>>>>> of these jobs save a zillion little files; which ones are
>>>>> actually useful? What tools are you using to correlate log
>>>>> entries across all of those files? Are you doing it by hand? Is
>>>>> logstash useful for that, or is it more useful for finding
>>>>> multiple occurrences of the same issue?
>>>>>
>>>>> I realize there’s not a way to write a how-to that will live
>>>>> forever. Maybe one way to deal with that is to write up the
>>>>> research done on bugs soon after they are solved, and publish
>>>>> that to the mailing list. Even the retrospective view is useful,
>>>>> because we can all learn from it without having to live through
>>>>> it. The mailing list is a fairly ephemeral medium, and something
>>>>> very old in the archives is understood to have a good chance of
>>>>> being out of date, so we don’t have to keep adding disclaimers.
>>>>
>>>> Sure. Matt's actually working up a blog post describing the thing he
>>>> nailed earlier in the week.
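>>>>
>>>> On the logstash/correlation question, the one pattern I'd write
>>>> down today: most service log lines carry the request id in
>>>> brackets (req-<uuid>), so you can stitch one request's story
>>>> together across files with something as dumb as this. (An
>>>> untested sketch: the file glob and the 23-character timestamp
>>>> prefix are just what the current oslo log format happens to look
>>>> like, not anything guaranteed.)
>>>>
>>>> import glob
>>>>
>>>> def trace_request(req_id, pattern='screen-*.txt'):
>>>>     rows = []
>>>>     for path in glob.glob(pattern):
>>>>         with open(path, errors='replace') as f:
>>>>             for line in f:
>>>>                 if req_id in line:
>>>>                     # oslo timestamps ("2014-08-28 19:41:06.123")
>>>>                     # sort lexically, so the prefix is enough
>>>>                     rows.append((line[:23], path, line.rstrip()))
>>>>     for _, path, line in sorted(rows):
>>>>         print('%s: %s' % (path, line))
>>>>
>>>> trace_request('req-<uuid of the failed call>')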
>>>
>>> Yes, I appreciate that both of you are responding to my questions. :-)
>>>
>>> I have some more specific questions/comments below. Please take
>>> all of this in the spirit of trying to make this process easier by
>>> pointing out where I’ve found it hard, and not just me complaining.
>>> I’d like to work on fixing any of these things that can be fixed,
>>> by writing or reviewing patches early in Kilo.
>>>
>>>>
>>>> Here is my off-the-cuff set of guidelines:
>>>>
>>>> #1 - is it a test failure or a setup failure
>>>>
>>>> This should be pretty easy to figure out. Test failures come at
>>>> the end of the console log and say that tests failed (after you
>>>> see a bunch of passing tempest tests).
>>>>
>>>> Always start at *the end* of files and work backwards.
>>>
>>> That’s interesting, because in my case I saw a lot of failures
>>> after the initial “real” problem. So I usually read the logs like C
>>> compiler output: assume the first error is real, and the others
>>> might have been caused by that one. Do you work from the bottom up
>>> to a point where you don’t see any more errors, instead of reading
>>> top down?
>>
>> Bottom up to get to the problems, then figure out if it's in a
>> subprocess, in which case the problem could have started a while
>> earlier. That being said, not all tools do useful things like
>> actually error when they fail (I'm looking at you, yum....), so
>> there are always edge cases here.
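>>
>> As a rough sketch of that bottom-up pass (untested, and the
>> filename and failure markers here are only illustrative, not
>> anything the gate guarantees):
>>
>> # Walk the console log from the end and show the last block of
>> # failure-looking lines, which is usually the tempest summary.
>> FAILURE_MARKERS = ('FAIL', 'ERROR', 'Traceback')
>>
>> def last_failure(path, context=20):
>>     with open(path, errors='replace') as f:
>>         lines = f.readlines()
>>     # Bottom up: the summary at the end points at the real
>>     # failure; noise further up is often just fallout from it.
>>     for i in range(len(lines) - 1, -1, -1):
>>         if any(m in lines[i] for m in FAILURE_MARKERS):
>>             return lines[max(0, i - context):i + 1]
>>     return []
>>
>> for line in last_failure('console.txt'):
>>     print(line, end='')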
>>
>>>>
>>>> #2 - if it's a test failure, what API call was unsuccessful
>>>>
>>>> Start by looking at the API logs for the service at the top
>>>> level, and see if there is a simple traceback at the right
>>>> timestamp. If not, figure out what that API call was calling out
>>>> to, and again look at the simple cases, assuming failures will
>>>> create ERRORs or TRACEs (though they often don't).
>>>
>>> In my case, a neutron call failed. Most of the other services seem
>>> to have a *-api.log file, but neutron doesn’t. It took a little
>>> while to find the API-related messages in screen-q-svc.txt (I’m
>>> glad I’ve been around long enough to know it used to be called
>>> “quantum”). I get that screen-n-*.txt would collide with nova. Is
>>> it necessary to abbreviate those filenames at all?
>>
>> Yeah... service naming could definitely be better, especially with
>> neutron. There are implications for long names in screen, but maybe
>> we just get over it, as we already have too many tabs to fit on one
>> page in the console anymore anyway.
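>>
>> Until the names improve, brute force works fine once you know
>> roughly when the failed call happened. (Again an untested sketch;
>> the timestamp prefix just mirrors the current oslo log format.)
>>
>> # Show ERROR/TRACE lines in a service log near a timestamp prefix,
>> # e.g. "2014-08-28 19:41" for one-minute granularity.
>> def errors_near(path, stamp_prefix):
>>     with open(path, errors='replace') as f:
>>         for line in f:
>>             if line.startswith(stamp_prefix) and \
>>                     (' ERROR ' in line or ' TRACE ' in line):
>>                 print(line, end='')
>>
>> errors_near('screen-q-svc.txt', '2014-08-28 19:41')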
>>
>>>> Hints on the service log order you should go after are in the
>>>> footer of every log page -
>>>> http://logs.openstack.org/76/79776/15/gate/gate-tempest-dsvm-full/700ee7e/logs/
>>>> (it's included as an Apache footer) for some services. It's been
>>>> there for about 18 months; I think people are fully blind to it
>>>> at this point.
>>>
>>> Where would I go to edit that footer to add information about the
>>> neutron log files? Is that Apache footer defined in an infra repo?
>>
>> Note the following at the end of the footer output:
>>
>> About this Help
>>
>> This help file is part of the openstack-infra/config project, and can be
>> found at modules/openstack_project/files/logs/help/tempest_logs.html.
>> The file can be updated via the standard OpenStack Gerrit Review
>> process.
>>
>>
>> I took a first whack at trying to add some more information to the
>> footer here: https://review.openstack.org/#/c/117390/
>
> \o/ - you rock joe!
+1!!
Doug