[openstack-dev] [all] gate debugging
Doug Hellmann
doug at doughellmann.com
Thu Aug 28 17:48:06 UTC 2014
On Aug 28, 2014, at 1:17 PM, Sean Dague <sean at dague.net> wrote:
> On 08/28/2014 12:48 PM, Doug Hellmann wrote:
>>
>> On Aug 27, 2014, at 5:56 PM, Sean Dague <sean at dague.net> wrote:
>>
>>> On 08/27/2014 05:27 PM, Doug Hellmann wrote:
>>>>
>>>> On Aug 27, 2014, at 2:54 PM, Sean Dague <sean at dague.net> wrote:
>>>>
>>>>> Note: thread intentionally broken, this is really a different topic.
>>>>>
>>>>>> On 08/27/2014 02:30 PM, Doug Hellmann wrote:
>>>>>> On Aug 27, 2014, at 1:30 PM, Chris Dent <chdent at redhat.com> wrote:
>>>>>>
>>>>>>> On Wed, 27 Aug 2014, Doug Hellmann wrote:
>>>>>>>
>>>>>>>> I have found it immensely helpful, for example, to have a written set
>>>>>>>> of the steps involved in creating a new library, from importing the
>>>>>>>> git repo all the way through to making it available to other projects.
>>>>>>>> Without those instructions, it would have been much harder to split up
>>>>>>>> the work. The team would have had to train each other by word of
>>>>>>>> mouth, and we would have had constant issues with inconsistent
>>>>>>>> approaches triggering different failures. The time we spent building
>>>>>>>> and verifying the instructions has paid off to the extent that we even
>>>>>>>> had one developer not on the core team handle a graduation for us.
>>>>>>>
>>>>>>> +many more for the relatively simple act of just writing stuff down
>>>>>>
>>>>>> "Write it down." is my theme for Kilo.
>>>>>
>>>>> I definitely get the sentiment. "Write it down" is also hard when you
>>>>> are talking about things that do change around quite a bit. OpenStack as
>>>>> a whole sees 250 - 500 changes a week, so the interaction pattern moves
>>>>> around enough that it's really easy to have *very* stale information
>>>>> written down. Stale information is sometimes even more dangerous than
>>>>> no information, as it takes people down very wrong paths.
>>>>>
>>>>> I think we break down on communication when we get into a conversation
>>>>> of "I want to learn gate debugging" because I don't quite know what that
>>>>> means, or where the starting point of understanding is. So those
>>>>> intentions are well meaning, but tend to stall. The reality is there
>>>>> was no road map for those of us who dove in; it's just understanding
>>>>> how OpenStack holds together as a whole and where some of the high risk
>>>>> parts are. And a lot of that comes with days staring at code and logs
>>>>> until patterns emerge.
>>>>>
>>>>> Maybe if we can get smaller more targeted questions, we can help folks
>>>>> better? I'm personally a big fan of answering the targeted questions
>>>>> because then I also know that the time spent exposing that information
>>>>> was directly useful.
>>>>>
>>>>> I'm more than happy to mentor folks. But I just end up finding the "I
>>>>> want to learn" at the generic level something that's hard to grasp onto
>>>>> or figure out how we turn it into action. I'd love to hear more ideas
>>>>> from folks about ways we might do that better.
>>>>
>>>> You and a few others have developed an expertise in this important skill. I am so far away from that level of expertise that I don’t know the questions to ask. More often than not I start with the console log, find something that looks significant, spend an hour or so tracking it down, and then have someone tell me that it is a red herring and the issue is really some other thing that they figured out very quickly by looking at a file I never got to.
>>>>
>>>> I guess what I’m looking for is some help with the patterns. What made you think to look in one log file versus another? Some of these jobs save a zillion little files, which ones are actually useful? What tools are you using to correlate log entries across all of those files? Are you doing it by hand? Is logstash useful for that, or is that more useful for finding multiple occurrences of the same issue?
>>>>
>>>> I realize there’s not a way to write a how-to that will live forever. Maybe one way to deal with that is to write up the research done on bugs soon after they are solved, and publish that to the mailing list. Even the retrospective view is useful because we can all learn from it without having to live through it. The mailing list is a fairly ephemeral medium, and something very old in the archives is understood to have a good chance of being out of date so we don’t have to keep adding disclaimers.
>>>
>>> Sure. Matt's actually working up a blog post describing the thing he
>>> nailed earlier in the week.
>>
>> Yes, I appreciate that both of you are responding to my questions. :-)
>>
>> I have some more specific questions/comments below. Please take all of this in the spirit of trying to make this process easier by pointing out where I’ve found it hard, and not just me complaining. I’d like to work on fixing any of these things that can be fixed, by writing or reviewing patches for early in kilo.
>>
>>>
>>> Here is my off the cuff set of guidelines:
>>>
>>> #1 - is it a test failure or a setup failure
>>>
>>> This should be pretty easy to figure out. Test failures come at the end
>>> of console log and say that tests failed (after you see a bunch of
>>> passing tempest tests).
>>>
>>> Always start at *the end* of files and work backwards.
>>
>> That’s interesting because in my case I saw a lot of failures after the initial “real” problem. So I usually read the logs like C compiler output: Assume the first error is real, and the others might have been caused by that one. Do you work from the bottom up to a point where you don’t see any more errors instead of reading top down?
>
> Bottom up to get to the problems, then figure out if it's in a
> subprocess, in which case the problem could have existed for a while.
> That being said, not all tools do useful things like actually error when
> they fail (I'm looking at you, yum...) so there are always edge cases here.
>
>>>
>>> #2 - if it's a test failure, what API call was unsuccessful.
>>>
>>> Start with looking at the API logs for the service at the top level, and
>>> see if there is a simple traceback at the right timestamp. If not,
>>> figure out what that API call was calling out to, again look at the
>>> simple cases assuming failures will create ERRORS or TRACES (though they
>>> often don't).
>>
>> In my case, a neutron call failed. Most of the other services seem to have a *-api.log file, but neutron doesn’t. It took a little while to find the API-related messages in screen-q-svc.txt (I’m glad I’ve been around long enough to know it used to be called “quantum”). I get that screen-n-*.txt would collide with nova. Is it necessary to abbreviate those filenames at all?
>
> Yeh... service naming could definitely be better, especially with
> neutron. There are implications for long names in screen, but maybe we
> just get over it, since we already have too many tabs to fit on one page
> in the console anyway.
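Getting back to step #2, the by-hand timestamp correlation can be sketched in shell. The timestamp value and filenames below are illustrative; this assumes the logs are downloaded locally and share the usual leading 'YYYY-MM-DD HH:MM:SS' timestamp format:

```shell
# Timestamp of the failing API call, taken from the test failure
# in the console log (illustrative value).
ts='2014-08-28 17:12'

# Scan each service log for ERROR/TRACE lines in that minute.
for log in screen-n-api.txt screen-q-svc.txt; do
    echo "== $log =="
    grep "^$ts" "$log" | grep -E ' ERROR | TRACE ' || true
done
```

The `|| true` keeps the loop going when a log has nothing at that timestamp, which is itself useful information about which service was involved.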
>
>>> Hints on the service log order you should go after are in the footer
>>> of every log page -
>>> http://logs.openstack.org/76/79776/15/gate/gate-tempest-dsvm-full/700ee7e/logs/
>>> (it's included as an Apache footer) for some services. It's been there
>>> for about 18 months, I think people are fully blind to it at this point.
>>
>> Where would I go to edit that footer to add information about the neutron log files? Is that Apache footer defined in an infra repo?
>
> Note the following at the end of the footer output:
>
> About this Help
>
> This help file is part of the openstack-infra/config project, and can be
> found at modules/openstack_project/files/logs/help/tempest_logs.html .
> The file can be updated via the standard OpenStack Gerrit Review process.
/me smacks forehead
>
>> Another specific issue I’ve seen is a message that says something to the effect “the setup for this job failed, check the appropriate log”. I found 2 files with “setup” in the name, but the failure was actually logged in a different file (devstacklog.txt). Is the job definition too far “removed” from the scripts to know what the real filename is? Is it running scripts that log to multiple files during the setup phase, and so it doesn’t know which to refer me to? Or maybe I overlooked a message about when logging to a specific file started.
>
> Part of the issue here is that devstack-gate runs a lot of different
> gate_hooks. So that's about as specific as we can get unless you can
> figure out how to introspect that info in bash... which I couldn’t.
Are all of the hooks logging to the same file? If not, why not? Would it make sense to change that so the error messages could be more specific?
>
>>> If nothing jumps out at ERROR or TRACE, go back to DEBUG level and
>>> figure out what's happening at the time of failure, especially keeping
>>> an eye out of areas where other workers are doing interesting things at
>>> the same time, possibly indicating state corruption in OpenStack as a race.
>>>
>>> #3 - if it's a console failure, start at the end and work backwards
>>>
>>> devstack and grenade run under set -o errexit so that they will
>>> critically exit if a command fails. They will typically dump some debug
>>> when they do that. So the failing command won't be the last line in the
>>> file, but it will be close. The word 'error' typically isn't useful at
>>> all in shell, because lots of things say error when they aren't errors;
>>> we mask their exit codes if their failure is generally irrelevant.
>>>
>>> #4 - general principle the closer to root cause the better
>>>
>>> If we think of exposure of bugs as layers we probably end up
>>> with something like this
>>>
>>> - Console log
>>> - Test Name + Failure
>>> - Failure inside an API service
>>> - Failure inside a worker process
>>> - Actual failure figured out in OpenStack code path
>>> - Failure in something below OpenStack (kernel, libvirt)
>>>
>>> This is why signatures that are just test names aren't all that useful
>>> much of the time (and why we try not to add those to ER), as that's
>>> going to be hitting an API, but the why of things is very much still
>>> undiscovered.
>>>
>>> #5 - if it's an infrastructure-level setup bug (failing to download or
>>> install something) figure out if there are other similar events at the
>>> same time (i.e. it's a network issue, which we can't fix) vs. a
>>> structural issue.
>>>
>>>
>>> I find Elastic Search good for step 5, but realistically for all other
>>> steps it's manual log sifting. I open lots of tabs in Chrome, and search
>>> by timestamp.
>>
>> This feels like something we could improve on. If we had a tool to download the logs and interleave the messages using their timestamps, would that make it easier? We could probably make the job log everything to a single file, but I can see where sometimes only having part of the data to look at would be more useful.
>
> Maybe, I find the ability to change the filtering level dynamically to
> be pretty important. We actually did some of this once when we used
> syslog. Personally I found it a ton harder to get to the bottom of things.
>
> A gate run has 25+ services running, and it's a rare issue that requires
> tracing interactions between more than 4 of them to get to a solution.
> So I expect you'd exchange context jumping for tons of irrelevancy.
> That's a personal opinion based on personal workflow, and why I never
> spent time on it.
> Instead I built os-loganalyze that does the filtering and coloring
> dynamically on the server side, as it was a zero install solution that
> provided additional benefits of being able to link to a timestamp in a
> log for sharing purposes.
Sure, that makes sense.
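For what it's worth, a crude offline version of that interleaving is possible with sort once each line is tagged with its source file. The filenames are illustrative, and this assumes the logs share a leading 'YYYY-MM-DD HH:MM:SS' timestamp:

```shell
# Tag every line with its source file, then sort on the timestamp
# (fields 2 and 3 after tagging) so messages from all services
# interleave in time order while keeping their origin visible.
for log in screen-n-api.txt screen-q-svc.txt; do
    sed "s|^|$log: |" "$log"
done | sort -k2,3 > interleaved.txt
```

This trades the per-file filtering Sean mentions for a single timeline, so it is probably most useful for a narrow time window rather than a whole run.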
>
>>
>>>
>>>
>>> A big part of the experience also just comes from a manual Bayesian
>>> filter. Certain scary-looking things in the console log aren't, but you
>>> don't know that unless you look at setup logs enough (either in gate or
>>> in your own devstacks) to realize that. Sanitizing the output of that
>>> part of the process is pretty intractable... because shell (though I've
>>> put some serious effort into it over the last 6 months).
>>
>> Maybe our scripts can emit messages to explain the scary stuff? “This is going to report a problem, but you can ignore it unless X happens.”?
>
> Maybe, like I said it's a lot better than it used to be. But very few
> people are putting in effort here, and I'm not convinced it's really
> solvable in bash.
OK, well, if the answers to these questions are “yes” then I should have time to help, which is why I’m exploring options.
>
>>> Sanitizing the OpenStack logs to be crisp about actual things going
>>> wrong, vs. not, shouldn't be intractable, but it sometimes feels like
>>> it is. Which is why all operators run at DEBUG level. The thing that
>>> makes it hard for developers to see the issues here is the same thing
>>> that makes it *really* hard for operators to figure out failures. It's
>>> also why I tried (though executed poorly on, sorry about that) getting
>>> log cleanups rolling this cycle.
>>
>> I would like to have the TC back an official cross-project effort to clean up the logs for Kilo, and get all of the integrated projects to commit to working on it as a priority.
>>
>> Doug
>>
>>>
>>> -Sean
>>>
>>> --
>>> Sean Dague
>>> http://dague.net
>>>
>>> _______________________________________________
>>> OpenStack-dev mailing list
>>> OpenStack-dev at lists.openstack.org
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>>
>
>