[openstack-dev] [all] gate debugging
Doug Hellmann
doug at doughellmann.com
Thu Aug 28 19:40:43 UTC 2014
On Aug 28, 2014, at 2:15 PM, Sean Dague <sean at dague.net> wrote:
> On 08/28/2014 01:48 PM, Doug Hellmann wrote:
>>
>> On Aug 28, 2014, at 1:17 PM, Sean Dague <sean at dague.net> wrote:
>>
>>> On 08/28/2014 12:48 PM, Doug Hellmann wrote:
>>>>
>>>> On Aug 27, 2014, at 5:56 PM, Sean Dague <sean at dague.net> wrote:
>>>>
>>>>> On 08/27/2014 05:27 PM, Doug Hellmann wrote:
>>>>>>
>>>>>> On Aug 27, 2014, at 2:54 PM, Sean Dague <sean at dague.net> wrote:
>>>>>>
>>>>>>> Note: thread intentionally broken, this is really a different topic.
>>>>>>>
>>>>>>> On 08/27/2014 02:30 PM, Doug Hellmann wrote:>
>>>>>>>> On Aug 27, 2014, at 1:30 PM, Chris Dent <chdent at redhat.com> wrote:
>>>>>>>>
>>>>>>>>> On Wed, 27 Aug 2014, Doug Hellmann wrote:
>>>>>>>>>
>>>>>>>>>> I have found it immensely helpful, for example, to have a written set
>>>>>>>>>> of the steps involved in creating a new library, from importing the
>>>>>>>>>> git repo all the way through to making it available to other projects.
>>>>>>>>>> Without those instructions, it would have been much harder to split up
>>>>>>>>>> the work. The team would have had to train each other by word of
>>>>>>>>>> mouth, and we would have had constant issues with inconsistent
>>>>>>>>>> approaches triggering different failures. The time we spent building
>>>>>>>>>> and verifying the instructions has paid off to the extent that we even
>>>>>>>>>> had one developer not on the core team handle a graduation for us.
>>>>>>>>>
>>>>>>>>> +many more for the relatively simple act of just writing stuff down
>>>>>>>>
>>>>>>>> "Write it down.” is my theme for Kilo.
>>>>>>>
>>>>>>> I definitely get the sentiment. "Write it down" is also hard when you
>>>>>>> are talking about things that do change around quite a bit. OpenStack as
>>>>>>> a whole sees 250 - 500 changes a week, so the interaction pattern moves
>>>>>>> around enough that it's really easy to have *very* stale information
>>>>>>> written down. Stale information is even more dangerous than no
>>>>>>> information some times, as it takes people down very wrong paths.
>>>>>>>
>>>>>>> I think we break down on communication when we get into a conversation
>>>>>>> of "I want to learn gate debugging" because I don't quite know what that
>>>>>>> means, or where the starting point of understanding is. So those
>>>>>>> intentions are well meaning, but tend to stall. The reality was there
>>>>>>> was no road map for those of us that dive in, it's just understanding
>>>>>>> how OpenStack holds together as a whole and where some of the high risk
>>>>>>> parts are. And a lot of that comes with days staring at code and logs
>>>>>>> until patterns emerge.
>>>>>>>
>>>>>>> Maybe if we can get smaller more targeted questions, we can help folks
>>>>>>> better? I'm personally a big fan of answering the targeted questions
>>>>>>> because then I also know that the time spent exposing that information
>>>>>>> was directly useful.
>>>>>>>
>>>>>>> I'm more than happy to mentor folks. But I just end up finding the "I
>>>>>>> want to learn" at the generic level something that's hard to grasp onto
>>>>>>> or figure out how we turn it into action. I'd love to hear more ideas
>>>>>>> from folks about ways we might do that better.
>>>>>>
>>>>>> You and a few others have developed an expertise in this important skill. I am so far away from that level of expertise that I don’t know the questions to ask. More often than not I start with the console log, find something that looks significant, spend an hour or so tracking it down, and then have someone tell me that it is a red herring and the issue is really some other thing that they figured out very quickly by looking at a file I never got to.
>>>>>>
>>>>>> I guess what I’m looking for is some help with the patterns. What made you think to look in one log file versus another? Some of these jobs save a zillion little files, which ones are actually useful? What tools are you using to correlate log entries across all of those files? Are you doing it by hand? Is logstash useful for that, or is that more useful for finding multiple occurrences of the same issue?
>>>>>>
>>>>>> I realize there’s not a way to write a how-to that will live forever. Maybe one way to deal with that is to write up the research done on bugs soon after they are solved, and publish that to the mailing list. Even the retrospective view is useful because we can all learn from it without having to live through it. The mailing list is a fairly ephemeral medium, and something very old in the archives is understood to have a good chance of being out of date so we don’t have to keep adding disclaimers.
>>>>>
>>>>> Sure. Matt's actually working up a blog post describing the thing he
>>>>> nailed earlier in the week.
>>>>
>>>> Yes, I appreciate that both of you are responding to my questions. :-)
>>>>
>>>> I have some more specific questions/comments below. Please take all of this in the spirit of trying to make this process easier by pointing out where I’ve found it hard, and not just me complaining. I’d like to work on fixing any of these things that can be fixed, by writing or reviewing patches for early in kilo.
>>>>
>>>>>
>>>>> Here is my off the cuff set of guidelines:
>>>>>
>>>>> #1 - is it a test failure or a setup failure
>>>>>
>>>>> This should be pretty easy to figure out. Test failures come at the end
>>>>> of console log and say that tests failed (after you see a bunch of
>>>>> passing tempest tests).
>>>>>
>>>>> Always start at *the end* of files and work backwards.
>>>>
>>>> That’s interesting because in my case I saw a lot of failures after the initial “real” problem. So I usually read the logs like C compiler output: Assume the first error is real, and the others might have been caused by that one. Do you work from the bottom up to a point where you don’t see any more errors instead of reading top down?
>>>
>>> Bottom up to get to problems, then figure out if it's in a subprocess so
>>> the problems could exist for a while. That being said, not all tools do
>>> useful things like actually error when they fail (I'm looking at you
>>> yum....) so there are always edge cases here.
>>>
>>>>>
>>>>> #2 - if it's a test failure, what API call was unsuccessful.
>>>>>
>>>>> Start with looking at the API logs for the service at the top level, and
>>>>> see if there is a simple traceback at the right timestamp. If not,
>>>>> figure out what that API call was calling out to, again look at the
>>>>> simple cases assuming failures will create ERRORS or TRACES (though they
>>>>> often don't).
>>>>
>>>> In my case, a neutron call failed. Most of the other services seem to have a *-api.log file, but neutron doesn’t. It took a little while to find the API-related messages in screen-q-svc.txt (I’m glad I’ve been around long enough to know it used to be called “quantum”). I get that screen-n-*.txt would collide with nova. Is it necessary to abbreviate those filenames at all?
>>>
>>> Yeh... service naming could definitely be better, especially with
>>> neutron. There are implications for long names in screen, but maybe we
>>> just get over it as we already have too many tabs to be in one page in
>>> the console anymore anyway.
>>>
>>>>> Hints on the service log order you should go after are on the footer
>>>>> over every log page -
>>>>> http://logs.openstack.org/76/79776/15/gate/gate-tempest-dsvm-full/700ee7e/logs/
>>>>> (it's included as an Apache footer) for some services. It's been there
>>>>> for about 18 months, I think people are fully blind to it at this point.
>>>>
>>>> Where would I go to edit that footer to add information about the neutron log files? Is that Apache footer defined in an infra repo?
>>>
>>> Note the following at the end of the footer output:
>>>
>>> About this Help
>>>
>>> This help file is part of the openstack-infra/config project, and can be
>>> found at modules/openstack_project/files/logs/help/tempest_logs.html .
>>> The file can be updated via the standard OpenStack Gerrit Review process.
>>
>> /me smacks forehead
>
> :)
>
> Also note an early version of this base email is at the top level for
> all runs - (i.e. -
> http://logs.openstack.org/76/79776/15/gate/gate-tempest-dsvm-full/700ee7e/)
>
> It's been there about 18 months. People look right past it. Which is
> part of where my skepticism on just writing things down being the
> solution. Because a bunch of it has been written down. But until people
> are in a mode of pulling the information in, pushing it out doesn't help.
Fair enough.
>
>>>> Another specific issue I’ve seen is a message that says something to the effect “the setup for this job failed, check the appropriate log”. I found 2 files with “setup” in the name, but the failure was actually logged in a different file (devstacklog.txt). Is the job definition too far “removed” from the scripts to know what the real filename is? Is it running scripts that log to multiple files during the setup phase, and so it doesn’t know which to refer me to? Or maybe I overlooked a message about when logging to a specific file started.
>>>
>>> Part of the issue here is that devstack-gate runs a lot of different
>>> gate_hooks. So that's about as specific as we can get unless you can
>>> figure out how to introspect that info in bash... which I couldn’t.
>>
>> Are all of the hooks logging to the same file? If not, why not? Would it make sense to change that so the error messages could be more specific?
>
> They are not, output direction is actually typically a function of the
> hook script and not devstack gate.
>
> Some of this is because the tools when run locally need to be able to
> natively support logging. Some of this is because processing logs into
> elastic search requires that we know we understand the log format (a
> generic gate_hook log wouldn't work well there). Some of it is historical.
OK, that makes sense.
>
> I did spend a bunch of time cleaning up the grenade summary log so in
> the console you get some basic idea of what's going on, and what part
> you failed in. Definitely could be better. Taking some of those summary
> lessons into devstack wouldn't hurt either.
I don't think I've hit a grenade issue, so I haven’t seen that.
>
> So patches here are definitely accepted. Which is very much not a blow
> off, but in cleaning d-g up over the last 6 months “the setup for this
Yep, I’m asking if I’m even thinking in the right directions, and that sounds like a “yes” rather than a blow off.
> job failed, check the appropriate log” was about as good as we could
> figure out. Previously the script just died and people usually blamed an
> error message about uploading artifacts in the jenkins output for the
> failure. So if you can figure out a better UX given the constraints
> we're in, definitely appreciated.
I’ll look at the job definitions and see if I can come up with a way to parameterize them or automate the step of figuring out which file is meant for each phase.
>
>>>>> If nothing jumps out at ERROR or TRACE, go back to DEBUG level and
>>>>> figure out what's happening at the time of failure, especially keeping
>>>>> an eye out of areas where other workers are doing interesting things at
>>>>> the same time, possibly indicating state corruption in OpenStack as a race.
>>>>>
>>>>> #3 - if it's a console failure, start at the end and work backwards
>>>>>
>>>>> devstack and grenade run under set -o errexit so that they will
>>>>> critically exit if a command fails. They will typically dump some debug
>>>>> when they do that. So the failing command won't be the last line in the
>>>>> file, but it will be close. The words 'error' typically aren't useful at
>>>>> all in shell because lots of things say error when they aren't, we mask
>>>>> their exit codes if their failure is generally irrelevant.
>>>>>
>>>>> #4 - general principle the closer to root cause the better
>>>>>
>>>>> If we think of exposure of bugs as layers we probably end up
>>>>> withsomething like this
>>>>>
>>>>> - Console log
>>>>> - Test Name + Failure
>>>>> - Failure inside an API service
>>>>> - Failure inside a worker process
>>>>> - Actual failure figured out in OpenStack code path
>>>>> - Failure in something below OpenStack (kernel, libvirt)
>>>>>
>>>>> This is why signatures that are just test names aren't all that useful
>>>>> much of the time (and why we try not to add those to ER), as that's
>>>>> going to be hitting an API, but the why of things is very much still
>>>>> undiscovered.
>>>>>
>>>>> #5 - if it's an infrastructure level setup bug (failing to download or
>>>>> install something) figure out if there are other likewise events at the
>>>>> same time (i.e. it's a network issue, which we can't fix) vs. a
>>>>> structural issue.
>>>>>
>>>>>
>>>>> I find Elastic Search good for step 5, but realistically for all other
>>>>> steps it's manual log sifting. I open lots of tabs in Chrome, and search
>>>>> by timestamp.
>>>>
>>>> This feels like something we could improve on. If we had a tool to download the logs and interleave the messages using their timestamps, would that make it easier? We could probably make the job log everything to a single file, but I can see where sometimes only having part of the data to look at would be more useful.
>>>
>>> Maybe, I find the ability to change the filtering level dynamically to
>>> be pretty important. We actually did some of this once when we used
>>> syslog. Personally I found it a ton harder to get to the bottom of things.
>>>
>>> A gate run has 25+ services running, it's a rare issue that combines
>>> interactions between > 4 of them to get to a solution. So I expect you'd
>>> exchange context jumping, for tons of irrelevancy. That's a personal
>>> opinion based on personal workflow, and why I never spent time on it.
>>> Instead I built os-loganalyze that does the filtering and coloring
>>> dynamically on the server side, as it was a zero install solution that
>>> provided additional benefits of being able to link to a timestamp in a
>>> log for sharing purposes.
>>
>> Sure, that makes sense.
>>
>>>
>>>>
>>>>>
>>>>>
>>>>> A big part of the experience also just comes from a manual bayesian
>>>>> filter. Certain scary looking things in the console log aren't, but you
>>>>> don't know that unless you look at setup logs enough (either in gate or
>>>>> in your own devstacks) to realize that. Sanitizing the output of that
>>>>> part of the process is pretty intractable... because shell (though I've
>>>>> put some serious effort into it over the last 6 months).
>>>>
>>>> Maybe our scripts can emit messages to explain the scary stuff? “This is going to report a problem, but you can ignore it unless X happens.”?
>>>
>>> Maybe, like I said it's a lot better than it used to be. But very few
>>> people are putting in effort here, and I'm not convinced it's really
>>> solveable in bash.
>>
>> OK, well, if the answers to these questions are “yes” then I should have time to help, which is why I’m exploring options.
>
> Yeh, the issue is you'd need a couple hundred different messages like
> that, and realistically I think they'd lead to more confusion rather
> than less.
>
> Honestly, I did a huge amount of selective filtering out of xtrace logs
> in the last six months and was able to drop the size of the devstack
> logs by over 50% getting rid of some of the more confusing trace bits.
> But it's something that you make progress on 1% at a time.
>
> At some point we do need to say "you have to understand OpenStack and
> the Test run process ^this much^ to be able to ride", because cleaning
> up every small thing isn't really possible.
>
> Now, providing a better flow explaining the parts here might be good. We
> do it during Infra bootcamps, and people find it helpful. But again,
> that's a mostly pull model because the people showing up did so
> specifically to learn, so are much more receptive to gaining the
> information at hand.
>
>>>>> Sanitizing the OpenStack logs to be crisp about actual things going
>>>>> wrong, vs. not, shouldn't be intractable, but it feels like it some
>>>>> times. Which is why all operators run at DEBUG level. The thing that
>>>>> makes it hard for developers to see the issues here is the same thing
>>>>> that makes it *really* hard for operators to figure out failures. It's
>>>>> also why I tried (though executed poorly on, sorry about that) getting
>>>>> log cleanups rolling this cycle.
>>>>
>>>> I would like to have the TC back an official cross-project effort to clean up the logs for Kilo, and get all of the integrated projects to commit to working on it as a priority.
>>>>
>>>> Doug
>>>>
>>>>>
>>>>> -Sean
>>>>>
>>>>> --
>>>>> Sean Dague
>>>>> http://dague.net
>>>>>
>>>>> _______________________________________________
>>>>> OpenStack-dev mailing list
>>>>> OpenStack-dev at lists.openstack.org
>>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>>
>>>>
>>>> _______________________________________________
>>>> OpenStack-dev mailing list
>>>> OpenStack-dev at lists.openstack.org
>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>>
>>>
>>>
>>> --
>>> Sean Dague
>>> http://dague.net
>>>
>>> _______________________________________________
>>> OpenStack-dev mailing list
>>> OpenStack-dev at lists.openstack.org
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>>
>> _______________________________________________
>> OpenStack-dev mailing list
>> OpenStack-dev at lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>
>
> --
> Sean Dague
> http://dague.net
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
More information about the OpenStack-dev
mailing list