[openstack-dev] [all] gate debugging
Joe Gordon
joe.gordon0 at gmail.com
Thu Aug 28 18:07:28 UTC 2014
On Thu, Aug 28, 2014 at 10:17 AM, Sean Dague <sean at dague.net> wrote:
> On 08/28/2014 12:48 PM, Doug Hellmann wrote:
> >
> > On Aug 27, 2014, at 5:56 PM, Sean Dague <sean at dague.net> wrote:
> >
> >> On 08/27/2014 05:27 PM, Doug Hellmann wrote:
> >>>
> >>> On Aug 27, 2014, at 2:54 PM, Sean Dague <sean at dague.net> wrote:
> >>>
> >>>> Note: thread intentionally broken, this is really a different topic.
> >>>>
> >>>> On 08/27/2014 02:30 PM, Doug Hellmann wrote:
> >>>>> On Aug 27, 2014, at 1:30 PM, Chris Dent <chdent at redhat.com> wrote:
> >>>>>
> >>>>>> On Wed, 27 Aug 2014, Doug Hellmann wrote:
> >>>>>>
> >>>>>>> I have found it immensely helpful, for example, to have a written
> >>>>>>> set of the steps involved in creating a new library, from
> >>>>>>> importing the git repo all the way through to making it available
> >>>>>>> to other projects. Without those instructions, it would have been
> >>>>>>> much harder to split up the work. The team would have had to train
> >>>>>>> each other by word of mouth, and we would have had constant issues
> >>>>>>> with inconsistent approaches triggering different failures. The
> >>>>>>> time we spent building and verifying the instructions has paid off
> >>>>>>> to the extent that we even had one developer not on the core team
> >>>>>>> handle a graduation for us.
> >>>>>>
> >>>>>> +many more for the relatively simple act of just writing stuff down
> >>>>>
> >>>>> "Write it down.” is my theme for Kilo.
> >>>>
> >>>> I definitely get the sentiment. "Write it down" is also hard when you
> >>>> are talking about things that change around quite a bit. OpenStack as
> >>>> a whole sees 250 - 500 changes a week, so the interaction pattern
> >>>> moves around enough that it's really easy to have *very* stale
> >>>> information written down. Stale information is sometimes even more
> >>>> dangerous than no information, as it takes people down very wrong
> >>>> paths.
> >>>>
> >>>> I think we break down on communication when we get into a
> >>>> conversation of "I want to learn gate debugging" because I don't
> >>>> quite know what that means, or where the starting point of
> >>>> understanding is. So those intentions are well meaning, but tend to
> >>>> stall. The reality is that there was no road map for those of us who
> >>>> dove in; it's just understanding how OpenStack holds together as a
> >>>> whole and where some of the high-risk parts are. And a lot of that
> >>>> comes from days of staring at code and logs until patterns emerge.
> >>>>
> >>>> Maybe if we can get smaller, more targeted questions, we can help
> >>>> folks better? I'm personally a big fan of answering the targeted
> >>>> questions because then I also know that the time spent exposing that
> >>>> information was directly useful.
> >>>>
> >>>> I'm more than happy to mentor folks. But I just end up finding "I
> >>>> want to learn" at the generic level something that's hard to grab
> >>>> onto or turn into action. I'd love to hear more ideas from folks
> >>>> about ways we might do that better.
> >>>
> >>> You and a few others have developed an expertise in this important
> >>> skill. I am so far away from that level of expertise that I don’t know
> >>> the questions to ask. More often than not I start with the console
> >>> log, find something that looks significant, spend an hour or so
> >>> tracking it down, and then have someone tell me that it is a red
> >>> herring and the issue is really some other thing that they figured out
> >>> very quickly by looking at a file I never got to.
> >>>
> >>> I guess what I’m looking for is some help with the patterns. What made
> >>> you think to look in one log file versus another? Some of these jobs
> >>> save a zillion little files; which ones are actually useful? What
> >>> tools are you using to correlate log entries across all of those
> >>> files? Are you doing it by hand? Is logstash useful for that, or is it
> >>> more useful for finding multiple occurrences of the same issue?
> >>>
> >>> I realize there’s not a way to write a how-to that will live forever.
> >>> Maybe one way to deal with that is to write up the research done on
> >>> bugs soon after they are solved, and publish that to the mailing list.
> >>> Even the retrospective view is useful because we can all learn from it
> >>> without having to live through it. The mailing list is a fairly
> >>> ephemeral medium, and something very old in the archives is understood
> >>> to have a good chance of being out of date so we don’t have to keep
> >>> adding disclaimers.
> >>
> >> Sure. Matt's actually working up a blog post describing the thing he
> >> nailed earlier in the week.
> >
> > Yes, I appreciate that both of you are responding to my questions. :-)
> >
> > I have some more specific questions/comments below. Please take all of
> > this in the spirit of trying to make this process easier by pointing out
> > where I’ve found it hard, and not just me complaining. I’d like to work
> > on fixing any of these things that can be fixed by writing or reviewing
> > patches early in Kilo.
> >
> >>
> >> Here is my off-the-cuff set of guidelines:
> >>
> >> #1 - is it a test failure or a setup failure?
> >>
> >> This should be pretty easy to figure out. Test failures come at the end
> >> of the console log and say that tests failed (after you see a bunch of
> >> passing tempest tests).
> >>
> >> Always start at *the end* of files and work backwards.
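To make that bottom-up scan concrete, here is a rough shell sketch of it;
the LOG_URL variable and the grep patterns are placeholders, not canonical:

    # Rough sketch: scan a downloaded console.html from the bottom up.
    # LOG_URL stands in for the per-run log directory.
    wget -q "${LOG_URL}/console.html"

    # tac reverses the file, so 'grep -m1' returns the *last* match in the
    # original file, which is usually the one closest to the real failure.
    tac console.html | grep -m1 -E 'Traceback|FAILED|ERROR'

    # If tempest results appear near the end, it's a test failure; if the
    # job died before tempest ever started, it's a setup failure.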
> >
> > That’s interesting because in my case I saw a lot of failures after the
> > initial “real” problem. So I usually read the logs like C compiler
> > output: Assume the first error is real, and the others might have been
> > caused by that one. Do you work from the bottom up to a point where you
> > don’t see any more errors instead of reading top down?
>
> Bottom up to get to the problems, then figure out whether it happened in
> a subprocess, in which case the real problem may have started a while
> earlier. That being said, not all tools do useful things like actually
> return an error code when they fail (I'm looking at you, yum...), so
> there are always edge cases here.
>
> >>
> >> #2 - if it's a test failure, which API call was unsuccessful?
> >>
> >> Start by looking at the API logs for the service at the top level, and
> >> see if there is a simple traceback at the right timestamp. If not,
> >> figure out what that API call was calling out to; again, look at the
> >> simple cases, assuming failures will create ERRORs or TRACEs (though
> >> they often don't).
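When doing this by hand on downloaded logs, that first pass is roughly
something like the following; the file names and the timestamp window are
placeholders:

    # Look for tracebacks/errors in the API log of the service the failed
    # request hit, near the failure timestamp.
    TS='2014-08-28 17:4'      # truncated timestamp = ~10 minute window
    grep -n -E 'ERROR|TRACE' screen-n-api.txt | grep "$TS"

    # If nothing shows up, chase the call one layer down (conductor,
    # compute, neutron agents, ...) over the same window.
    grep -n -E 'ERROR|TRACE' screen-q-svc.txt | grep "$TS"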
> >
> > In my case, a neutron call failed. Most of the other services seem to
> > have a *-api.log file, but neutron doesn’t. It took a little while to
> > find the API-related messages in screen-q-svc.txt (I’m glad I’ve been
> > around long enough to know it used to be called “quantum”). I get that
> > screen-n-*.txt would collide with nova. Is it necessary to abbreviate
> > those filenames at all?
>
> Yeah... service naming could definitely be better, especially with
> neutron. There are implications for long names in screen, but maybe we
> just get over it, since we already have too many tabs to fit on one page
> in the console anyway.
>
> >> Hints on the service log order you should go after are in the footer
> >> of every log page -
> >> http://logs.openstack.org/76/79776/15/gate/gate-tempest-dsvm-full/700ee7e/logs/
> >> (it's included as an Apache footer) for some services. It's been there
> >> for about 18 months; I think people are fully blind to it at this
> >> point.
> >
> > Where would I go to edit that footer to add information about the
> > neutron log files? Is that Apache footer defined in an infra repo?
>
> Note the following at the end of the footer output:
>
> About this Help
>
> This help file is part of the openstack-infra/config project, and can be
> found at modules/openstack_project/files/logs/help/tempest_logs.html .
> The file can be updated via the standard OpenStack Gerrit Review process.
>
I took a first whack at trying to add some more information to the footer
here: https://review.openstack.org/#/c/117390/
>
> > Another specific issue I’ve seen is a message that says something to
> > the effect of “the setup for this job failed, check the appropriate
> > log”. I found 2 files with “setup” in the name, but the failure was
> > actually logged in a different file (devstacklog.txt). Is the job
> > definition too far “removed” from the scripts to know what the real
> > filename is? Is it running scripts that log to multiple files during
> > the setup phase, and so it doesn’t know which to refer me to? Or maybe
> > I overlooked a message about when logging to a specific file started.
>
> Part of the issue here is that devstack-gate runs a lot of different
> gate_hooks. So that's about as specific as we can get unless you can
> figure out how to introspect that info in bash... which I couldn't.
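One purely hypothetical way to get closer (this is not how devstack-gate
works today) would be for whatever runs the hook to announce its log file
up front, so the console message could name an actual file:

    # Hypothetical wrapper: announce where a setup phase logs before
    # running it, so "check the appropriate log" can point at a real file.
    run_phase() {
        local phase=$1 logfile=$2
        echo "*** Running ${phase}; output logged to ${logfile} ***"
        "${phase}" >> "${logfile}" 2>&1 || {
            echo "*** ${phase} failed, see ${logfile} ***"
            return 1
        }
    }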
>
> >> If nothing jumps out at ERROR or TRACE, go back to DEBUG level and
> >> figure out what's happening at the time of failure, especially keeping
> >> an eye out for areas where other workers are doing interesting things
> >> at the same time, possibly indicating state corruption in OpenStack as
> >> a race.
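In practice that usually means carving the failure window out of the
worker logs to see what everyone was doing at that moment. A crude sketch;
the timestamps are placeholders, and multi-line tracebacks without their
own timestamps will be skipped:

    # Print everything a worker logged between two timestamps.
    awk '$1" "$2 >= "2014-08-28 17:42:10" &&
         $1" "$2 <= "2014-08-28 17:42:40"' screen-n-cpu.txt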
> >>
> >> #3 - if it's a console failure, start at the end and work backwards
> >>
> >> devstack and grenade run under set -o errexit so that they will
> >> critically exit if a command fails. They will typically dump some
> >> debug output when they do that, so the failing command won't be the
> >> last line in the file, but it will be close. The word 'error'
> >> typically isn't useful at all in shell, because lots of things say
> >> error when they aren't errors; we mask their exit codes if their
> >> failure is generally irrelevant.
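In other words, the shell semantics look roughly like this toy example
(the command names are placeholders, not real devstack code):

    set -o errexit            # any failing command aborts the whole script

    # Noise: prints scary "error" text, but the exit code is deliberately
    # masked because its failure doesn't matter to the run.
    some_flaky_cleanup_command || true

    # Real failure: exits nonzero, errexit aborts the script, and the
    # debug dump printed on the way out lands *after* it, which is why the
    # failing command is near, but not at, the very end of the log.
    critical_setup_command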
> >>
> >> #4 - general principle: the closer to root cause, the better
> >>
> >> If we think of exposure of bugs as layers, we probably end up with
> >> something like this:
> >>
> >> - Console log
> >> - Test Name + Failure
> >> - Failure inside an API service
> >> - Failure inside a worker process
> >> - Actual failure figured out in OpenStack code path
> >> - Failure in something below OpenStack (kernel, libvirt)
> >>
> >> This is why signatures that are just test names aren't all that useful
> >> much of the time (and why we try not to add those to elastic-recheck):
> >> they tell you a test hit an API and failed, but the why of things is
> >> very much still undiscovered.
> >>
> >> #5 - if it's an infrastructure-level setup bug (failing to download or
> >> install something), figure out if there are other similar events at
> >> the same time (i.e. it's a network issue, which we can't fix) vs. a
> >> structural issue.
> >>
> >>
> >> I find Elastic Search good for step 5, but realistically for all other
> >> steps it's manual log sifting. I open lots of tabs in Chrome, and search
> >> by timestamp.
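For the Elastic Search side, if memory serves the logstash.openstack.org
queries look something like the line below; the message string is a
placeholder and the exact field names may be off:

    message:"<some distinctive log line>" AND tags:"console"
      AND build_status:"FAILURE"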
> >
> > This feels like something we could improve on. If we had a tool to
> > download the logs and interleave the messages using their timestamps,
> > would that make it easier? We could probably make the job log
> > everything to a single file, but I can see where sometimes only having
> > part of the data to look at would be more useful.
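For what it's worth, a crude version of that interleaving already falls
out of standard tools, since the service logs share a leading
"YYYY-MM-DD HH:MM:SS.mmm" timestamp. A sketch over downloaded files;
multi-line tracebacks won't interleave cleanly:

    # Merge already-chronological service logs on their date+time fields.
    sort -m -k1,2 screen-n-api.txt screen-n-cpu.txt screen-q-svc.txt \
        > interleaved.txt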
>
> Maybe. I find the ability to change the filtering level dynamically to
> be pretty important. We actually did some of this once when we used
> syslog; personally, I found it a ton harder to get to the bottom of
> things.
>
> A gate run has 25+ services running; it's a rare issue that combines
> interactions between > 4 of them to get to a solution. So I expect you'd
> exchange context jumping for tons of irrelevancy. That's a personal
> opinion based on personal workflow, and why I never spent time on it.
> Instead I built os-loganalyze, which does the filtering and coloring
> dynamically on the server side, as it was a zero-install solution with
> the additional benefit of being able to link to a timestamp in a log for
> sharing purposes.
>
> >
> >>
> >>
> >> A big part of the experience also just comes from a manual Bayesian
> >> filter. Certain scary-looking things in the console log aren't
> >> actually problems, but you don't know that unless you've looked at
> >> setup logs enough (either in the gate or in your own devstacks) to
> >> realize it. Sanitizing the output of that part of the process is
> >> pretty intractable... because shell (though I've put some serious
> >> effort into it over the last 6 months).
> >
> > Maybe our scripts can emit messages to explain the scary stuff? “This
> > is going to report a problem, but you can ignore it unless X happens.”?
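Something as simple as the following in the setup scripts would probably
go a long way (illustrative only; the command name is a placeholder):

    # Warn the reader before a command that is known to print scary but
    # harmless errors, and mask its exit code on purpose.
    echo "NOTE: the next command often prints ERROR lines; that is" \
         "expected and can be ignored unless the job fails later on."
    some_noisy_but_harmless_command || true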
>
> Maybe; like I said, it's a lot better than it used to be. But very few
> people are putting in effort here, and I'm not convinced it's really
> solvable in bash.
>
> >> Sanitizing the OpenStack logs to be crisp about actual things going
> >> wrong, vs. not, shouldn't be intractable, but it sometimes feels like
> >> it is, which is why all operators run at DEBUG level. The thing that
> >> makes it hard for developers to see the issues here is the same thing
> >> that makes it *really* hard for operators to figure out failures. It's
> >> also why I tried (though executed it poorly, sorry about that) to get
> >> log cleanups rolling this cycle.
> >
> > I would like to have the TC back an official cross-project effort to
> > clean up the logs for Kilo, and get all of the integrated projects to
> > commit to working on it as a priority.
> >
> > Doug
> >
> >>
> >> -Sean
> >>
> >> --
> >> Sean Dague
> >> http://dague.net
> >>
>
> --
> Sean Dague
> http://dague.net
>