[openstack-dev] [all] gate debugging

Sean Dague sean at dague.net
Wed Aug 27 21:56:44 UTC 2014


On 08/27/2014 05:27 PM, Doug Hellmann wrote:
> 
> On Aug 27, 2014, at 2:54 PM, Sean Dague <sean at dague.net> wrote:
> 
>> Note: thread intentionally broken, this is really a different topic.
>>
>> On 08/27/2014 02:30 PM, Doug Hellmann wrote:
>>> On Aug 27, 2014, at 1:30 PM, Chris Dent <chdent at redhat.com> wrote:
>>>
>>>> On Wed, 27 Aug 2014, Doug Hellmann wrote:
>>>>
>>>>> I have found it immensely helpful, for example, to have a written set
>>>>> of the steps involved in creating a new library, from importing the
>>>>> git repo all the way through to making it available to other projects.
>>>>> Without those instructions, it would have been much harder to split up
>>>>> the work. The team would have had to train each other by word of
>>>>> mouth, and we would have had constant issues with inconsistent
>>>>> approaches triggering different failures. The time we spent building
>>>>> and verifying the instructions has paid off to the extent that we even
>>>>> had one developer not on the core team handle a graduation for us.
>>>>
>>>> +many more for the relatively simple act of just writing stuff down
>>>
>>> "Write it down.” is my theme for Kilo.
>>
>> I definitely get the sentiment. "Write it down" is also hard when you
>> are talking about things that change around quite a bit. OpenStack as
>> a whole sees 250 - 500 changes a week, so the interaction patterns
>> move around enough that it's really easy to have *very* stale
>> information written down. Stale information is sometimes even more
>> dangerous than no information, as it takes people down very wrong paths.
>>
>> I think communication breaks down when we get into a conversation of
>> "I want to learn gate debugging", because I don't quite know what
>> that means, or where the starting point of understanding is. So those
>> intentions are well meaning, but tend to stall. The reality is that
>> there was no road map for those of us who dove in; it's just
>> understanding how OpenStack holds together as a whole and where some
>> of the high-risk parts are. And a lot of that comes from days of
>> staring at code and logs until patterns emerge.
>>
>> Maybe if we can get smaller more targeted questions, we can help folks
>> better? I'm personally a big fan of answering the targeted questions
>> because then I also know that the time spent exposing that information
>> was directly useful.
>>
>> I'm more than happy to mentor folks. But I find that "I want to
>> learn" at the generic level is hard to grasp onto or turn into
>> action. I'd love to hear more ideas from folks about ways we might do
>> that better.
> 
> You and a few others have developed an expertise in this important skill. I am so far away from that level of expertise that I don’t know the questions to ask. More often than not I start with the console log, find something that looks significant, spend an hour or so tracking it down, and then have someone tell me that it is a red herring and the issue is really some other thing that they figured out very quickly by looking at a file I never got to.
> 
> I guess what I’m looking for is some help with the patterns. What made you think to look in one log file versus another? Some of these jobs save a zillion little files, which ones are actually useful? What tools are you using to correlate log entries across all of those files? Are you doing it by hand? Is logstash useful for that, or is that more useful for finding multiple occurrences of the same issue?
> 
> I realize there’s not a way to write a how-to that will live forever. Maybe one way to deal with that is to write up the research done on bugs soon after they are solved, and publish that to the mailing list. Even the retrospective view is useful because we can all learn from it without having to live through it. The mailing list is a fairly ephemeral medium, and something very old in the archives is understood to have a good chance of being out of date so we don’t have to keep adding disclaimers.

Sure. Matt's actually working up a blog post describing the thing he
nailed earlier in the week.

Here is my off-the-cuff set of guidelines:

#1 - is it a test failure or a setup failure

This should be pretty easy to figure out. Test failures come at the end
of the console log and say that tests failed (after you see a bunch of
passing tempest tests).

Always start at *the end* of files and work backwards.
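
As a rough illustration of what that first split looks like if you
script it (a Python sketch; the URL and the failure markers are
placeholders I made up, the real strings vary by job and test runner):

import requests

# Placeholder URL -- point this at the console log of the run you are
# debugging.
CONSOLE_URL = "http://logs.openstack.org/<change>/<job>/console.html"

def classify_console(url=CONSOLE_URL, tail_lines=300):
    tail = requests.get(url).text.splitlines()[-tail_lines:]
    # Work backwards: the interesting lines are near the end, not at it.
    for line in reversed(tail):
        # Illustrative markers only -- the real summary strings depend
        # on the test runner and the job scripts.
        if "FAILED (failures=" in line or "Failed: " in line:
            return "test failure"
    # No test-result summary near the end usually means the job died
    # before (or while) running the tests, i.e. a setup failure.
    return "probably a setup failure -- read the tail by hand"

print(classify_console())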

#2 - if it's a test failure, figure out which API call was unsuccessful

Start by looking at the API logs for the service at the top level, and
see if there is a simple traceback at the right timestamp. If not,
figure out what that API call was calling out to, again looking at the
simple cases first and assuming failures will create ERRORs or TRACEs
(though they often don't).
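
A minimal sketch of that first pass, assuming screen-style service logs
with leading "YYYY-MM-DD HH:MM:SS" timestamps (the file name, failure
timestamp, and window size are just examples):

import re
from datetime import datetime, timedelta

TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

def errors_near(log_path, failure_time, window_seconds=5):
    """Return ERROR/TRACE lines within a few seconds of the failure."""
    window = timedelta(seconds=window_seconds)
    hits = []
    with open(log_path) as f:
        for line in f:
            m = TS_RE.match(line)
            if not m:
                continue  # lines without a timestamp prefix
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            if abs(ts - failure_time) <= window and \
                    (" ERROR " in line or " TRACE " in line):
                hits.append(line.rstrip())
    return hits

for hit in errors_near("screen-n-api.txt",
                       datetime(2014, 8, 27, 18, 32, 10)):
    print(hit)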

Hints on the service log order you should go after are in the footer of
every log page -
http://logs.openstack.org/76/79776/15/gate/gate-tempest-dsvm-full/700ee7e/logs/
(it's included as an Apache footer) for some services. It's been there
for about 18 months, and I think people are fully blind to it at this
point.

If nothing jumps out at ERROR or TRACE, go back to the DEBUG level and
figure out what's happening at the time of the failure, especially
keeping an eye out for areas where other workers are doing interesting
things at the same time, possibly indicating a race corrupting state
somewhere in OpenStack.
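
At that point what I really want is a merged, timestamp-ordered view of
several service logs around the failure. A rough sketch of that (again
assuming screen-style logs with leading timestamps; the file names and
failure time are illustrative):

import re
from datetime import datetime, timedelta

TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

def window_view(log_paths, failure_time, window_seconds=10):
    """Print lines from the given logs near the failure, in time order."""
    window = timedelta(seconds=window_seconds)
    merged = []
    for path in log_paths:
        with open(path) as f:
            for line in f:
                m = TS_RE.match(line)
                if not m:
                    continue
                ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
                if abs(ts - failure_time) <= window:
                    merged.append((ts, path, line.rstrip()))
    for ts, path, line in sorted(merged):
        print(path, line)

window_view(["screen-n-api.txt", "screen-n-cpu.txt", "screen-n-sch.txt"],
            datetime(2014, 8, 27, 18, 32, 10))

That's roughly what I end up doing by hand with a pile of browser tabs
and a timestamp in my paste buffer.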

#3 - if it's a console failure, start at the end and work backwards

devstack and grenade run under set -o errexit, so they will critically
exit if a command fails. They typically dump some debug output when
they do that, so the failing command won't be the last line in the
file, but it will be close. The word 'error' typically isn't useful at
all in shell output, because lots of things say error when they aren't
actually failing; we mask their exit codes when the failure is
generally irrelevant.
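
If you want to automate the "start at the end and work backwards" part,
here's a rough sketch that pulls out the last commands echoed before
the job died, assuming xtrace-style output where executed commands show
up with a leading '+' after the log prefix (adjust the pattern to
whatever the log actually shows):

import re

# A line looks like an xtrace command echo if it has '+ ' (or '++ ' for
# nested calls) either at the start of the line or right after the
# job's timestamp/pipe prefix.
CMD_RE = re.compile(r"(?:^|\|\s*)\+{1,2} ")

def last_commands(console_path, how_many=20):
    with open(console_path) as f:
        lines = f.read().splitlines()
    return [l for l in lines if CMD_RE.search(l)][-how_many:]

for cmd in last_commands("console.txt"):  # a saved copy of the console log
    print(cmd)

The failing command is usually in that last handful, just above
whatever debug got dumped on exit.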

#4 - general principle: the closer to root cause, the better

If we think of the exposure of bugs as layers, we probably end up with
something like this:

- Console log
- Test Name + Failure
- Failure inside an API service
- Failure inside a worker process
- Actual failure figured out in OpenStack code path
- Failure in something below OpenStack (kernel, libvirt)

This is why signatures that are just test names aren't all that useful
much of the time (and why we try not to add those to ER): a test name
tells you an API was being hit, but the why of the failure is still
undiscovered.

#5 - if it's an infrastructure-level setup bug (failing to download or
install something), figure out whether there are other similar events
at the same time (i.e. a transient network issue, which we can't fix)
versus a structural issue.


I find Elastic Search good for step #5, but realistically, for all the
other steps it's manual log sifting. I open lots of tabs in Chrome and
search by timestamp.
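
For the step #5 case the question is basically "does this exact failure
message show up across lots of unrelated runs at the same time?". Here
is a rough sketch of asking Elasticsearch that directly over its REST
_search API -- the host, index pattern, and example message are all
placeholders, so adapt them to whatever the logstash setup actually
exposes:

import json
import requests

# Placeholder endpoint -- substitute the real Elasticsearch host.
ES_URL = "http://elasticsearch.example.org:9200/logstash-*/_search"

query = {
    "size": 0,  # we only want the hit count, not the documents
    "query": {
        "query_string": {
            # The same sort of query you would type into the Kibana
            # search box; the message here is just an example.
            "query": 'message:"Cannot fetch index" AND tags:"console"'
        }
    },
}

resp = requests.post(ES_URL, data=json.dumps(query),
                     headers={"Content-Type": "application/json"})
# Older Elasticsearch versions return an integer here; newer ones wrap
# it in an object with a "value" key.
print("matching events:", resp.json()["hits"]["total"])

A pile of hits clustered in time across unrelated changes usually means
a mirror or network blip we can't do much about; a handful of hits tied
to one change points back at the change itself.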


A big part of the experience also just comes from a manual Bayesian
filter. Certain scary-looking things in the console log aren't actually
scary, but you don't know that unless you've looked at setup logs
enough (either in the gate or in your own devstacks) to realize it.
Sanitizing the output of that part of the process is pretty
intractable... because shell (though I've put some serious effort into
it over the last 6 months).

Sanitizing the OpenStack logs to be crisp about actual things going
wrong, vs. not, shouldn't be intractable, but it sometimes feels like
it, which is why all operators run at DEBUG level. The thing that makes
it hard for developers to see the issues here is the same thing that
makes it *really* hard for operators to figure out failures. It's also
why I tried (though executed poorly on, sorry about that) to get log
cleanups rolling this cycle.

	-Sean

-- 
Sean Dague
http://dague.net
