[openstack-dev] [all] gate debugging

Salvatore Orlando sorlando at nicira.com
Wed Aug 27 22:36:47 UTC 2014


As has been pointed out earlier in this thread, debugging gate
failures is mostly about chasing race conditions, which in some cases
involve interactions between the most disparate OpenStack services [1].

Finding the root cause of these races takes a mix of knowledge,
pragmatism, and luck, so having more people looking at gate failures
can only be a good thing.
While little can be done to transfer luck, a great deal can be written
down about pragmatism and knowledge.

Knowledge means knowing the tools, the infrastructure, and ultimately
the dynamics of what is being tested. This involves understanding the
zuul layout, devstack-gate, tempest, and (most importantly, in my
opinion) logstash. Unfortunately it is difficult to get far without
also being sufficiently expert in the subject under test.
For instance, debugging an SSH failure with neutron requires knowledge
of the internals of neutron's l3 agent, ovs agent, and metadata agent,
of the nova/neutron interface and notification system, of nova's
network info instance cache, and so on.
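
To make the logstash point concrete, below is a minimal sketch of
querying the gate's Elasticsearch cluster from Python. The endpoint,
index pattern, and field names (message, build_status, build_uuid) are
assumptions modelled on how elastic-recheck queries are written, and
direct Elasticsearch access may not match how logstash.openstack.org
is actually exposed, so treat it as illustrative only:

    # Hypothetical sketch: search the gate logstash cluster for an SSH
    # timeout signature. Endpoint, index pattern, and field names are
    # assumptions; adjust them to the actual deployment.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://logstash.openstack.org:9200"])

    query = {
        "query": {
            "query_string": {
                # The same query-string syntax used by elastic-recheck.
                "query": 'message:"SSHTimeout" AND build_status:"FAILURE"'
            }
        },
        "sort": [{"@timestamp": {"order": "desc"}}],
        "size": 20,
    }

    result = es.search(index="logstash-*", body=query)
    for hit in result["hits"]["hits"]:
        doc = hit["_source"]
        print("%s %s %s" % (doc.get("build_uuid"), doc.get("@timestamp"),
                            doc.get("message")))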

Pragmatism means writing down and sharing the process followed when
triaging gate failures, especially when it comes to analysing
OpenStack's logs. Different people may follow different processes, and
sharing them can only be good.

To this end, the Neutron community has tried to put these things in
writing in a still-unfinished effort [2]. Ideally a wiki page (or set
of pages) like this would cover not just neutron but the whole set of
projects tested in the integrated gate.

This effort can also serve as a basis for improving the process
itself. For example, event correlation in logs, and the ability to
validate a hypothesis by correlating the traces where a failure
manifests with the traces of its potential root cause, are two areas
with room for improvement.
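
As a strawman for the correlation point, a small script along the
following lines can merge several devstack service logs into a single
timeline around a failure. It assumes the oslo-style timestamp prefix
("2014-08-27 22:36:47.123 ..."); the file names are examples, and real
logs would need more robust handling of multi-line tracebacks:

    # Minimal sketch: merge devstack service logs into one timeline so
    # events from different services can be eyeballed side by side.
    # Lines without a leading timestamp (e.g. traceback continuation
    # lines) are skipped for brevity.
    import re
    from datetime import datetime

    TIMESTAMP = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)')

    def entries(path):
        with open(path) as log:
            for line in log:
                match = TIMESTAMP.match(line)
                if match:
                    stamp = datetime.strptime(match.group(1),
                                              '%Y-%m-%d %H:%M:%S.%f')
                    yield stamp, path, line.rstrip()

    def timeline(paths):
        merged = []
        for path in paths:
            merged.extend(entries(path))
        return sorted(merged)

    # Example: correlate the l3 agent with nova-compute (file names
    # are hypothetical).
    for stamp, source, line in timeline(['screen-q-l3.log',
                                         'screen-n-cpu.log']):
        print("%s %s %s" % (stamp, source, line))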

Salvatore

[1] https://bugs.launchpad.net/neutron/+bug/1273386
[2] https://wiki.openstack.org/wiki/NeutronGateFailureTriage


On 28 August 2014 00:32, Matthew Treinish <mtreinish at kortar.org> wrote:

> On Wed, Aug 27, 2014 at 05:47:09PM -0400, Doug Hellmann wrote:
> >
> > On Aug 27, 2014, at 5:27 PM, Doug Hellmann <doug at doughellmann.com> wrote:
> >
> > >
> > > On Aug 27, 2014, at 2:54 PM, Sean Dague <sean at dague.net> wrote:
> > >
> > >> Note: thread intentionally broken, this is really a different topic.
> > >>
> > >> On 08/27/2014 02:30 PM, Doug Hellmann wrote:
> > >>> On Aug 27, 2014, at 1:30 PM, Chris Dent <chdent at redhat.com> wrote:
> > >>>
> > >>>> On Wed, 27 Aug 2014, Doug Hellmann wrote:
> > >>>>
> > >>>>> I have found it immensely helpful, for example, to have a
> > >>>>> written set of the steps involved in creating a new library,
> > >>>>> from importing the git repo all the way through to making it
> > >>>>> available to other projects. Without those instructions, it
> > >>>>> would have been much harder to split up the work. The team
> > >>>>> would have had to train each other by word of mouth, and we
> > >>>>> would have had constant issues with inconsistent approaches
> > >>>>> triggering different failures. The time we spent building and
> > >>>>> verifying the instructions has paid off to the extent that we
> > >>>>> even had one developer not on the core team handle a
> > >>>>> graduation for us.
> > >>>>
> > >>>> +many more for the relatively simple act of just writing stuff down
> > >>>
> > >>> "Write it down.” is my theme for Kilo.
> > >>
> > >> I definitely get the sentiment. "Write it down" is also hard
> > >> when you are talking about things that do change around quite a
> > >> bit. OpenStack as a whole sees 250 - 500 changes a week, so the
> > >> interaction pattern moves around enough that it's really easy to
> > >> have *very* stale information written down. Stale information is
> > >> sometimes even more dangerous than no information, as it takes
> > >> people down very wrong paths.
> > >>
> > >> I think we break down on communication when we get into a
> > >> conversation of "I want to learn gate debugging" because I don't
> > >> quite know what that means, or where the starting point of
> > >> understanding is. So those intentions are well meaning, but tend
> > >> to stall. The reality was that there was no road map for those of
> > >> us who dove in; it's just understanding how OpenStack holds
> > >> together as a whole and where some of the high-risk parts are.
> > >> And a lot of that comes with days staring at code and logs until
> > >> patterns emerge.
> > >>
> > >> Maybe if we can get smaller, more targeted questions, we can
> > >> help folks better? I'm personally a big fan of answering the
> > >> targeted questions, because then I also know that the time spent
> > >> exposing that information was directly useful.
> > >>
> > >> I'm more than happy to mentor folks. But I just end up finding
> > >> the "I want to learn" at the generic level something that's hard
> > >> to grasp onto or figure out how we turn it into action. I'd love
> > >> to hear more ideas from folks about ways we might do that better.
> > >
> > > You and a few others have developed an expertise in this
> > > important skill. I am so far away from that level of expertise
> > > that I don’t know the questions to ask. More often than not I
> > > start with the console log, find something that looks
> > > significant, spend an hour or so tracking it down, and then have
> > > someone tell me that it is a red herring and the issue is really
> > > some other thing that they figured out very quickly by looking at
> > > a file I never got to.
> > >
> > > I guess what I’m looking for is some help with the patterns.
> > > What made you think to look in one log file versus another? Some
> > > of these jobs save a zillion little files; which ones are
> > > actually useful? What tools are you using to correlate log
> > > entries across all of those files? Are you doing it by hand? Is
> > > logstash useful for that, or is it more useful for finding
> > > multiple occurrences of the same issue?
> > >
> > > I realize there’s not a way to write a how-to that will live
> > > forever. Maybe one way to deal with that is to write up the
> > > research done on bugs soon after they are solved, and publish
> > > that to the mailing list. Even the retrospective view is useful
> > > because we can all learn from it without having to live through
> > > it. The mailing list is a fairly ephemeral medium, and something
> > > very old in the archives is understood to have a good chance of
> > > being out of date, so we don’t have to keep adding disclaimers.
> >
> > Matt’s blog post [1] is an example of the sort of thing I think
> > would be helpful. Obviously one post isn’t going to make the reader
> > an expert, but over time a few of these will impart some useful
> > knowledge.
> >
> > Doug
> >
> > [1] http://blog.kortar.org/?p=52&draftsforfriends=cTT3WsXqsH66eEt6uoi9rQaL2vGc8Vde
>
> So that was just an expiring link (which shouldn't be valid anymore)
> to the draft which I generated to get some initial feedback before I
> posted it. The permanent link to the post is here:
>
> http://blog.kortar.org/?p=52
>
>
> -Matt Treinish
>