[openstack-dev] Gate Status - Friday Edition

Salvatore Orlando sorlando at nicira.com
Sat Jan 25 09:08:24 UTC 2014


Thanks Chris!

Some comments inline.


On 25 January 2014 02:08, Chris Wright <chrisw at sous-sol.org> wrote:

> * Salvatore Orlando (sorlando at nicira.com) wrote:
> > I've found out that several jobs are exhibiting failures like bug 1254890
> > [1] and bug 1253896 [2] because openvswitch seems to be crashing the
> > kernel. The kernel trace usually reports either
> > neutron-ns-metadata-proxy or dnsmasq as the offending process, but [3]
> > seems to clearly point to ovs-vsctl.
>
> Hmm, that actually shows dnsmasq is the running/exiting process.
> The ovs-vsctl was run nearly a half-second earlier.  Looks like
> ovs-vsctl successfully added the tap device (assuming it's for
> dnsmasq?).


I think you're right. The most reliable source of information should be the
crash dump. And the fact that there are always ovs operations near the
crash might point to a namespace issue due to the way neutron operates. I
understand very little about kernel issues, but the trace is very similar
to another namespace-related issue we saw back in October.


> And dnsmasq is exiting upon receiving a signal.  Shot in
> the dark, has the neutron path that would end up killing dnsmasq
> (Dnsmasq::reload_allocations()) changed recently?  I didn't see much.
>

Nope, that has not changed in a while. The last commit that edited it
is 9274095b4af63de7224b524e482872a78e027a7b.
However, most of the crashes occur with the metadata proxy, which runs in a
namespace and forwards traffic to the metadata agent through a unix socket.
Unfortunately logging is not optimal for these proxies as they're spawned
by the l3 agent and logs are partially collected within the l3 agent log.
I'll see if anything can be done to improve their logging.
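
To double-check which process shows up most often in the traces, a rough
tally along these lines might help. This is only a sketch: it assumes the
oops dump contains a line of the form "Pid: N, comm: NAME ..." shortly after
the "kernel BUG at" line, which I have not verified against the gate logs,
and tally_offenders and its window parameter are made-up names.

```python
import re
from collections import Counter

SIGNATURE = "kernel BUG at"
# Assumed oops line format: "Pid: 1234, comm: dnsmasq Not tainted ..."
# The kernel truncates comm to 15 characters, so neutron-ns-metadata-proxy
# would show up in shortened form.
COMM_RE = re.compile(r"Pid: \d+, comm: (\S+)")


def tally_offenders(syslog_text, window=20):
    """Count the process named in the first 'comm:' line after each BUG line.

    Only the `window` lines following a signature line are searched, so
    unrelated traces further down are not attributed to this crash.
    """
    lines = syslog_text.splitlines()
    counts = Counter()
    for i, line in enumerate(lines):
        if SIGNATURE not in line:
            continue
        for follow in lines[i + 1:i + 1 + window]:
            match = COMM_RE.search(follow)
            if match:
                counts[match.group(1)] += 1
                break
    return counts
```

Feeding it the contents of a syslog.txt returns a Counter keyed by process
name, which should make it easier to see whether the metadata proxy really
dominates.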


> > 254 events observed in the previous 6 days show a similar trace in the
> > logs [4].
>
> That kernel (3.2.0) is over a year old.  And there have been some network
> namespace fixes since then (IIRC, refcounting related).
>

I would surely consider upgrading the kernel, if that is feasible. But in
the meantime I think we should focus on identifying which change started
to trigger all these kernel crashes.
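
As a rough way to narrow down when the crashes started, something like the
sketch below could be run over downloaded syslog.txt files. The signature
string is the one from the query in [4]; the (timestamp, syslog text) pairs
and the function names are hypothetical, and collecting the logs is left out.

```python
from datetime import datetime

# Signature taken from the logstash query in [4].
SIGNATURE = "kernel BUG at /build/buildd/linux-3.2.0/fs/buffer.c:2917"


def first_crash(runs):
    """Return the timestamp of the earliest run whose syslog has SIGNATURE.

    `runs` is a hypothetical list of (datetime, syslog_text) pairs, one per
    job run; returns None if no run shows the signature.
    """
    hits = [when for when, text in runs if SIGNATURE in text]
    return min(hits) if hits else None


def last_clean_before(runs, crash_time):
    """Return the latest run timestamp strictly before crash_time, or None."""
    earlier = [when for when, _ in runs if when < crash_time]
    return max(earlier) if earlier else None
```

The gap between last_clean_before(...) and first_crash(...) gives the window
in which to look for a kernel, OVS, or neutron change.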


>
> > This means that while this alone won't explain all the failures observed,
> > it is however potentially one of the prominent root causes.
> >
> > From the logs I have few hints about which kernel is running. It seems
> > there has been no update in the past 7 days, but I can't be sure.
> > Openvswitch builds are updated periodically. The last build I found not
> > to trigger failures was the one generated on 2014/01/16 at 01:58:18.
> > Unfortunately version-wise I always have only 1.4.0, no build number.
> >
> > I don't know if this will require getting in touch with Ubuntu, or if we
> > can just prep a different image with an OVS build known to work without
> > problems.
> >
> > Salvatore
> >
> > [1] https://bugs.launchpad.net/neutron/+bug/1254890
> > [2] https://bugs.launchpad.net/neutron/+bug/1253896
> > [3] http://paste.openstack.org/show/61869/
> > [4] "kernel BUG at /build/buildd/linux-3.2.0/fs/buffer.c:2917" and
> > filename:syslog.txt