[oslo][neutron] Neutron Functional Test Failures with oslo.privsep 1.31.0

Akihiro Motoki amotoki at gmail.com
Mon Jan 14 12:36:42 UTC 2019


A similar failure happens in neutron-fwaas. It blocks several patches
in neutron-fwaas, including policy-in-code support.
https://bugs.launchpad.net/neutron/+bug/1811506

Most failures are fixed by applying Ben's neutron fix
https://review.openstack.org/#/c/629335/ [1],
but we still have one failure
in neutron_fwaas.tests.functional.privileged.test_utils.InNamespaceTest.test_in_namespace
[2].
This failure is also caused by oslo.privsep 1.31.0; it does not happen
with 1.30.1.
Any help would be appreciated.

[1] neutron-fwaas change https://review.openstack.org/#/c/630451/
[2]
http://logs.openstack.org/51/630451/2/check/legacy-neutron-fwaas-dsvm-functional/05b9131/logs/testr_results.html.gz

-- 
Akihiro Motoki (irc: amotoki)


On Wed, Jan 9, 2019 at 9:32, Ben Nemec <openstack at nemebean.com> wrote:

> I think I've got it. At least in my local tests, the handle pointer
> being passed from C -> Python -> C was getting truncated at the Python
> step because we didn't properly define the type. If the address assigned
> was larger than would fit in a standard int then we passed what amounted
> to a bogus pointer back to the C code, which caused the segfault.
>
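The truncation described above can be illustrated with a minimal ctypes sketch. It assumes the usual ctypes behaviour that a foreign function with no declared restype is treated as returning a C int. The address below is hypothetical and chosen only for illustration, and the commented-out declaration at the end is an assumption about what such a fix might look like, not the content of the review.

```python
import ctypes

# Hypothetical 64-bit address, chosen only for illustration.
HANDLE_ADDRESS = 0x7F0012345678


def as_c_int(address):
    """Store the address in a C int, as happens with an undeclared restype."""
    # ctypes integer types do no overflow checking, so the high bits are lost.
    return ctypes.c_int(address).value


def as_void_p(address):
    """Store the address in a pointer-sized type instead."""
    return ctypes.c_void_p(address).value


truncated = as_c_int(HANDLE_ADDRESS)
preserved = as_void_p(HANDLE_ADDRESS)

print(hex(truncated))   # only the low 32 bits survive
print(hex(preserved))   # full address on a 64-bit platform

# With a real library handle (here a hypothetical name `lib`), the pointer
# type would be declared before the first call, for example:
#   lib.nfct_open.restype = ctypes.c_void_p
```

Passing the truncated value back into C as a handle would point at memory the process most likely does not own, which fits the segfault seen when the library dereferences the handle.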
> I have no idea why privsep threading would have exposed this, other than
> maybe running in threads affected the address space somehow?
>
> In any case, https://review.openstack.org/629335 has got these
> functional tests working for me locally in oslo.privsep 1.31.0. It would
> be great if somebody could try them out and verify that I didn't just
> find a solution that somehow only works on my system. :-)
>
> -Ben
>
> On 1/8/19 4:30 PM, Ben Nemec wrote:
> >
> >
> > On 1/8/19 2:22 PM, Slawomir Kaplonski wrote:
> >> Hi Ben,
> >>
> >> I was also looking at it today. I'm not at all a C or oslo.privsep
> >> expert, but I think a new process is spawned here.
> >> I put pdb before the line
> >>
> >> https://github.com/openstack/neutron/blob/master/neutron/privileged/agent/linux/netlink_lib.py#L191
> >> where this issue happens. Then, with "ps aux", I saw:
> >>
> >> vagrant@fullstack-ubuntu ~ $ ps aux | grep privsep
> >> root     18368  0.1  0.5 185752 33544 pts/1    Sl+  13:24   0:00
> >> /opt/stack/neutron/.tox/dsvm-functional/bin/python
> >> /opt/stack/neutron/.tox/dsvm-functional/bin/privsep-helper
> >> --config-file neutron/tests/etc/neutron.conf --privsep_context
> >> neutron.privileged.default --privsep_sock_path
> >> /tmp/tmpG5iqb9/tmp1dMGq0/privsep.sock
> >> vagrant  18555  0.0  0.0  14512  1092 pts/2    S+   13:25   0:00 grep
> >> --color=auto privsep
> >>
> >> But then, when I continued running the test and it segfaulted, the
> >> journal log showed:
> >>
> >> Jan 08 13:25:29 fullstack-ubuntu kernel: privsep-helper[18369]
> >> segfault at 140043e8 ip 00007f8e1800ef32 sp 00007f8e18a63320 error 4
> >> in libnetfilter_conntrack.so.3.5.0[7f8e18009000+1a000]
> >>
> >> Please check the PIDs of those processes. The first one (when the test
> >> was "paused" with pdb) has 18368, and the later segfault has 18369.
> >
> > privsep-helper does fork, so I _think_ that's normal.
> >
> >
> > https://github.com/openstack/oslo.privsep/blob/ecb1870c29b760f09fb933fc8ebb3eac29ffd03e/oslo_privsep/daemon.py#L539
> >
> >
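As a rough illustration of why a forking helper shows up under a different PID than the one first seen in "ps aux", here is a minimal sketch using os.fork on a POSIX system. The function name fork_and_report is hypothetical, and this is not the privsep-helper code itself.

```python
import os


def fork_and_report():
    """Fork once; return (parent_pid, child_pid, pid_reported_by_child)."""
    parent_pid = os.getpid()
    read_fd, write_fd = os.pipe()
    child_pid = os.fork()
    if child_pid == 0:
        # Child process: report its own PID through the pipe, then exit
        # immediately without running any parent cleanup handlers.
        os.close(read_fd)
        os.write(write_fd, str(os.getpid()).encode())
        os._exit(0)
    # Parent process: read what the child reported, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        reported_pid = int(reader.read())
    os.waitpid(child_pid, 0)
    return parent_pid, child_pid, reported_pid


parent_pid, child_pid, reported_pid = fork_and_report()
print(parent_pid, child_pid, reported_pid)
```

The child always runs under a new PID, so a crash in the forked process would be logged under a PID different from the one that was originally launched.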
> >>
> >> I don't know if you saw my comment in Launchpad today. I tried
> >> changing the method used to start PrivsepDaemon from Method.ROOTWRAP to
> >> Method.FORK (in
> >> https://github.com/openstack/oslo.privsep/blob/master/oslo_privsep/priv_context.py#L218)
> >> and running the tests as root; the tests then passed.
> >
> > Yeah, I saw that, but I don't understand it. :-/
> >
> > The daemon should end up running with the same capabilities in either
> > case. By the time it starts making the C calls the environment should be
> > identical, regardless of which method was used to start the process.
> >
> >>
> >> —
> >> Slawek Kaplonski
> >> Senior software engineer
> >> Red Hat
> >>
> >>> On 08.01.2019, at 20:04, Ben Nemec <openstack at nemebean.com> wrote:
> >>>
> >>> Further update: I dusted off my gdb skills and attached it to the
> >>> privsep process to try to get more details about exactly what is
> >>> crashing. It looks like the segfault happens on this line:
> >>>
> >>>
> >>> https://git.netfilter.org/libnetfilter_conntrack/tree/src/conntrack/api.c#n239
> >>>
> >>>
> >>> which is
> >>>
> >>> h->cb = cb;
> >>>
> >>> h being the conntrack handle and cb being the callback function.
> >>>
> >>> This makes me think the problem isn't the callback itself (even if we
> >>> assigned a bogus pointer, which we didn't, it shouldn't cause a
> >>> segfault unless you try to dereference it) but in the handle we pass
> >>> in. Trying to look at h->cb results in:
> >>>
> >>> (gdb) print h->cb
> >>> Cannot access memory at address 0x800f228
> >>>
> >>> Interestingly, h itself is fine:
> >>>
> >>> (gdb) print h
> >>> $3 = (struct nfct_handle *) 0x800f1e0
> >>>
> >>> It doesn't _look_ to me like the handle should be crossing any thread
> >>> boundaries or anything, so I'm not sure why it would be a problem. It
> >>> gets created in the same privileged function that ultimately
> >>> registers the callback:
> >>>
> >>> https://github.com/openstack/neutron/blob/aa8a6ea848aae6882abb631b7089836dee8f4008/neutron/privileged/agent/linux/netlink_lib.py#L246
> >>>
> >>>
> >>> So still not sure what's going on, but I thought I'd share what I've
> >>> found before I stop to eat something.
> >>>
> >>> -Ben
> >>>
> >>> On 1/7/19 12:11 PM, Ben Nemec wrote:
> >>>> Renamed the thread to be more descriptive.
> >>>>
> >>>> Just to update the list on this, it looks like the problem is a
> >>>> segfault when the netlink_lib module makes a C call. Digging into
> >>>> that code a bit, it appears there is a callback being used[1]. I've
> >>>> seen some comments that when you use a callback with a Python
> >>>> thread, the thread needs to be registered somehow, but this is all
> >>>> uncharted territory for me. Suggestions gratefully accepted. :-)
> >>>>
> >>>> 1:
> >>>> https://github.com/openstack/neutron/blob/master/neutron/privileged/agent/linux/netlink_lib.py#L136
> >>>>
> >>>> On 1/4/19 7:28 AM, Slawomir Kaplonski wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I just found that functional tests in Neutron have been failing since
> >>>>> today or maybe yesterday. See [1].
> >>>>> I was able to reproduce it locally, and it looks like it happens
> >>>>> with oslo.privsep==1.31. With oslo.privsep==1.30.1 the tests are fine.
> >>>>>
> >>>>> [1] https://bugs.launchpad.net/neutron/+bug/1810518
> >>>>>
> >>>>> —
> >>>>> Slawek Kaplonski
> >>>>> Senior software engineer
> >>>>> Red Hat
> >>>>>
> >>
> >
>
>