I tried to set up a test environment for this, but I'm having some issues. My local environment is defaulting to python 3, while the gate job appears to have been running under python 2. I'm not sure why it's doing that since the tox env definition doesn't specify python 3 (maybe something to do with https://review.openstack.org/#/c/622415/ ?), but either way I keep running into import issues. I'll take another look tomorrow, but in the meantime I'm afraid I haven't made any meaningful progress. :-( On 1/14/19 6:36 AM, Akihiro Motoki wrote:
The similar failure happens in neutron-fwaas. This blocks several patches in neutron-fwaas including policy-in-code support. https://bugs.launchpad.net/neutron/+bug/1811506
Most failures are fixed by applying Ben's neutron fix https://review.openstack.org/#/c/629335/ [1], but we still have one failure in neutron_fwaas.tests.functional.privileged.test_utils.InNamespaceTest.test_in_namespace [2]. This failure is caused by oslo.privsep 1.31.0 too. This does not happen with 1.30.1. Any help would be appreciated.
[1] neutron-fwaas change https://review.openstack.org/#/c/630451/ [2] http://logs.openstack.org/51/630451/2/check/legacy-neutron-fwaas-dsvm-functi...
-- Akihiro Motoki (irc: amotoki)
2019年1月9日(水) 9:32 Ben Nemec <openstack@nemebean.com <mailto:openstack@nemebean.com>>:
I think I've got it. At least in my local tests, the handle pointer being passed from C -> Python -> C was getting truncated at the Python step because we didn't properly define the type. If the address assigned was larger than would fit in a standard int then we passed what amounted to a bogus pointer back to the C code, which caused the segfault.
I have no idea why privsep threading would have exposed this, other than maybe running in threads affected the address space somehow?
In any case, https://review.openstack.org/629335 has got these functional tests working for me locally in oslo.privsep 1.31.0. It would be great if somebody could try them out and verify that I didn't just find a solution that somehow only works on my system. :-)
-Ben
On 1/8/19 4:30 PM, Ben Nemec wrote: > > > On 1/8/19 2:22 PM, Slawomir Kaplonski wrote: >> Hi Ben, >> >> I was also looking at it today. I’m totally not an C and Oslo.privsep >> expert but I think that there is some new process spawned here. >> I put pdb before line >> https://github.com/openstack/neutron/blob/master/neutron/privileged/agent/li...
>> where this issue happen. Then, with "ps aux” I saw: >> >> vagrant@fullstack-ubuntu ~ $ ps aux | grep privsep >> root 18368 0.1 0.5 185752 33544 pts/1 Sl+ 13:24 0:00 >> /opt/stack/neutron/.tox/dsvm-functional/bin/python >> /opt/stack/neutron/.tox/dsvm-functional/bin/privsep-helper >> --config-file neutron/tests/etc/neutron.conf --privsep_context >> neutron.privileged.default --privsep_sock_path >> /tmp/tmpG5iqb9/tmp1dMGq0/privsep.sock >> vagrant 18555 0.0 0.0 14512 1092 pts/2 S+ 13:25 0:00 grep >> --color=auto privsep >> >> But then when I continue run test, and it segfaulted, in journal log I >> have: >> >> Jan 08 13:25:29 fullstack-ubuntu kernel: privsep-helper[18369] >> segfault at 140043e8 ip 00007f8e1800ef32 sp 00007f8e18a63320 error 4 >> in libnetfilter_conntrack.so.3.5.0[7f8e18009000+1a000] >> >> Please check pics of those processes. First one (when test was >> „paused” with pdb) has 18368 and later segfault has 18369. > > privsep-helper does fork, so I _think_ that's normal. > > https://github.com/openstack/oslo.privsep/blob/ecb1870c29b760f09fb933fc8ebb3...
> > >> >> I don’t know if You saw my today’s comment in launchpad. I was trying >> to change method used to start PrivsepDaemon from Method.ROOTWRAP to >> Method.FORK (in >> https://github.com/openstack/oslo.privsep/blob/master/oslo_privsep/priv_cont...)
>> and run test as root, then tests were passed. > > Yeah, I saw that, but I don't understand it. :-/ > > The daemon should end up running with the same capabilities in either > case. By the time it starts making the C calls the environment should be > identical, regardless of which method was used to start the process. > >> >> — >> Slawek Kaplonski >> Senior software engineer >> Red Hat >> >>> Wiadomość napisana przez Ben Nemec <openstack@nemebean.com <mailto:openstack@nemebean.com>> w dniu >>> 08.01.2019, o godz. 20:04: >>> >>> Further update: I dusted off my gdb skills and attached it to the >>> privsep process to try to get more details about exactly what is >>> crashing. It looks like the segfault happens on this line: >>> >>> https://git.netfilter.org/libnetfilter_conntrack/tree/src/conntrack/api.c#n2...
>>> >>> >>> which is >>> >>> h->cb = cb; >>> >>> h being the conntrack handle and cb being the callback function. >>> >>> This makes me think the problem isn't the callback itself (even if we >>> assigned a bogus pointer, which we didn't, it shouldn't cause a >>> segfault unless you try to dereference it) but in the handle we pass >>> in. Trying to look at h->cb results in: >>> >>> (gdb) print h->cb >>> Cannot access memory at address 0x800f228 >>> >>> Interestingly, h itself is fine: >>> >>> (gdb) print h >>> $3 = (struct nfct_handle *) 0x800f1e0 >>> >>> It doesn't _look_ to me like the handle should be crossing any thread >>> boundaries or anything, so I'm not sure why it would be a problem. It >>> gets created in the same privileged function that ultimately >>> registers the callback: >>> https://github.com/openstack/neutron/blob/aa8a6ea848aae6882abb631b7089836dee...
>>> >>> >>> So still not sure what's going on, but I thought I'd share what I've >>> found before I stop to eat something. >>> >>> -Ben >>> >>> On 1/7/19 12:11 PM, Ben Nemec wrote: >>>> Renamed the thread to be more descriptive. >>>> Just to update the list on this, it looks like the problem is a >>>> segfault when the netlink_lib module makes a C call. Digging into >>>> that code a bit, it appears there is a callback being used[1]. I've >>>> seen some comments that when you use a callback with a Python >>>> thread, the thread needs to be registered somehow, but this is all >>>> uncharted territory for me. Suggestions gratefully accepted. :-) >>>> 1: >>>> https://github.com/openstack/neutron/blob/master/neutron/privileged/agent/li...
>>>> On 1/4/19 7:28 AM, Slawomir Kaplonski wrote: >>>>> Hi, >>>>> >>>>> I just found that functional tests in Neutron are failing since >>>>> today or maybe yesterday. See [1] >>>>> I was able to reproduce it locally and it looks that it happens >>>>> with oslo.privsep==1.31. With oslo.privsep==1.30.1 tests are fine. >>>>> >>>>> [1] https://bugs.launchpad.net/neutron/+bug/1810518 >>>>> >>>>> — >>>>> Slawek Kaplonski >>>>> Senior software engineer >>>>> Red Hat >>>>> >> >