[oslo][neutron] Neutron Functional Test Failures with oslo.privsep 1.31.0

Ben Nemec openstack at nemebean.com
Wed Jan 9 00:30:19 UTC 2019

I think I've got it. At least in my local tests, the handle pointer 
being passed from C -> Python -> C was getting truncated at the Python 
step because we didn't properly define its type. If the address assigned 
was too large to fit in a standard C int, we passed what amounted to a 
bogus pointer back to the C code, which caused the segfault.
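
To make the truncation concrete, here is a rough sketch. It is not the 
code from the patch; it assumes the binding goes through ctypes, and the 
address and the roundtrip() helper are made up for illustration. It only 
shows what happens to a 64-bit address when it is stored as a C int 
versus a pointer-sized type:

```python
import ctypes

# Hypothetical 64-bit address, chosen only for illustration.
full_address = 0x7F8E1800F1E0


def roundtrip(address, ctype):
    """Store an address in the given ctypes type and read it back."""
    return ctype(address).value


# ctypes integer types do no overflow checking, so a C int silently
# keeps only the low bits of the address (32 on common platforms).
truncated = roundtrip(full_address, ctypes.c_int)

# A pointer-sized type keeps the whole address on a 64-bit platform.
preserved = roundtrip(full_address, ctypes.c_void_p)

print(hex(truncated), hex(preserved))
```

With a real library, the usual remedy is to set the function's restype 
(and argtypes) to ctypes.c_void_p before calling it, since ctypes assumes 
a C int return type unless told otherwise.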

I have no idea why privsep threading would have exposed this, other than 
maybe running in threads affected the address space somehow?

In any case, https://review.openstack.org/629335 has got these 
functional tests working for me locally in oslo.privsep 1.31.0. It would 
be great if somebody could try them out and verify that I didn't just 
find a solution that somehow only works on my system. :-)


On 1/8/19 4:30 PM, Ben Nemec wrote:
> On 1/8/19 2:22 PM, Slawomir Kaplonski wrote:
>> Hi Ben,
>> I was also looking at it today. I'm by no means a C or oslo.privsep 
>> expert, but I think a new process is being spawned here.
>> I put pdb before line 
>> https://github.com/openstack/neutron/blob/master/neutron/privileged/agent/linux/netlink_lib.py#L191 
>> where this issue happens. Then, with "ps aux", I saw:
>> vagrant@fullstack-ubuntu ~ $ ps aux | grep privsep
>> root     18368  0.1  0.5 185752 33544 pts/1    Sl+  13:24   0:00 
>> /opt/stack/neutron/.tox/dsvm-functional/bin/python 
>> /opt/stack/neutron/.tox/dsvm-functional/bin/privsep-helper 
>> --config-file neutron/tests/etc/neutron.conf --privsep_context 
>> neutron.privileged.default --privsep_sock_path 
>> /tmp/tmpG5iqb9/tmp1dMGq0/privsep.sock
>> vagrant  18555  0.0  0.0  14512  1092 pts/2    S+   13:25   0:00 grep 
>> --color=auto privsep
>> But then when I continued running the test and it segfaulted, the 
>> journal log showed:
>> Jan 08 13:25:29 fullstack-ubuntu kernel: privsep-helper[18369] 
>> segfault at 140043e8 ip 00007f8e1800ef32 sp 00007f8e18a63320 error 4 
>> in libnetfilter_conntrack.so.3.5.0[7f8e18009000+1a000]
>> Please check the PIDs of those processes. The first one (when the test 
>> was "paused" with pdb) has 18368 and the later segfault has 18369.
> privsep-helper does fork, so I _think_ that's normal.
> https://github.com/openstack/oslo.privsep/blob/ecb1870c29b760f09fb933fc8ebb3eac29ffd03e/oslo_privsep/daemon.py#L539 
>> I don't know if you saw my comment in Launchpad today. I tried 
>> changing the method used to start PrivsepDaemon from Method.ROOTWRAP to 
>> Method.FORK (in 
>> https://github.com/openstack/oslo.privsep/blob/master/oslo_privsep/priv_context.py#L218) 
>> and running the tests as root, and then the tests passed.
> Yeah, I saw that, but I don't understand it. :-/
> The daemon should end up running with the same capabilities in either 
> case. By the time it starts making the C calls the environment should be 
> identical, regardless of which method was used to start the process.
>> Slawek Kaplonski
>> Senior software engineer
>> Red Hat
>>> On 08.01.2019, at 20:04, Ben Nemec <openstack at nemebean.com> 
>>> wrote:
>>> Further update: I dusted off my gdb skills and attached it to the 
>>> privsep process to try to get more details about exactly what is 
>>> crashing. It looks like the segfault happens on this line:
>>> https://git.netfilter.org/libnetfilter_conntrack/tree/src/conntrack/api.c#n239 
>>> which is
>>> h->cb = cb;
>>> h being the conntrack handle and cb being the callback function.
>>> This makes me think the problem isn't the callback itself (even if we 
>>> assigned a bogus pointer, which we didn't, it shouldn't cause a 
>>> segfault unless you try to dereference it) but in the handle we pass 
>>> in. Trying to look at h->cb results in:
>>> (gdb) print h->cb
>>> Cannot access memory at address 0x800f228
>>> Interestingly, h itself is fine:
>>> (gdb) print h
>>> $3 = (struct nfct_handle *) 0x800f1e0
>>> It doesn't _look_ to me like the handle should be crossing any thread 
>>> boundaries or anything, so I'm not sure why it would be a problem. It 
>>> gets created in the same privileged function that ultimately 
>>> registers the callback: 
>>> https://github.com/openstack/neutron/blob/aa8a6ea848aae6882abb631b7089836dee8f4008/neutron/privileged/agent/linux/netlink_lib.py#L246 
>>> So still not sure what's going on, but I thought I'd share what I've 
>>> found before I stop to eat something.
>>> -Ben
>>> On 1/7/19 12:11 PM, Ben Nemec wrote:
>>>> Renamed the thread to be more descriptive.
>>>> Just to update the list on this, it looks like the problem is a 
>>>> segfault when the netlink_lib module makes a C call. Digging into 
>>>> that code a bit, it appears there is a callback being used[1]. I've 
>>>> seen some comments that when you use a callback with a Python 
>>>> thread, the thread needs to be registered somehow, but this is all 
>>>> uncharted territory for me. Suggestions gratefully accepted. :-)
>>>> 1: 
>>>> https://github.com/openstack/neutron/blob/master/neutron/privileged/agent/linux/netlink_lib.py#L136 
>>>> On 1/4/19 7:28 AM, Slawomir Kaplonski wrote:
>>>>> Hi,
>>>>> I just found that functional tests in Neutron have been failing since 
>>>>> today or maybe yesterday. See [1]
>>>>> I was able to reproduce it locally and it looks like it happens 
>>>>> with oslo.privsep==1.31. With oslo.privsep==1.30.1 tests are fine.
>>>>> [1] https://bugs.launchpad.net/neutron/+bug/1810518
>>>>> Slawek Kaplonski
>>>>> Senior software engineer
>>>>> Red Hat
