[oslo][neutron] Neutron Functional Test Failures with oslo.privsep 1.31.0

Ben Nemec openstack at nemebean.com
Tue Jan 8 22:30:03 UTC 2019



On 1/8/19 2:22 PM, Slawomir Kaplonski wrote:
> Hi Ben,
> 
> I was also looking at it today. I'm by no means a C or oslo.privsep expert, but I think a new process is being spawned here.
> I put a pdb breakpoint before the line https://github.com/openstack/neutron/blob/master/neutron/privileged/agent/linux/netlink_lib.py#L191 where this issue happens. Then, with "ps aux", I saw:
> 
> vagrant@fullstack-ubuntu ~ $ ps aux | grep privsep
> root     18368  0.1  0.5 185752 33544 pts/1    Sl+  13:24   0:00 /opt/stack/neutron/.tox/dsvm-functional/bin/python /opt/stack/neutron/.tox/dsvm-functional/bin/privsep-helper --config-file neutron/tests/etc/neutron.conf --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpG5iqb9/tmp1dMGq0/privsep.sock
> vagrant  18555  0.0  0.0  14512  1092 pts/2    S+   13:25   0:00 grep --color=auto privsep
> 
> But then, when I let the test continue and it segfaulted, the journal log showed:
> 
> Jan 08 13:25:29 fullstack-ubuntu kernel: privsep-helper[18369] segfault at 140043e8 ip 00007f8e1800ef32 sp 00007f8e18a63320 error 4 in libnetfilter_conntrack.so.3.5.0[7f8e18009000+1a000]
> 
> Please check the PIDs of those processes. The first one (while the test was "paused" in pdb) is 18368, and the later segfault came from 18369.

privsep-helper does fork, so I _think_ that's normal.

https://github.com/openstack/oslo.privsep/blob/ecb1870c29b760f09fb933fc8ebb3eac29ffd03e/oslo_privsep/daemon.py#L539
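
To illustrate why the two PIDs would differ, here is a generic sketch of a process that forks once. It is not the oslo.privsep daemon code; fork_and_report is a made-up name, and the only point is that the forked child gets a PID of its own, so a crash in the child is logged under a different PID than the one ps showed earlier.

```python
import os


def fork_and_report():
    """Fork once; return (parent_pid, child_pid, pid_reported_by_child)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: send our own PID back through the pipe and exit right away.
        os.close(read_fd)
        os.write(write_fd, str(os.getpid()).encode())
        os._exit(0)
    # Parent: read what the child reported, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        reported = int(reader.read())
    os.waitpid(pid, 0)
    return os.getpid(), pid, reported


if __name__ == "__main__":
    parent, child, reported = fork_and_report()
    print("parent PID:", parent)
    print("child PID:", child, "(child reported", str(reported) + ")")
```

Anything the child does after the fork, including a segfault in a C library, is attributed to the child's PID in the kernel log.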

> 
> I don't know if you saw my comment in Launchpad today. I tried changing the method used to start PrivsepDaemon from Method.ROOTWRAP to Method.FORK (in https://github.com/openstack/oslo.privsep/blob/master/oslo_privsep/priv_context.py#L218) and ran the tests as root; with that change the tests passed.

Yeah, I saw that, but I don't understand it. :-/

The daemon should end up running with the same capabilities in either 
case. By the time it starts making the C calls the environment should be 
identical, regardless of which method was used to start the process.

> 
> Slawek Kaplonski
> Senior software engineer
> Red Hat
> 
>> On 08.01.2019, at 20:04, Ben Nemec <openstack at nemebean.com> wrote:
>>
>> Further update: I dusted off my gdb skills and attached it to the privsep process to try to get more details about exactly what is crashing. It looks like the segfault happens on this line:
>>
>> https://git.netfilter.org/libnetfilter_conntrack/tree/src/conntrack/api.c#n239
>>
>> which is
>>
>> h->cb = cb;
>>
>> h being the conntrack handle and cb being the callback function.
>>
>> This makes me think the problem isn't the callback itself (even if we assigned a bogus pointer, which we didn't, it shouldn't cause a segfault unless you try to dereference it) but in the handle we pass in. Trying to look at h->cb results in:
>>
>> (gdb) print h->cb
>> Cannot access memory at address 0x800f228
>>
>> Interestingly, h itself is fine:
>>
>> (gdb) print h
>> $3 = (struct nfct_handle *) 0x800f1e0
>>
>> It doesn't _look_ to me like the handle should be crossing any thread boundaries or anything, so I'm not sure why it would be a problem. It gets created in the same privileged function that ultimately registers the callback: https://github.com/openstack/neutron/blob/aa8a6ea848aae6882abb631b7089836dee8f4008/neutron/privileged/agent/linux/netlink_lib.py#L246
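
One thing that might be worth ruling out here, and this is only a guess rather than anything established in this thread: when a C function is called through ctypes without a declared restype, ctypes treats the return value as a C int, so on a 64-bit system a returned pointer comes back cut down to its low 32 bits. A handle like that can still print as a plausible-looking address while pointing at memory that is not mapped. The sketch below uses libc's memchr purely as a stand-in for a pointer-returning function; it is not the neutron code, and call_memchr is a name made up for the illustration.

```python
import ctypes

# Illustration only: memchr stands in for any C function returning a pointer.
# CDLL(None) exposes symbols already loaded into this process (Linux).
libc = ctypes.CDLL(None)

# Zero-filled buffer, so searching for byte 0 returns the buffer's own address.
buf = ctypes.create_string_buffer(16)
addr = ctypes.addressof(buf)


def call_memchr(restype=None):
    """Call memchr(buf, 0, len(buf)), optionally declaring the return type."""
    fn = libc["memchr"]  # item access returns a fresh, uncached function object
    if restype is not None:
        fn.restype = restype
    return fn(buf, ctypes.c_int(0), ctypes.c_size_t(len(buf)))


untyped = call_memchr()               # default restype: C int
typed = call_memchr(ctypes.c_void_p)  # declared as a full-width pointer

print("real address:      ", hex(addr))
print("restype=c_void_p:  ", hex(typed))
print("default restype:   ", hex(untyped & 0xFFFFFFFF))
```

If the untyped value differs from the real address, passing it back into C as a handle would hand the library a pointer to the wrong place. Declaring restype and argtypes on the functions involved is the usual way to avoid that.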
>>
>> So still not sure what's going on, but I thought I'd share what I've found before I stop to eat something.
>>
>> -Ben
>>
>> On 1/7/19 12:11 PM, Ben Nemec wrote:
>>> Renamed the thread to be more descriptive.
>>>
>>> Just to update the list on this, it looks like the problem is a segfault when the netlink_lib module makes a C call. Digging into that code a bit, it appears there is a callback being used[1]. I've seen some comments that when you use a callback with a Python thread, the thread needs to be registered somehow, but this is all uncharted territory for me. Suggestions gratefully accepted. :-)
>>>
>>> 1: https://github.com/openstack/neutron/blob/master/neutron/privileged/agent/linux/netlink_lib.py#L136
>>>
>>> On 1/4/19 7:28 AM, Slawomir Kaplonski wrote:
>>>> Hi,
>>>>
>>>> I just found that functional tests in Neutron have been failing since today or maybe yesterday. See [1].
>>>> I was able to reproduce it locally, and it looks like it happens with oslo.privsep==1.31. With oslo.privsep==1.30.1 the tests are fine.
>>>>
>>>> [1] https://bugs.launchpad.net/neutron/+bug/1810518
>>>>
>>>> Slawek Kaplonski
>>>> Senior software engineer
>>>> Red Hat
>>>>
> 


