I think I've got it. At least in my local tests, the handle pointer being passed from C -> Python -> C was getting truncated at the Python step because we didn't properly define the type. If the address assigned was larger than would fit in a standard int then we passed what amounted to a bogus pointer back to the C code, which caused the segfault.
I have no idea why privsep threading would have exposed this, other than maybe running in threads affected the address space somehow?
In any case, https://review.openstack.org/629335 has got these functional tests working for me locally in oslo.privsep 1.31.0. It would be great if somebody could try them out and verify that I didn't just find a solution that somehow only works on my system. :-)
-Ben
On 1/8/19 4:30 PM, Ben Nemec wrote:
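For anyone following along, here is a minimal sketch of the truncation described in the message above. It is not the code from the review. It assumes a 64-bit platform with a 32-bit C int, and REAL_HANDLE is a made-up address whose low 32 bits were chosen to match the handle value that shows up in the gdb output further down the thread. When no restype or argtypes are declared, ctypes treats return values and Python integer arguments as C int, so the upper half of a pointer is dropped.

```python
import ctypes

# Hypothetical 64-bit handle address, made up for illustration only.
# Its low 32 bits match the handle value gdb printed (0x800f1e0).
REAL_HANDLE = 0x7F8E0800F1E0


def through_default_int(address):
    """Squeeze an address through a C int, which is what ctypes uses
    for return values and integer arguments when no types are declared."""
    return ctypes.c_int(address).value


def through_void_p(address):
    """Carry the address in a pointer-sized type instead."""
    return ctypes.c_void_p(address).value


truncated = through_default_int(REAL_HANDLE)
preserved = through_void_p(REAL_HANDLE)

print(hex(truncated))   # upper 32 bits are gone
print(hex(preserved))   # full address survives on a 64-bit platform
```

With a real library, the usual remedy is to set restype to ctypes.c_void_p on the function that returns the handle and to list ctypes.c_void_p in argtypes for every function that takes it, so the value is never converted through a C int in either direction.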
On 1/8/19 2:22 PM, Slawomir Kaplonski wrote:
Hi Ben,
I was also looking at it today. I'm by no means a C or oslo.privsep expert, but I think a new process is being spawned here. I put pdb before the line https://github.com/openstack/neutron/blob/master/neutron/privileged/agent/li... where this issue happens. Then, with "ps aux", I saw:
vagrant@fullstack-ubuntu ~ $ ps aux | grep privsep
root 18368 0.1 0.5 185752 33544 pts/1 Sl+ 13:24 0:00 /opt/stack/neutron/.tox/dsvm-functional/bin/python /opt/stack/neutron/.tox/dsvm-functional/bin/privsep-helper --config-file neutron/tests/etc/neutron.conf --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpG5iqb9/tmp1dMGq0/privsep.sock
vagrant 18555 0.0 0.0 14512 1092 pts/2 S+ 13:25 0:00 grep --color=auto privsep
But then when I let the test continue and it segfaulted, the journal log showed:
Jan 08 13:25:29 fullstack-ubuntu kernel: privsep-helper[18369] segfault at 140043e8 ip 00007f8e1800ef32 sp 00007f8e18a63320 error 4 in libnetfilter_conntrack.so.3.5.0[7f8e18009000+1a000]
Please check the PIDs of those processes. The first one (when the test was "paused" with pdb) is 18368, and the later segfault is in 18369.
privsep-helper does fork, so I _think_ that's normal.
https://github.com/openstack/oslo.privsep/blob/ecb1870c29b760f09fb933fc8ebb3...
I don't know if you saw my comment in Launchpad today. I tried changing the method used to start PrivsepDaemon from Method.ROOTWRAP to Method.FORK (in https://github.com/openstack/oslo.privsep/blob/master/oslo_privsep/priv_cont...) and ran the tests as root, and then the tests passed.
Yeah, I saw that, but I don't understand it. :-/
The daemon should end up running with the same capabilities in either case. By the time it starts making the C calls the environment should be identical, regardless of which method was used to start the process.
— Slawek Kaplonski Senior software engineer Red Hat
On 08.01.2019, at 20:04, Ben Nemec <openstack@nemebean.com> wrote:
Further update: I dusted off my gdb skills and attached it to the privsep process to try to get more details about exactly what is crashing. It looks like the segfault happens on this line:
https://git.netfilter.org/libnetfilter_conntrack/tree/src/conntrack/api.c#n2...
which is
h->cb = cb;
h being the conntrack handle and cb being the callback function.
This makes me think the problem isn't the callback itself (even if we assigned a bogus pointer, which we didn't, it shouldn't cause a segfault unless you try to dereference it) but in the handle we pass in. Trying to look at h->cb results in:
(gdb) print h->cb
Cannot access memory at address 0x800f228
Interestingly, h itself is fine:
(gdb) print h
$3 = (struct nfct_handle *) 0x800f1e0
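As a small aside, the arithmetic on the two gdb values above can be sketched as follows. Only the two addresses come from the gdb output; the variable names are made up here. The sketch computes how far into the struct the cb field sits and checks whether the handle value fits in 32 bits. A small value on its own proves nothing, but it is the kind of thing worth noticing on a 64-bit system.

```python
H = 0x800F1E0      # value of h printed by gdb
H_CB = 0x800F228   # address gdb could not read when evaluating h->cb

# Distance from the start of the handle to the cb field, in bytes.
cb_offset = H_CB - H

# True when the handle value would fit in a 32-bit integer.
fits_in_32_bits = H < 2**32

print(hex(cb_offset), fits_in_32_bits)
```

So gdb can print the pointer value itself, but the memory it points at is not readable, which is what one would expect from a pointer that is well formed but wrong.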
It doesn't _look_ to me like the handle should be crossing any thread boundaries or anything, so I'm not sure why it would be a problem. It gets created in the same privileged function that ultimately registers the callback: https://github.com/openstack/neutron/blob/aa8a6ea848aae6882abb631b7089836dee...
So still not sure what's going on, but I thought I'd share what I've found before I stop to eat something.
-Ben
On 1/7/19 12:11 PM, Ben Nemec wrote:
Renamed the thread to be more descriptive.
Just to update the list on this, it looks like the problem is a segfault when the netlink_lib module makes a C call. Digging into that code a bit, it appears there is a callback being used[1]. I've seen some comments that when you use a callback with a Python thread, the thread needs to be registered somehow, but this is all uncharted territory for me. Suggestions gratefully accepted. :-)
1: https://github.com/openstack/neutron/blob/master/neutron/privileged/agent/li...
On 1/4/19 7:28 AM, Slawomir Kaplonski wrote:
Hi,
I just found that functional tests in Neutron have been failing since today or maybe yesterday. See [1]. I was able to reproduce it locally, and it looks like it happens with oslo.privsep==1.31. With oslo.privsep==1.30.1 the tests are fine.
[1] https://bugs.launchpad.net/neutron/+bug/1810518
— Slawek Kaplonski Senior software engineer Red Hat