I think I've got it. At least in my local tests, the handle pointer
being passed from C -> Python -> C was getting truncated at the Python
step because we didn't properly define the type. If the address assigned
was larger than would fit in a standard int then we passed what amounted
to a bogus pointer back to the C code, which caused the segfault.
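
For anyone following along, the truncation can be reproduced without the
library at all. This is a minimal sketch, assuming a 64-bit platform where
a C int is 32 bits; the address is made up and the helper names are purely
illustrative, not taken from the neutron patch. It only mimics what ctypes
does when a foreign function's return or argument type is left undeclared
and the value is treated as a C int.

```python
import ctypes


def as_default_int(address):
    """Mimic a pointer passing through an undeclared (C int) ctypes slot.

    ctypes assumes foreign functions return int unless restype is set,
    and integer arguments are masked to fit a C int unless argtypes is set.
    """
    return ctypes.c_int(address).value


def as_void_p(address):
    """Mimic the same pointer when the slot is declared as c_void_p."""
    return ctypes.c_void_p(address).value


if __name__ == "__main__":
    # Hypothetical handle address that needs more than 32 bits.
    handle = 0x7F8E1800F1E0
    print("real address:   ", hex(handle))
    print("through c_int:  ", hex(as_default_int(handle)))
    print("through void *: ", hex(as_void_p(handle)))
```

The usual remedy in real ctypes code is to set restype and argtypes to
ctypes.c_void_p (or a proper POINTER type) on every function that returns
or accepts the handle, so the full address survives the round trip.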
I have no idea why privsep threading would have exposed this, other than
maybe running in threads affected the address space somehow?
In any case, https://review.openstack.org/629335 has got these
functional tests working for me locally in oslo.privsep 1.31.0. It would
be great if somebody could try them out and verify that I didn't just
find a solution that somehow only works on my system. :-)
-Ben
On 1/8/19 4:30 PM, Ben Nemec wrote:
>
>
> On 1/8/19 2:22 PM, Slawomir Kaplonski wrote:
>> Hi Ben,
>>
>> I was also looking at it today. I'm not at all a C or oslo.privsep
>> expert, but I think there is a new process being spawned here.
>> I put a pdb breakpoint before the line
>> https://github.com/openstack/neutron/blob/master/neutron/privileged/agent/linux/netlink_lib.py#L191
>> where this issue happens. Then, with "ps aux", I saw:
>>
>> vagrant@fullstack-ubuntu ~ $ ps aux | grep privsep
>> root 18368 0.1 0.5 185752 33544 pts/1 Sl+ 13:24 0:00
>> /opt/stack/neutron/.tox/dsvm-functional/bin/python
>> /opt/stack/neutron/.tox/dsvm-functional/bin/privsep-helper
>> --config-file neutron/tests/etc/neutron.conf --privsep_context
>> neutron.privileged.default --privsep_sock_path
>> /tmp/tmpG5iqb9/tmp1dMGq0/privsep.sock
>> vagrant 18555 0.0 0.0 14512 1092 pts/2 S+ 13:25 0:00 grep
>> --color=auto privsep
>>
>> But then, when I continued running the test and it segfaulted, the
>> journal log showed:
>>
>> Jan 08 13:25:29 fullstack-ubuntu kernel: privsep-helper[18369]
>> segfault at 140043e8 ip 00007f8e1800ef32 sp 00007f8e18a63320 error 4
>> in libnetfilter_conntrack.so.3.5.0[7f8e18009000+1a000]
>>
>> Please check the PIDs of those processes. The first one (when the test
>> was "paused" with pdb) has 18368, and the later segfault has 18369.
>
> privsep-helper does fork, so I _think_ that's normal.
>
> https://github.com/openstack/oslo.privsep/blob/ecb1870c29b760f09fb933fc8ebb3eac29ffd03e/oslo_privsep/daemon.py#L539
>
>
>>
>> I don't know if you saw my comment in Launchpad today. I tried
>> changing the method used to start PrivsepDaemon from Method.ROOTWRAP to
>> Method.FORK (in
>> https://github.com/openstack/oslo.privsep/blob/master/oslo_privsep/priv_context.py#L218)
>> and running the tests as root; with that change the tests passed.
>
> Yeah, I saw that, but I don't understand it. :-/
>
> The daemon should end up running with the same capabilities in either
> case. By the time it starts making the C calls the environment should be
> identical, regardless of which method was used to start the process.
>
>>
>> —
>> Slawek Kaplonski
>> Senior software engineer
>> Red Hat
>>
>>> On 08.01.2019, at 20:04, Ben Nemec <openstack@nemebean.com>
>>> wrote:
>>>
>>> Further update: I dusted off my gdb skills and attached it to the
>>> privsep process to try to get more details about exactly what is
>>> crashing. It looks like the segfault happens on this line:
>>>
>>> https://git.netfilter.org/libnetfilter_conntrack/tree/src/conntrack/api.c#n239
>>>
>>>
>>> which is
>>>
>>> h->cb = cb;
>>>
>>> h being the conntrack handle and cb being the callback function.
>>>
>>> This makes me think the problem isn't the callback itself (even if we
>>> assigned a bogus pointer, which we didn't, it shouldn't cause a
>>> segfault unless you try to dereference it) but in the handle we pass
>>> in. Trying to look at h->cb results in:
>>>
>>> (gdb) print h->cb
>>> Cannot access memory at address 0x800f228
>>>
>>> Interestingly, h itself is fine:
>>>
>>> (gdb) print h
>>> $3 = (struct nfct_handle *) 0x800f1e0
>>>
>>> It doesn't _look_ to me like the handle should be crossing any thread
>>> boundaries or anything, so I'm not sure why it would be a problem. It
>>> gets created in the same privileged function that ultimately
>>> registers the callback:
>>> https://github.com/openstack/neutron/blob/aa8a6ea848aae6882abb631b7089836dee8f4008/neutron/privileged/agent/linux/netlink_lib.py#L246
>>>
>>>
>>> So still not sure what's going on, but I thought I'd share what I've
>>> found before I stop to eat something.
>>>
>>> -Ben
>>>
>>> On 1/7/19 12:11 PM, Ben Nemec wrote:
>>>> Renamed the thread to be more descriptive.
>>>> Just to update the list on this, it looks like the problem is a
>>>> segfault when the netlink_lib module makes a C call. Digging into
>>>> that code a bit, it appears there is a callback being used[1]. I've
>>>> seen some comments that when you use a callback with a Python
>>>> thread, the thread needs to be registered somehow, but this is all
>>>> uncharted territory for me. Suggestions gratefully accepted. :-)
>>>> 1:
>>>> https://github.com/openstack/neutron/blob/master/neutron/privileged/agent/linux/netlink_lib.py#L136
>>>> On 1/4/19 7:28 AM, Slawomir Kaplonski wrote:
>>>>> Hi,
>>>>>
>>>>> I just found that functional tests in Neutron have been failing
>>>>> since today or maybe yesterday. See [1].
>>>>> I was able to reproduce it locally, and it looks like it happens
>>>>> with oslo.privsep==1.31. With oslo.privsep==1.30.1 the tests are fine.
>>>>>
>>>>> [1] https://bugs.launchpad.net/neutron/+bug/1810518
>>>>>
>>>>> —
>>>>> Slawek Kaplonski
>>>>> Senior software engineer
>>>>> Red Hat
>>>>>
>>
>