Yay alliteration! :-)

I wanted to draw attention to this release[1] in particular because it includes the parallel privsep change[2]. While it shouldn't have any effect on the public API of the library, it does significantly affect how privsep will process calls on the back end. Specifically, multiple calls can now be processed at the same time, so if any privileged code is not reentrant it's possible that new race bugs could pop up.

While this sounds scary, it's a necessary change to allow use of privsep in situations where a privileged call may take a non-trivial amount of time. Cinder in particular has some privileged calls that are long-running and can't afford to block all other privileged calls on them.

So if you're a consumer of oslo.privsep please keep your eyes open for issues related to this new release and contact the Oslo team if you find any. Thanks.

-Ben

1: https://review.openstack.org/628019
2: https://review.openstack.org/#/c/593556/
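For anyone wondering what "not reentrant" looks like in practice, here is a contrived sketch (made-up names; not oslo.privsep or neutron code) of privileged code that was safe under the old one-call-at-a-time daemon but can race once calls are handled in parallel:

    from __future__ import print_function

    import threading
    import time

    # Contrived example: a privileged helper that stashes state in a
    # module-level global between steps of its work.
    _current_device = None

    def privileged_configure(device, address):
        global _current_device
        _current_device = device
        time.sleep(0.01)  # stand-in for a slow privileged operation
        # Another thread may have overwritten _current_device by now.
        print('%s configured on %s' % (address, _current_device))

    threads = [threading.Thread(target=privileged_configure, args=('eth0', '10.0.0.1')),
               threading.Thread(target=privileged_configure, args=('eth1', '10.0.0.2'))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Run it a few times and one thread will often report the wrong device, which is exactly the class of bug to watch for after upgrading.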
Hi,

I just found that functional tests in Neutron have been failing since today or maybe yesterday. See [1]. I was able to reproduce it locally and it looks like it happens with oslo.privsep==1.31. With oslo.privsep==1.30.1 the tests are fine.

[1] https://bugs.launchpad.net/neutron/+bug/1810518

—
Slawek Kaplonski
Senior software engineer
Red Hat
Renamed the thread to be more descriptive.

Just to update the list on this, it looks like the problem is a segfault when the netlink_lib module makes a C call. Digging into that code a bit, it appears there is a callback being used[1]. I've seen some comments that when you use a callback with a Python thread, the thread needs to be registered somehow, but this is all uncharted territory for me. Suggestions gratefully accepted. :-)

1: https://github.com/openstack/neutron/blob/master/neutron/privileged/agent/li...
Hi Ben,
Maybe it's something as mentioned at the end of this section? https://docs.python.org/2/library/ctypes.html#callback-functions

"Note: Make sure you keep references to CFUNCTYPE() objects as long as they are used from C code. ctypes doesn’t, and if you don’t, they may be garbage collected, crashing your program when a callback is made.

Also, note that if the callback function is called in a thread created outside of Python’s control (e.g. by the foreign code that calls the callback), ctypes creates a new dummy Python thread on every invocation. This behavior is correct for most purposes, but it means that values stored with threading.local will not survive across different callbacks, even when those calls are made from the same C thread."

I can try keeping a reference to the callback function and see if it makes any difference, but I'm assuming it's not that easy.

-Brian
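As a generic ctypes illustration of that first point (this is not the netlink_lib code, just the usual pattern, shown here with libc's qsort): keep the CFUNCTYPE wrapper in a variable that outlives every C call that might invoke it.

    import ctypes

    libc = ctypes.CDLL('libc.so.6')

    # The comparator signature qsort expects: int (*)(const void *, const void *)
    CMPFUNC = ctypes.CFUNCTYPE(ctypes.c_int,
                               ctypes.POINTER(ctypes.c_int),
                               ctypes.POINTER(ctypes.c_int))

    def py_cmp(a, b):
        return a[0] - b[0]

    # Keep the wrapper in a long-lived variable; if it is garbage
    # collected while C still holds the function pointer, the next
    # callback invocation crashes.
    cmp_callback = CMPFUNC(py_cmp)

    values = (ctypes.c_int * 5)(5, 1, 4, 2, 3)
    libc.qsort(values, len(values), ctypes.sizeof(ctypes.c_int), cmp_callback)
    print(list(values))  # [1, 2, 3, 4, 5]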
Further update: I dusted off my gdb skills and attached it to the privsep process to try to get more details about exactly what is crashing. It looks like the segfault happens on this line: https://git.netfilter.org/libnetfilter_conntrack/tree/src/conntrack/api.c#n2...

which is

h->cb = cb;

h being the conntrack handle and cb being the callback function.

This makes me think the problem isn't the callback itself (even if we assigned a bogus pointer, which we didn't, it shouldn't cause a segfault unless you try to dereference it) but in the handle we pass in. Trying to look at h->cb results in:

(gdb) print h->cb
Cannot access memory at address 0x800f228

Interestingly, h itself is fine:

(gdb) print h
$3 = (struct nfct_handle *) 0x800f1e0

It doesn't _look_ to me like the handle should be crossing any thread boundaries or anything, so I'm not sure why it would be a problem. It gets created in the same privileged function that ultimately registers the callback: https://github.com/openstack/neutron/blob/aa8a6ea848aae6882abb631b7089836dee...

So still not sure what's going on, but I thought I'd share what I've found before I stop to eat something.

-Ben
Hi Ben,

I was also looking at it today. I'm totally not a C and oslo.privsep expert, but I think that there is some new process spawned here. I put pdb before the line https://github.com/openstack/neutron/blob/master/neutron/privileged/agent/li... where this issue happens. Then, with "ps aux" I saw:

vagrant@fullstack-ubuntu ~ $ ps aux | grep privsep
root 18368 0.1 0.5 185752 33544 pts/1 Sl+ 13:24 0:00 /opt/stack/neutron/.tox/dsvm-functional/bin/python /opt/stack/neutron/.tox/dsvm-functional/bin/privsep-helper --config-file neutron/tests/etc/neutron.conf --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpG5iqb9/tmp1dMGq0/privsep.sock
vagrant 18555 0.0 0.0 14512 1092 pts/2 S+ 13:25 0:00 grep --color=auto privsep

But then when I continued running the test and it segfaulted, in the journal log I have:

Jan 08 13:25:29 fullstack-ubuntu kernel: privsep-helper[18369] segfault at 140043e8 ip 00007f8e1800ef32 sp 00007f8e18a63320 error 4 in libnetfilter_conntrack.so.3.5.0[7f8e18009000+1a000]

Please check the PIDs of those processes. The first one (when the test was "paused" with pdb) was 18368, and the later segfault was in 18369.

I don't know if you saw my comment today in Launchpad. I was trying to change the method used to start PrivsepDaemon from Method.ROOTWRAP to Method.FORK (in https://github.com/openstack/oslo.privsep/blob/master/oslo_privsep/priv_cont...) and run the test as root, and then the tests passed.

—
Slawek Kaplonski
Senior software engineer
Red Hat
On 1/8/19 2:22 PM, Slawomir Kaplonski wrote:
Hi Ben,
I was also looking at it today. I'm totally not a C and oslo.privsep expert, but I think that there is some new process spawned here. I put pdb before the line https://github.com/openstack/neutron/blob/master/neutron/privileged/agent/li... where this issue happens. Then, with "ps aux" I saw:

vagrant@fullstack-ubuntu ~ $ ps aux | grep privsep
root 18368 0.1 0.5 185752 33544 pts/1 Sl+ 13:24 0:00 /opt/stack/neutron/.tox/dsvm-functional/bin/python /opt/stack/neutron/.tox/dsvm-functional/bin/privsep-helper --config-file neutron/tests/etc/neutron.conf --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpG5iqb9/tmp1dMGq0/privsep.sock
vagrant 18555 0.0 0.0 14512 1092 pts/2 S+ 13:25 0:00 grep --color=auto privsep

But then when I continued running the test and it segfaulted, in the journal log I have:

Jan 08 13:25:29 fullstack-ubuntu kernel: privsep-helper[18369] segfault at 140043e8 ip 00007f8e1800ef32 sp 00007f8e18a63320 error 4 in libnetfilter_conntrack.so.3.5.0[7f8e18009000+1a000]

Please check the PIDs of those processes. The first one (when the test was "paused" with pdb) was 18368, and the later segfault was in 18369.
privsep-helper does fork, so I _think_ that's normal. https://github.com/openstack/oslo.privsep/blob/ecb1870c29b760f09fb933fc8ebb3...
I don't know if you saw my comment today in Launchpad. I was trying to change the method used to start PrivsepDaemon from Method.ROOTWRAP to Method.FORK (in https://github.com/openstack/oslo.privsep/blob/master/oslo_privsep/priv_cont...) and run the test as root, and then the tests passed.
Yeah, I saw that, but I don't understand it. :-/ The daemon should end up running with the same capabilities in either case. By the time it starts making the C calls the environment should be identical, regardless of which method was used to start the process.
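One quick way to sanity-check that claim, if anyone wants to compare the two start methods, is to read the effective capability mask of the running privsep-helper out of /proc. A generic sketch (not part of oslo.privsep):

    import os

    def effective_caps(pid):
        # CapEff in /proc/<pid>/status is the effective capability bitmask.
        with open('/proc/%d/status' % pid) as status:
            for line in status:
                if line.startswith('CapEff:'):
                    return int(line.split()[1], 16)

    # Point this at the privsep-helper PID (e.g. 18368 above) started each
    # way; identical masks would confirm the capabilities really are the same.
    print(hex(effective_caps(os.getpid())))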
I think I've got it. At least in my local tests, the handle pointer being passed from C -> Python -> C was getting truncated at the Python step because we didn't properly define the type. If the address assigned was larger than would fit in a standard int then we passed what amounted to a bogus pointer back to the C code, which caused the segfault.

I have no idea why privsep threading would have exposed this, other than maybe running in threads affected the address space somehow?

In any case, https://review.openstack.org/629335 has got these functional tests working for me locally in oslo.privsep 1.31.0. It would be great if somebody could try them out and verify that I didn't just find a solution that somehow only works on my system. :-)

-Ben
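The review above has the actual neutron fix; as a generic illustration of the truncation mechanism (using libc's malloc, not the netlink code): ctypes assumes a foreign function returns a C int unless restype says otherwise, so a pointer that doesn't fit in 32 bits comes back mangled.

    import ctypes

    libc = ctypes.CDLL('libc.so.6')

    # Without these declarations ctypes treats malloc's return value as a
    # C int, so on 64-bit platforms any address that doesn't fit in 32 bits
    # comes back truncated -- exactly the kind of bogus pointer that
    # segfaults later when C dereferences it.
    libc.malloc.restype = ctypes.c_void_p
    libc.malloc.argtypes = [ctypes.c_size_t]
    libc.free.argtypes = [ctypes.c_void_p]

    ptr = libc.malloc(64)
    print(hex(ptr))   # full 64-bit address, safe to hand back to C
    libc.free(ptr)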
A similar failure happens in neutron-fwaas. This blocks several patches in neutron-fwaas, including policy-in-code support. https://bugs.launchpad.net/neutron/+bug/1811506

Most failures are fixed by applying Ben's neutron fix https://review.openstack.org/#/c/629335/ [1], but we still have one failure in neutron_fwaas.tests.functional.privileged.test_utils.InNamespaceTest.test_in_namespace [2]. This failure is caused by oslo.privsep 1.31.0 too. This does not happen with 1.30.1. Any help would be appreciated.

[1] neutron-fwaas change https://review.openstack.org/#/c/630451/
[2] http://logs.openstack.org/51/630451/2/check/legacy-neutron-fwaas-dsvm-functi...

--
Akihiro Motoki (irc: amotoki)
I tried to set up a test environment for this, but I'm having some issues. My local environment is defaulting to python 3, while the gate job appears to have been running under python 2. I'm not sure why it's doing that since the tox env definition doesn't specify python 3 (maybe something to do with https://review.openstack.org/#/c/622415/ ?), but either way I keep running into import issues.

I'll take another look tomorrow, but in the meantime I'm afraid I haven't made any meaningful progress. :-(
If no version is specified in the tox.ini then tox defaults to the version of python used to install it.

--
Doug
Ah, good to know. I think I installed tox as just "tox" instead of "python-tox", which means I got the py3 version.

Unfortunately I'm still having trouble running the failing test (and not for the expected reason ;-). The daemon is failing to start with:

ImportError: No module named tests.functional.utils

I'm not seeing any log output from the daemon either for some reason, so it's hard to debug. There must be some difference between this and the neutron test environment, because in neutron I was getting daemon log output in /opt/stack/logs.

I'll keep poking at it, but I'm open to suggestions.
On 1/15/19 11:16 AM, Ben Nemec wrote:
I'm not seeing any log output from the daemon either for some reason so it's hard to debug. There must be some difference between this and the neutron test environment because in neutron I was getting daemon log output in /opt/stack/logs.
Figured this part out. tox.ini wasn't inheriting some values in the same way as neutron. Fix proposed in https://review.openstack.org/#/c/631035/ Now hopefully I can make progress on the rest of it.
TLDR: We now need to look at the thread namespace instead of the process namespace. Many, many details below.
Unfortunately I'm still having trouble running the failing test (and not for the expected reason ;-). The daemon is failing to start with:
ImportError: No module named tests.functional.utils
No idea why, but updating the fwaas capabilities to match core neutron by adding c.CAP_DAC_OVERRIDE and c.CAP_DAC_READ_SEARCH made this go away. Those are related to file permission checks, but the permissions on my source tree are, well, permissive, so I'm not sure why that would be a problem.
I'm not seeing any log output from the daemon either for some reason so it's hard to debug. There must be some difference between this and the neutron test environment because in neutron I was getting daemon log output in /opt/stack/logs.
Figured this part out. tox.ini wasn't inheriting some values in the same way as neutron. Fix proposed in https://review.openstack.org/#/c/631035/
Actually, I discovered that these logs were happening, they were just in /tmp. So that change is probably not necessary, especially since it's breaking ci.
Now hopefully I can make progress on the rest of it.
And sure enough, I did. :-)

In short, we need to look at the thread-specific network namespace in this test instead of the process-specific one. When we change the namespace it only affects the thread, unless the call is made from the process's main thread. Here's a simple(?) example:

#!/usr/bin/env python

import ctypes
import os
import threading

from pyroute2 import netns

# The python threading identifier is useless here,
# we need to make a syscall
libc = ctypes.CDLL('libc.so.6')


def do_the_thing(ns):
    tid = libc.syscall(186)  # This id varies by platform :-/
    # Check the starting netns
    print('process %s' % os.readlink('/proc/self/ns/net'))
    print('thread %s' % os.readlink('/proc/self/task/%s/ns/net' % tid))
    # Change the netns
    print('changing to %s' % ns)
    netns.setns(ns)
    # Check again. It should be different
    print('process %s' % os.readlink('/proc/self/ns/net'))
    print('thread %s\n' % os.readlink('/proc/self/task/%s/ns/net' % tid))


# Run in main thread
do_the_thing('foo')

# Run in new thread
t = threading.Thread(target=do_the_thing, args=('bar',))
t.start()
t.join()

# Run in main thread again to show difference
do_the_thing('bar')

# Clean up after ourselves
netns.remove('foo')
netns.remove('bar')

And here's the output:

process net:[4026531992]
thread net:[4026531992]
changing to foo
process net:[4026532196] <- Running in the main thread changes both
thread net:[4026532196]

process net:[4026532196]
thread net:[4026532196]
changing to bar
process net:[4026532196] <- Child thread only changes the thread
thread net:[4026532254]

process net:[4026532196]
thread net:[4026532196]
changing to bar
process net:[4026532254] <- Main thread gets them back in sync
thread net:[4026532254]

So, to get this test passing I think we need to change [1] so it looks for the thread id and uses a replacement for [2] that allows the thread id to be injected as above.

And it's the end of my day so I'm going to leave it there. :-)

1: https://github.com/openstack/neutron-fwaas/blob/master/neutron_fwaas/privile...
2: https://github.com/openstack/neutron-fwaas/blob/master/neutron_fwaas/privile...

-Ben
Thanks Ben for digging into the details. I ran some more tests based on your test script.
From my test results, pyroute2 and "ip" command operations against a netns seem to work fine even if the network namespaces of the process and the thread are different. The test script is http://paste.openstack.org/show/742886/ and the result is http://paste.openstack.org/show/742887/.
So, to get this test passing I think we need to change [1] so it looks for the thread id and uses a replacement for [2] that allows the thread id to be injected as above.
I confirmed that network namespace operations work well, so it looks safe. Considering the situation, I proposed a change to the failing test to check the list of network devices inside a netns: https://review.openstack.org/#/c/631654/

Thanks,
Akihiro Motoki (irc: amotoki)
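Not the actual patch, but the kind of check that change relies on can be sketched with pyroute2 directly: open the namespace and list the devices visible inside it (namespace name made up here; needs root/CAP_SYS_ADMIN):

    from pyroute2 import NetNS, netns

    netns.create('example-ns')
    ns = NetNS('example-ns')
    try:
        # get_links() returns netlink messages for every device visible
        # inside the namespace, regardless of which netns the calling
        # thread currently sits in.
        devices = [link.get_attr('IFLA_IFNAME') for link in ns.get_links()]
        print(devices)   # typically ['lo'] for a fresh namespace
    finally:
        ns.close()
        netns.remove('example-ns')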
participants (5)

- Akihiro Motoki
- Ben Nemec
- Brian Haley
- Doug Hellmann
- Slawomir Kaplonski