[nova] critical bug around reload/upgrades
Hello:

I've discussed this for quite some time with Dan over IRC and a bit with Zane as well, but basically, Nova thinks that when it gets a reload (aka SIGHUP), nothing else has occurred. However, oslo.service actually calls stop(), reload() and then start() again, which potentially kills all RPC. This has caused a pretty big issue in our gates, and it also means that the whole 'reload nova-compute while upgrading to refresh info' concept is fundamentally broken.

I tried to do some work on this here, but I wasn't really able to get to the bottom of it. There's a decision that needs to be made: do we change what reload() actually means in oslo.service (it currently behaves more like a restart, not a reload), or do Nova (and other projects) change their assumptions about what reload() does?

https://review.openstack.org/#/c/641907/

This seems to have been floating around for a really long time, so I'd be happy to work with someone to find the fix (and we can totally test it inside OpenStack Ansible by reloading instead of restarting).

Thanks!
Mohammed

--
Mohammed Naser — vexxhost
-----------------------------------------------------
D. 514-316-8872
D. 800-910-1726 ext. 200
E. mnaser@vexxhost.com
W. http://vexxhost.com
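To make the mismatch concrete, here is a minimal, self-contained sketch of the two readings of SIGHUP. This is illustrative only, not the actual oslo.service or Nova code; the class and handler names are made up for the example.

    import signal
    import time

    class FakeService:
        # Illustrative stand-in only -- not the real oslo.service code.

        def start(self):
            print("start(): creating RPC server and worker threads ...")

        def stop(self):
            print("stop(): tearing down RPC server and workers ...")

        def reload(self):
            print("reload(): re-reading mutable config options ...")

    svc = FakeService()

    def handle_sighup_as_restart(signum, frame):
        # What the thread says oslo.service effectively does today:
        # a full stop/start cycle around the config refresh, which is
        # why RPC connections can be lost on SIGHUP.
        svc.stop()
        svc.reload()
        svc.start()

    def handle_sighup_as_reload(signum, frame):
        # What Nova's "reload while upgrading" workflow assumes:
        # only refresh config, leave everything else running.
        svc.reload()

    if __name__ == "__main__":
        signal.signal(signal.SIGHUP, handle_sighup_as_restart)
        svc.start()
        while True:   # send SIGHUP to this process to see the difference
            time.sleep(1)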
On Sun, 24 Mar 2019 at 00:35, Mohammed Naser <mnaser@vexxhost.com> wrote:
> Hello:
> I've discussed this for quite some time with Dan over IRC and a bit with Zane as well, but basically, Nova thinks that when it gets a reload (aka SIGHUP), nothing else has occurred.
> However, oslo.service actually calls stop(), reload() and then start() again, which potentially kills all RPC. This has caused a pretty big issue in our gates, and it also means that the whole 'reload nova-compute while upgrading to refresh info' concept is fundamentally broken.
> I tried to do some work on this here, but I wasn't really able to get to the bottom of it. There's a decision that needs to be made: do we change what reload() actually means in oslo.service (it currently behaves more like a restart, not a reload), or do Nova (and other projects) change their assumptions about what reload() does?
> https://review.openstack.org/#/c/641907/
> This seems to have been floating around for a really long time, so I'd be happy to work with someone to find the fix (and we can totally test it inside OpenStack Ansible by reloading instead of restarting).
Thanks for bringing this up, Mohammed; I would also like to see a solution for this. We go with a hard restart of the service in kolla-ansible as a workaround. It would be nice if we could do a more lightweight HUP.
On Mon, Mar 25, 2019 at 6:02 AM Mark Goddard <mark@stackhpc.com> wrote:
> Thanks for bringing this up, Mohammed; I would also like to see a solution for this. We go with a hard restart of the service in kolla-ansible as a workaround. It would be nice if we could do a more lightweight HUP.
Looks like some progress has been made, but we're pretty confident this is more and more an oslo.service bug. Matt & Dan have both left ideas on the review with possible solutions for how to make a change like this backportable: https://review.openstack.org/#/c/641907/

Thanks.
--
Mohammed Naser — vexxhost
-----------------------------------------------------
D. 514-316-8872
D. 800-910-1726 ext. 200
E. mnaser@vexxhost.com
W. http://vexxhost.com
On 3/28/2019 7:42 PM, Mohammed Naser wrote:
> Looks like some progress has been made, but we're pretty confident this is more and more an oslo.service bug.
> Matt & Dan have both left ideas on the review with possible solutions for how to make a change like this backportable.
Another update on this: I was trying to recreate the originally reported issue in the nova bug:

https://bugs.launchpad.net/nova/+bug/1715374

I didn't even get to the point of the libvirt driver waiting for the network-vif-plugged event, because privsep blows up much earlier during server create after SIGHUP'ing the service. Details start at comment 34 in that bug, but the tl;dr is that the privsep-helper child processes are gone after the SIGHUP, so anything that relies on privsep (which I think is now anything using root in the libvirt driver and the os-vif utils code) won't work until you restart the service.

I don't yet know if this is a regression in Stein, but I'm going to create a stable/rocky devstack and try to find out.

--
Thanks,
Matt
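For anyone else trying to reproduce this, a rough sketch of the check follows. The process names and the five-second wait are assumptions based on a typical devstack node, not part of the bug report; run it as a user allowed to signal nova-compute.

    import os
    import signal
    import subprocess
    import time

    def pids(pattern):
        # Return PIDs whose command line matches `pattern`, via pgrep -f.
        result = subprocess.run(["pgrep", "-f", pattern],
                                capture_output=True, text=True)
        return [int(p) for p in result.stdout.split()]

    print("privsep helpers before SIGHUP:", pids("privsep-helper"))

    for pid in pids("nova-compute"):
        os.kill(pid, signal.SIGHUP)

    time.sleep(5)  # let the service run its stop()/reload()/start() cycle

    # If the bug is present, the helpers are gone and the next privileged
    # operation (e.g. plugging a VIF during a server create) fails until
    # nova-compute is fully restarted.
    print("privsep helpers after SIGHUP: ", pids("privsep-helper"))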
On 4/3/19 1:20 PM, Matt Riedemann wrote:
> Another update on this: I was trying to recreate the originally reported issue in the nova bug:
> https://bugs.launchpad.net/nova/+bug/1715374
> I didn't even get to the point of the libvirt driver waiting for the network-vif-plugged event, because privsep blows up much earlier during server create after SIGHUP'ing the service. Details start at comment 34 in that bug, but the tl;dr is that the privsep-helper child processes are gone after the SIGHUP, so anything that relies on privsep (which I think is now anything using root in the libvirt driver and the os-vif utils code) won't work until you restart the service.
> I don't yet know if this is a regression in Stein, but I'm going to create a stable/rocky devstack and try to find out.
With that oslo.service patch [1] in place, I recreated Matt's result as described above. Then I hacked on oslo.privsep a bit [2] and was able to resolve the issue (create instances smoothly after SIGHUPping n-cpu.service).

That fix is going to need UT, but also more thread- and socket- and security-savvy eyeballs to make sure it has legs. But hopefully we can finally put this one to bed.

efried

[1] https://review.opendev.org/#/c/641907/
[2] https://review.opendev.org/#/c/678323/
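Editorial note: without digging into the actual change in [2], the general shape of that kind of fix, sketched very loosely (none of the names below are real oslo.privsep internals), is to stop assuming the helper spawned at service start is still alive, and to respawn it on demand when the channel has gone away:

    import functools

    class PrivContext:
        # Loose illustration only; not the real oslo.privsep PrivContext.

        def __init__(self):
            self.channel = None

        def _spawn_helper(self):
            # The real library forks/execs a privsep-helper process and
            # keeps a socket to it; here we just simulate that with a flag.
            self.channel = "connected"

        def _channel_alive(self):
            # After SIGHUP the old helper has been reaped along with the
            # service's other children, so the channel is unusable.
            return self.channel == "connected"

        def entrypoint(self, func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                if not self._channel_alive():
                    self._spawn_helper()   # respawn instead of failing
                return func(*args, **kwargs)
            return wrapper

    context = PrivContext()

    @context.entrypoint
    def plug_vif(vif_id):
        return "plugged %s" % vif_id

    print(plug_vif("port-1"))    # spawns the helper on first use
    context.channel = None       # simulate the helper dying on SIGHUP
    print(plug_vif("port-2"))    # respawns instead of blowing up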
participants (4)
- Eric Fried
- Mark Goddard
- Matt Riedemann
- Mohammed Naser