[nova][neutron][ptg] Summary: Leaking resources when ports are deleted out-of-band
Summary: When a port is deleted out of band (while still attached to an instance), any associated QoS bandwidth resources are orphaned in placement.

Consensus:
- Neutron to block deleting a port whose "owner" field is set.
- If you really want to do this, null the "owner" field first.
- Nova still needs a way to delete the port during destroy. To be discussed. Possibilities:
  - Nova can null the "owner" field first.
  - The operation can be permitted with a certain policy role, which Nova would have to be granted.
  - Other?

efried .
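For illustration, the "null the owner field, then delete" flow could look like this from an API consumer's point of view. This is a minimal sketch using python-neutronclient, assuming an already-authenticated keystoneauth1 session ('sess') and a placeholder port ID; it is not an agreed interface:

# Minimal sketch of the proposed two-step delete; 'sess' is an
# assumed keystoneauth1 session and error handling is omitted.
from neutronclient.v2_0 import client as neutron_client

neutron = neutron_client.Client(session=sess)
port_id = 'PORT_UUID'  # placeholder

# Step 1: clear the ownership fields so neutron would no longer
# consider the port attached.
neutron.update_port(port_id, {'port': {'device_owner': '',
                                       'device_id': ''}})

# Step 2: delete the now-unowned port.
neutron.delete_port(port_id)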
On Fri, May 3, 2019 at 3:20 PM, Eric Fried <openstack@fried.cc> wrote:
Summary: When a port is deleted out of band (while still attached to an instance), any associated QoS bandwidth resources are orphaned in placement.
Consensus:
- Neutron to block deleting a port whose "owner" field is set.
- If you really want to do this, null the "owner" field first.
- Nova still needs a way to delete the port during destroy. To be discussed. Possibilities:
  - Nova can null the "owner" field first.
  - The operation can be permitted with a certain policy role, which Nova would have to be granted.
  - Other?
Two additions:

1) Nova will log an ERROR when the leak happens. (Nova knows the port_id and the RP UUID but doesn't know the size of the allocation, so it cannot remove it.) This logging can be added today.

2) Matt had a point after the session that if Neutron enforces that only unbound ports can be deleted, then not only Nova needs to be changed to unbind a port before deleting it, but possibly other Neutron consumers too (Octavia?).

Cheers,
gibi
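For point 1, the log could look roughly like the sketch below; the wording and the function/variable names are illustrative only, not the actual patch:

# Rough sketch of the ERROR log nova could emit when the leak
# happens; message text and names are made up for illustration.
from oslo_log import log as logging

LOG = logging.getLogger(__name__)

def log_leaked_allocation(port_id, rp_uuid):
    LOG.error(
        'Port %(port_id)s was deleted out of band while bound; its '
        'bandwidth allocation against resource provider %(rp_uuid)s '
        'is leaked in placement, because nova does not know the '
        'size of the allocation and cannot remove it.',
        {'port_id': port_id, 'rp_uuid': rp_uuid})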
On 5/3/2019 3:35 PM, Balázs Gibizer wrote:
2) Matt had a point after the session that if Neutron enforces that only unbound ports can be deleted, then not only Nova needs to be changed to unbind a port before deleting it, but possibly other Neutron consumers too (Octavia?).
And potentially Zun; there might be others (Magnum, Heat, idk?). Anyway, this is a thing that has been around forever and which admins shouldn't do. Do we need to prioritize making this change in both neutron and nova so that deleting a bound port takes two requests? Or is just logging the ERROR that you've leaked allocations, tsk tsk, enough? I tend to think the latter is fine until someone comes along saying this is really hurting them and they have a valid use case for deleting bound ports out of band from nova.

--
Thanks,
Matt
On Fri, May 3, 2019 at 4:11 PM Matt Riedemann <mriedemos@gmail.com> wrote:
On 5/3/2019 3:35 PM, Balázs Gibizer wrote:
2) Matt had a point after the session that if Neutron enforces that only unbound ports can be deleted, then not only Nova needs to be changed to unbind a port before deleting it, but possibly other Neutron consumers too (Octavia?).

And potentially Zun; there might be others (Magnum, Heat, idk?).

Anyway, this is a thing that has been around forever and which admins shouldn't do. Do we need to prioritize making this change in both neutron and nova so that deleting a bound port takes two requests? Or is just logging the ERROR that you've leaked allocations, tsk tsk, enough? I tend to think the latter is fine until someone comes along saying this is really hurting them and they have a valid use case for deleting bound ports out of band from nova.
neutron defines a special role called "advsvc" for advanced network services [1]. I think we can change neutron to block deletion of bound ports for regular users and allow users with the "advsvc" role to delete bound ports. I haven't checked which projects currently use "advsvc".

[1] https://opendev.org/openstack/neutron/src/branch/master/neutron/conf/policie...
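To make the idea concrete, such a default could be registered with oslo.policy along the lines of the sketch below; the rule name "delete_port:bound" and the check string are hypothetical, not an existing neutron policy:

# Hypothetical oslo.policy default: only callers matching the
# existing "context_is_advsvc" rule may delete a bound port.
from oslo_policy import policy

DELETE_BOUND_PORT = policy.DocumentedRuleDefault(
    name='delete_port:bound',  # made-up rule name for this sketch
    check_str='rule:context_is_advsvc',
    description='Delete a port that is still bound to an instance.',
    operations=[{'method': 'DELETE', 'path': '/ports/{id}'}],
)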
I think this will have implications for Octavia, but we can work through those.

There are cases during cleanup from an error where we delete ports owned by "Octavia" that have not yet been attached to a nova instance. My understanding of the above discussion is that this would not be an issue under this change.

However...

We also currently manipulate the ports we have hot-plugged (attached) to nova instances, where the port "device_owner" has become "compute:nova", mostly for failover scenarios and cases where a nova detach fails and we have to revert the action.

Now, if the "proper" new procedure is to first detach before deleting the port, we can look at attempting that. But in the common failure scenarios we see nova failing to complete this, for example if the compute host has been powered off. In this scenario we still need to delete the neutron port, both for resource cleanup and for quota reasons, so we can create a new port and attach it to a new instance to recover.

I think this change will impact our current port management flows, so we should proceed cautiously, test heavily, and potentially address some of the nova failure scenarios at the same time.

Michael

On Fri, May 3, 2019 at 5:23 PM Akihiro Motoki <amotoki@gmail.com> wrote:
On Fri, May 3, 2019 at 4:11 PM Matt Riedemann <mriedemos@gmail.com> wrote:
On 5/3/2019 3:35 PM, Balázs Gibizer wrote:
2) Matt had a point after the session that if Neutron enforces that only unbound ports can be deleted, then not only Nova needs to be changed to unbind a port before deleting it, but possibly other Neutron consumers too (Octavia?).

And potentially Zun; there might be others (Magnum, Heat, idk?).

Anyway, this is a thing that has been around forever and which admins shouldn't do. Do we need to prioritize making this change in both neutron and nova so that deleting a bound port takes two requests? Or is just logging the ERROR that you've leaked allocations, tsk tsk, enough? I tend to think the latter is fine until someone comes along saying this is really hurting them and they have a valid use case for deleting bound ports out of band from nova.

neutron defines a special role called "advsvc" for advanced network services [1]. I think we can change neutron to block deletion of bound ports for regular users and allow users with the "advsvc" role to delete bound ports. I haven't checked which projects currently use "advsvc".
[1] https://opendev.org/openstack/neutron/src/branch/master/neutron/conf/policie...
On Sat, May 4, 2019 at 10:25 AM, Michael Johnson <johnsomor@gmail.com> wrote:
I think this will have implications for Octavia, but we can work through those.
There are cases during cleanup from an error where we delete ports owned by "Octavia" that have not yet been attached to a nova instance. My understanding of the above discussion is that this would not be an issue under this change.
If the port is owned by Octavia then the resource leak does not happen. However, the proposed neutron code / policy change affects this case as well.
However...

We also currently manipulate the ports we have hot-plugged (attached) to nova instances, where the port "device_owner" has become "compute:nova", mostly for failover scenarios and cases where a nova detach fails and we have to revert the action.

Now, if the "proper" new procedure is to first detach before deleting the port, we can look at attempting that. But in the common failure scenarios we see nova failing to complete this, for example if the compute host has been powered off. In this scenario we still need to delete the neutron port, both for resource cleanup and for quota reasons, so we can create a new port and attach it to a new instance to recover.
If Octavia also deletes the VM then force deleting the port is OK from a placement resource perspective, as the VM delete will make sure we are deleting the leaked port resources.
I think this change will impact our current port management flows, so we should proceed cautiously, test heavily, and potentially address some of the nova failure scenarios at the same time.
After talking to rm_work on #openstack-nova [1], it seems that the policy-based solution would work for Octavia. Octavia with the extra policy can still delete the bound port in Neutron safely, as Octavia also deletes the VM that the port was bound to; that VM delete will reclaim the leaked port resources. The failure to detach a port via nova while the nova-compute is down could be a bug on the nova side.

Cheers,
gibi

[1] http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2...
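As a sketch, the failover path could then look like this on the Octavia side. The novaclient/neutronclient calls are real, but the flow itself, and the assumption that Octavia holds the extra policy role, are illustrative:

# Sketch of the discussed failover flow; 'nova' is assumed to be a
# novaclient Client and 'neutron' a neutronclient Client.
from novaclient import exceptions as nova_exc

def remove_vip_port(nova, neutron, server_id, port_id):
    try:
        # Preferred path: let nova detach (and unbind) the interface.
        nova.servers.interface_detach(server_id, port_id)
    except nova_exc.ClientException:
        # The compute host may be down; fall through and delete the
        # still-bound port directly, which the extra policy role
        # would permit. Octavia also deletes the VM, and that delete
        # reclaims the leaked allocation.
        pass
    neutron.delete_port(port_id)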
On Fri, May 3, 2019 at 5:23 PM Akihiro Motoki <amotoki@gmail.com> wrote:
On Fri, May 3, 2019 at 4:11 PM Matt Riedemann <mriedemos@gmail.com> wrote:
On 5/3/2019 3:35 PM, Balázs Gibizer wrote:

2) Matt had a point after the session that if Neutron enforces that only unbound ports can be deleted, then not only Nova needs to be changed to unbind a port before deleting it, but possibly other Neutron consumers too (Octavia?).
And potentially Zun; there might be others (Magnum, Heat, idk?).

Anyway, this is a thing that has been around forever and which admins shouldn't do. Do we need to prioritize making this change in both neutron and nova so that deleting a bound port takes two requests? Or is just logging the ERROR that you've leaked allocations, tsk tsk, enough? I tend to think the latter is fine until someone comes along saying this is really hurting them and they have a valid use case for deleting bound ports out of band from nova.

neutron defines a special role called "advsvc" for advanced network services [1]. I think we can change neutron to block deletion of bound ports for regular users and allow users with the "advsvc" role to delete bound ports. I haven't checked which projects currently use "advsvc".
On 5/4/2019 11:57 AM, Balázs Gibizer wrote:
The failure to detach a port via nova while the nova-compute is down could be a bug on the nova side.
Depends on what you mean by detach. If the compute is down while deleting the server, the API will still call the (internal to nova) network API code [1] to either (a) unbind ports that nova didn't create or (b) delete ports that nova did create.

For the policy change where the port has to be unbound to delete it, we'd already have support for that; it's just an extra step.

At the PTG I was groaning a bit about needing to add another step to delete a port from the nova side, but thinking about it more, we have to do the exact same thing with cinder volumes (we have to detach them before deleting them), so I guess it's not the worst thing ever.

[1] https://github.com/openstack/nova/blob/56fef7c0e74d7512f062c4046def10401df16...

--
Thanks,
Matt
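In rough Python, the branching described above looks something like the sketch below; the real logic lives in nova's neutron network API module [1] and differs in detail:

# Illustrative sketch only, not the actual nova code.
def cleanup_ports(neutron, port_ids, created_by_nova):
    for port_id in port_ids:
        if port_id in created_by_nova:
            # (b) nova created this port during boot, so it owns the
            # port's lifecycle and deletes it outright.
            neutron.delete_port(port_id)
        else:
            # (a) a pre-existing, user-supplied port: just unbind it
            # so the user can reuse or delete it themselves.
            neutron.update_port(
                port_id,
                {'port': {'device_id': '',
                          'device_owner': '',
                          'binding:host_id': None}})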
On Wed, May 8, 2019 at 6:18 PM, Matt Riedemann <mriedemos@gmail.com> wrote:
On 5/4/2019 11:57 AM, Balázs Gibizer wrote:
The failure to detach a port via nova while the nova-compute is down could be a bug on the nova side.
Depends on what you mean by detach. If the compute is down while deleting the server, the API will still call the (internal to nova) network API code [1] to either (a) unbind ports that nova didn't create or (b) delete ports that nova did create.
This sentence is based on the reported bug [2]. The reason why Octavia is unbinding the port in Neutron instead of via Nova is that Nova fails to detach the interface and unbind the port if the nova-compute is down. In that bug we are discussing whether it would be meaningful to do a local interface detach (unbind the port in neutron + deallocate the port resource in placement) in the nova-api if the compute is down, similar to the local server delete.

[2] https://bugs.launchpad.net/nova/+bug/1827746
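For the placement half of such a local detach, the allocation update could be sketched like this; 'placement' is an assumed keystoneauth1 Adapter for the placement service, and the consumer/microversion handling is simplified:

# Hedged sketch: drop a port's resource provider from the server's
# allocation while keeping the compute-side allocations intact.
def drop_port_allocation(placement, consumer_uuid, port_rp_uuid):
    url = '/allocations/%s' % consumer_uuid
    # Microversion 1.28 returns consumer_generation, which the PUT
    # needs back for conflict detection.
    body = placement.get(url, microversion='1.28').json()
    body['allocations'].pop(port_rp_uuid, None)
    placement.put(url, json=body, microversion='1.28')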
For the policy change where the port has to be unbound to delete it, we'd already have support for that; it's just an extra step.

At the PTG I was groaning a bit about needing to add another step to delete a port from the nova side, but thinking about it more, we have to do the exact same thing with cinder volumes (we have to detach them before deleting them), so I guess it's not the worst thing ever.
As soon as somebody from Neutron states that the neutron policy patch is on the way, I can start working on the Nova side of this.

Cheers,
gibi
On 5/9/2019 4:19 AM, Balázs Gibizer wrote:
This sentence is based on the reported bug [2]. The reason why Octavia is unbinding the port in Neutron instead of via Nova is that Nova fails to detach the interface and unbind the port if the nova-compute is down. In that bug we are discussing whether it would be meaningful to do a local interface detach (unbind the port in neutron + deallocate the port resource in placement) in the nova-api if the compute is down, similar to the local server delete.
Oh OK, I was confusing this with deleting the VM while the compute host was down, not detaching the port from the server while the compute was down. Yeah, I'm not sure what we'd want to do there. We could obviously do the same thing we do for VM delete in the API while the compute host is down, but could we be leaking things on the compute host in that case if the VIF was never properly unplugged? I'd think that is already an issue for local delete of the VM in the API if the compute comes back up later (maybe there is something in the compute service on startup that will do cleanup; I'm not sure off the top of my head).

--
Thanks,
Matt
1) Nova will log an ERROR when the leak happens. (Nova knows the port_id and the RP UUID but doesn't know the size of the allocation, so it cannot remove it.) This logging can be added today.
Patch is up with an ERROR log: https://review.opendev.org/#/c/657079/

gibi
participants (5)
- Akihiro Motoki
- Balázs Gibizer
- Eric Fried
- Matt Riedemann
- Michael Johnson