[nova] Thoughts on exposing exception type to non-admins in instance action event
tl;dr: What do people think about storing and showing the *type* of exception that is recorded with a failed instance action event (like a fault) to the owner of the server who may not be an admin?

Details:

As noted here [1] and recreated here [2] the instance action event details that a non-admin owner of a server sees do not contain any useful information about what caused the failure of the action. Here is an example of a failed resize from that paste (this is what the non-admin owner of the server would see):

    $ openstack --os-compute-api-version 2.51 server event show vm2 req-11487504-da59-411b-b3b8-267bebe9b0d2 -f json -c events
    {
      "events": [
        {
          "finish_time": "2019-11-13T16:18:27.000000",
          "start_time": "2019-11-13T16:18:26.000000",
          "event": "cold_migrate",
          "result": "Error"
        },
        {
          "finish_time": "2019-11-13T16:18:27.000000",
          "start_time": "2019-11-13T16:18:26.000000",
          "event": "conductor_migrate_server",
          "result": "Error"
        }
      ]
    }

Super useful, right? In this case scheduling failed for the resize so the instance is not in ERROR status which means the user cannot see a fault message with the NoValidHost error either.
The admin can see the traceback in the failed action event list:

    $ openstack --os-compute-api-version 2.51 server event show 3ef043ea-e2d7-4565-a401-5c758e149f23 req-11487504-da59-411b-b3b8-267bebe9b0d2 -f json -c events
    {
      "events": [
        {
          "finish_time": "2019-11-13T16:18:27.000000",
          "start_time": "2019-11-13T16:18:26.000000",
          "traceback": " File \"/opt/stack/nova/nova/conductor/manager.py\", line 301, in migrate_server\n host_list)\n File \"/opt/stack/nova/nova/conductor/manager.py\", line 367, in _cold_migrate\n raise exception.NoValidHost(reason=msg)\n",
          "event": "cold_migrate",
          "result": "Error"
        },
        {
          "finish_time": "2019-11-13T16:18:27.000000",
          "start_time": "2019-11-13T16:18:26.000000",
          "traceback": " File \"/opt/stack/nova/nova/compute/utils.py\", line 1411, in decorated_function\n return function(self, context, *args, **kwargs)\n File \"/opt/stack/nova/nova/conductor/manager.py\", line 301, in migrate_server\n host_list)\n File \"/opt/stack/nova/nova/conductor/manager.py\", line 367, in _cold_migrate\n raise exception.NoValidHost(reason=msg)\n",
          "event": "conductor_migrate_server",
          "result": "Error"
        }
      ]
    }

So when the admin gets the support ticket they can at least tell that scheduling failed and then dig into why.

My idea is to store the exception *type* with the action event, similar to the recorded instance fault message for non-NovaExceptions [3] which will show to the non-admin owner of the server if the server status is ERROR or DELETED [4]. We should record the exc_val to get a prettier message like "No valid host was found." but that could leak details in the error message that we don't want non-admins to see [5].
With what I'm thinking, the non-admin owner of the server could see something like this for a failed event:

    {
      "finish_time": "2019-11-13T16:18:27.000000",
      "start_time": "2019-11-13T16:18:26.000000",
      "event": "cold_migrate",
      "result": "Error",
      "details": "NoValidHost"
    }

That's pretty simple, doesn't leak details, and at least indicates to the user that maybe they can retry the resize with another flavor or something. It's just an example.

This would require a microversion so before writing a spec I wanted to get general feelings about this in the mailing list. I accept that it might not really be worth the effort so that's good feedback if it's how you feel (I'll only cry a little).

[1] https://review.opendev.org/#/c/693937/2/nova/objects/instance_action.py
[2] http://paste.openstack.org/show/786054/
[3] https://github.com/openstack/nova/blob/20.0.0/nova/compute/utils.py#L101
[4] https://github.com/openstack/nova/blob/20.0.0/nova/api/openstack/compute/vie...
[5] https://bugs.launchpad.net/nova/+bug/1851587

--
Thanks,
Matt
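To make the trade-off in this proposal concrete, here is a minimal sketch of the two options: recording only the exception type versus recording the formatted message. The `NoValidHost` class below is a stand-in written for this example, and `details_from_type`, `details_from_message` and `build_event` are hypothetical helpers, not nova's actual code. The host name in the reason string is made up.

```python
class NoValidHost(Exception):
    """Stand-in for nova's exception of the same name (illustrative only)."""

    def __init__(self, reason):
        super().__init__("No valid host was found. %s" % reason)


def details_from_type(exc):
    # Option 1: expose only the class name; nothing from the message leaks.
    return type(exc).__name__


def details_from_message(exc):
    # Option 2: expose the formatted message; friendlier, but it may carry
    # details that non-admins should not see.
    return str(exc)


def build_event(name, exc, details_fn):
    # Hypothetical event dict shaped like the example in the email.
    return {"event": name, "result": "Error", "details": details_fn(exc)}


# Made-up failure whose message contains an internal host name.
exc = NoValidHost(reason="Not enough capacity on host compute-17.")
print(build_event("cold_migrate", exc, details_from_type))
print(build_event("cold_migrate", exc, details_from_message))
```

The first call yields a `details` value of just the class name, while the second includes the made-up host name, which is the kind of leak referenced in [5].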
On 11/13/2019 10:51 AM, Matt Riedemann wrote:
We should record the exc_val to get a prettier message like "No valid host was found." but that could leak details in the error message that we don't want non-admins to see [5].
Typo above, should have been "We *could* record...".

--
Thanks,
Matt
On Wed, Nov 13, 2019 at 18:27, Eric Fried <openstack@fried.cc> wrote:
Unless it's likely to be something other than NoValidHost a significant percentage of the time, IMO it...
On 11/13/19 10:51 AM, Matt Riedemann wrote:
might not really be worth the effort
efried
FWIW, os-instance-actions is super useful for some ops, at least for my customers :-) Getting the exact same answer from this API as from a nova show would be very nice, honestly. So, yeah, +1 to the spec, and please add me for a review :-)
On 11/13/2019 11:17 AM, Eric Fried wrote:
Unless it's likely to be something other than NoValidHost a significant percentage of the time, IMO it...
Well just taking resize, it could be one of many things:

https://github.com/openstack/nova/blob/20.0.0/nova/conductor/manager.py#L366 - oops you tried resizing which would screw up your group affinity policy

https://github.com/openstack/nova/blob/20.0.0/nova/compute/manager.py#L4490 - (for an admin, cold migrate) oops you tried cold migrating a vcenter vm or you have allow_resize_to_same_host=True and the scheduler picks the same host (silly scheduler, see bug 1748697)

https://github.com/openstack/nova/blob/20.0.0/nova/compute/claims.py#L113 - oops you lost a resource claims race, try again

https://github.com/openstack/nova/blob/20.0.0/nova/scheduler/client/report.p... - oops you lost a race with allocation consumer generation conflicts, try again

--
Thanks,
Matt
Okay, are we going to have a document that maps exception classes to these explanations and recovery actions? Which we then have to maintain as the code changes? Or are they expected to look through code (without a stack trace)?

I'm not against the idea, just playing devil's advocate. Sylvain seems to have a use case, so great.

As an alternative, have we considered a mechanism whereby we could, in appropriate code paths, provide some text that's expressly intended for the end user to see? Maybe it's a new user_message field on NovaException which, if present, gets percolated up to a new field similar to the one you suggested.

efried
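The `user_message` alternative Eric raises could look roughly like the sketch below. Everything here is hypothetical: the `NovaException` class is a stand-in, the `user_message` attribute does not exist in nova as far as this thread says, and the two subclasses and `event_details` are invented for the example.

```python
class NovaException(Exception):
    """Stand-in base class carrying the optional field Eric describes."""

    # Hypothetical attribute: text written expressly for end users.
    user_message = None


class ClaimRaceLost(NovaException):
    # Hypothetical exception name, used only for this example.
    user_message = "The request lost a race for resources; please retry."


class InternalDriverError(NovaException):
    # No user_message is set, so nothing is percolated up to the user.
    pass


def event_details(exc):
    # Return the user-facing text only if the raising code path supplied one;
    # exceptions outside the hierarchy simply yield None.
    return getattr(exc, "user_message", None)


print(event_details(ClaimRaceLost()))
print(event_details(InternalDriverError("raw driver output")))
```

With this shape, only code paths that opt in by setting `user_message` expose any text, which is also why each such path would have to be curated by hand.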
On 11/13/2019 2:38 PM, Eric Fried wrote:
Okay, are we going to have a document that maps exception classes to these explanations and recovery actions? Which we then have to maintain as the code changes? Or are they expected to look through code (without a stack trace)?
Nope.
I'm not against the idea, just playing devil's advocate. Sylvain seems to have a use case, so great.
Yeah I know. Like I said in the original email, just having the exception type might not be very useful to an end user. That's almost like just showing an error code that is then used by support staff. If we do expose the details as the formatted exception message, like we do for faults, then I think it would be more useful to end users, but then you also run into the same issues as we have for fault messages that maybe leak too much detail [1]. However, with the way I was thinking about doing this, the instance action code would use the same utility method that generates the fault message so if we fix [1] for faults it's also fixed for instance actions automatically.

If I get the time this week I'll WIP something together that does what I'm thinking as a proof of concept, likely without the microversion stuff just since that's unnecessary overhead for a PoC.
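The point about sharing one utility method can be sketched as follows. This is not nova's code: `user_safe_message`, `fault_from_exception` and `event_from_exception` are hypothetical names, the `NovaException` class is a stand-in, and the rule used here (formatted message for NovaException subclasses, class name otherwise) is an assumption chosen for illustration.

```python
class NovaException(Exception):
    """Stand-in base class; the real one lives in nova/exception.py."""

    def format_message(self):
        return str(self)


def user_safe_message(exc):
    # Hypothetical shared helper: the single place that decides what a
    # non-admin may see for a given exception.
    if isinstance(exc, NovaException):
        return exc.format_message()
    # Assumed rule: for anything else, fall back to the class name only.
    return type(exc).__name__


def fault_from_exception(exc):
    # Would feed the "fault" shown on a server in ERROR status.
    return {"message": user_safe_message(exc)}


def event_from_exception(name, exc):
    # Would feed the failed instance action event.
    return {"event": name, "result": "Error",
            "details": user_safe_message(exc)}


err = KeyError("internal-only detail")
print(fault_from_exception(err))
print(event_from_exception("cold_migrate", err))
```

Because both callers go through `user_safe_message`, tightening that one function changes what faults and action events expose at the same time.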
As an alternative, have we considered a mechanism whereby we could, in appropriate code paths, provide some text that's expressly intended for the end user to see? Maybe it's a new user_message field on NovaException which, if present, gets percolated up to a new field similar to the one you suggested.
I think that likely becomes as whack-a-mole to contain as documenting all of the different types of errors.

[1] https://bugs.launchpad.net/nova/+bug/1851587

--
Thanks,
Matt
On 11/14/2019 7:58 AM, Matt Riedemann wrote:
If I get the time this week I'll WIP something together that does what I'm thinking as a proof of concept
Here is a simple PoC:

https://review.opendev.org/#/q/topic:bp/action-event-fault-details

The API change with a new microversion (sans API samples) is actually smaller than the object code change to store the fault message. Anyway, this gives an idea and it was pretty simple to write up.

--
Thanks,
Matt
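For readers who have not opened the PoC, the API side of such a change might be shaped like the sketch below. It is not taken from the review: `format_event` is a hypothetical view function, the stored event dict is invented, and `DETAILS_MICROVERSION` is a placeholder because the thread does not name the microversion that would add the field.

```python
# Placeholder: the thread does not say which microversion would add the
# new field, so this tuple is made up for the example.
DETAILS_MICROVERSION = (2, 99)


def format_event(event, request_version, is_admin):
    """Build the API view of one stored event (hypothetical shape)."""
    view = {
        "event": event["event"],
        "start_time": event["start_time"],
        "finish_time": event["finish_time"],
        "result": event["result"],
    }
    if is_admin:
        # As in the examples in this thread, only admins see the traceback.
        view["traceback"] = event.get("traceback")
    if request_version >= DETAILS_MICROVERSION:
        # Older microversions keep returning exactly what they did before.
        view["details"] = event.get("details")
    return view


# Invented stored record for the failed resize discussed above.
stored = {
    "event": "cold_migrate",
    "start_time": "2019-11-13T16:18:26.000000",
    "finish_time": "2019-11-13T16:18:27.000000",
    "result": "Error",
    "traceback": "raise exception.NoValidHost(reason=msg)",
    "details": "NoValidHost",
}
print(format_event(stored, (2, 51), is_admin=False))
print(format_event(stored, DETAILS_MICROVERSION, is_admin=False))
```

Gating the new key on the requested version is what lets existing clients keep the old response shape.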
On 13/11/19 14:38 -0600, Eric Fried wrote:
Okay, are we going to have a document that maps exception classes to these explanations and recovery actions? Which we then have to maintain as the code changes? Or are they expected to look through code (without a stack trace)?
I'm not against the idea, just playing devil's advocate. Sylvain seems to have a use case, so great.
As an alternative, have we considered a mechanism whereby we could, in appropriate code paths, provide some text that's expressly intended for the end user to see? Maybe it's a new user_message field on NovaException which, if present, gets percolated up to a new field similar to the one you suggested.
Would this be like the "user messages" provided by block [1] and file [2] storage components?

[1] https://docs.openstack.org/cinder/latest/contributor/user_messages.html
[2] https://docs.openstack.org/manila/latest/contributor/user_messages.html

-- Tom
On 11/14/2019 8:30 AM, Tom Barron wrote:
Would this be like the "user messages" provided by block [1] and file [2] storage components?
[1] https://docs.openstack.org/cinder/latest/contributor/user_messages.html [2] https://docs.openstack.org/manila/latest/contributor/user_messages.html
The instance actions API in nova is very similar. Rather than build a new "user messages" API in nova I'm just talking about providing more detail on the actual error that occurred per failed event per action, basically the same as the user would see in a fault message on the server when it's in ERROR status. Right now the instance action and events say either "Success" or "Error" for the message/result, which is not useful in the Error case.

--
Thanks,
Matt
I would like to see this feature. Our customers have mentioned the same problem, so I think this is useful. I think it should cover all instance action operations, such as the actions in nova/compute/instance_actions.py.

brinzhang
Subject: [sent on behalf of lists.openstack.org] Re: [nova] Thoughts on exposing exception type to non-admins in instance action event
On 11/13/2019 11:17 AM, Eric Fried wrote:
Unless it's likely to be something other than NoValidHost a significant percentage of the time, IMO it...
Well just taking resize, it could be one of many things:
https://github.com/openstack/nova/blob/20.0.0/nova/conductor/manager.py#L366 - oops you tried resizing which would screw up your group affinity policy
https://github.com/openstack/nova/blob/20.0.0/nova/compute/manager.py#L4490 - (for an admin, cold migrate) oops you tried cold migrating a vcenter vm or you have allow_resize_to_same_host=True and the scheduler picks the same host (silly scheduler, see bug 1748697)
https://github.com/openstack/nova/blob/20.0.0/nova/compute/claims.py#L113 - oops you lost a resource claims race, try again
https://github.com/openstack/nova/blob/20.0.0/nova/scheduler/client/report.py#L1898 - oops you lost a race with allocation consumer generation conflicts, try again
--
Thanks,
Matt
On 11/14/2019 2:47 AM, Brin Zhang(张百林) wrote:
I think that should consider of the all instance action operations, such as actions in nova/compute/instance_actions.py.
The resize examples in my email are just examples. The code that generates the action events is centralized in the InstanceActionEvent object so it would be used for all actions that fail with some exception.

--
Thanks,
Matt
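The benefit of that centralization can be illustrated with a small sketch. This is not the InstanceActionEvent object itself: `record_event` is a hypothetical context manager, the event names are invented, and events are kept in a plain list rather than a database. The idea shown is only that one wrapper records the outcome for every action that runs inside it.

```python
import contextlib
import datetime


@contextlib.contextmanager
def record_event(events, name):
    # Hypothetical wrapper: any action run inside this block gets an
    # event record, whichever action it is.
    event = {
        "event": name,
        "start_time": datetime.datetime.now(datetime.timezone.utc),
    }
    events.append(event)
    try:
        yield event
    except Exception as exc:
        event["result"] = "Error"
        # One place decides what is stored for every failing action.
        event["details"] = type(exc).__name__
        raise
    else:
        event["result"] = "Success"
    finally:
        event["finish_time"] = datetime.datetime.now(datetime.timezone.utc)


events = []
with record_event(events, "example_ok_action"):
    pass  # the action succeeds

try:
    with record_event(events, "example_failing_action"):
        raise RuntimeError("something internal went wrong")
except RuntimeError:
    pass  # the caller still sees the original exception

for e in events:
    print(e["event"], e["result"], e.get("details"))
```

Because the failure handling lives in the wrapper, no individual action needs its own code to populate `details`.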
participants (5)
- Brin Zhang(张百林)
- Eric Fried
- Matt Riedemann
- Sylvain Bauza
- Tom Barron