tl;dr: What do people think about storing and showing the *type* of exception that is recorded with a failed instance action event (like a fault) to the owner of the server who may not be an admin?
Details:
As noted here [1] and recreated here [2] the instance action event details that a non-admin owner of a server sees do not contain any useful information about what caused the failure of the action. Here is an example of a failed resize from that paste (this is what the non-admin owner of the server would see):
$ openstack --os-compute-api-version 2.51 server event show vm2 req-11487504-da59-411b-b3b8-267bebe9b0d2 -f json -c events
{
  "events": [
    {
      "finish_time": "2019-11-13T16:18:27.000000",
      "start_time": "2019-11-13T16:18:26.000000",
      "event": "cold_migrate",
      "result": "Error"
    },
    {
      "finish_time": "2019-11-13T16:18:27.000000",
      "start_time": "2019-11-13T16:18:26.000000",
      "event": "conductor_migrate_server",
      "result": "Error"
    }
  ]
}
Super useful, right?
In this case scheduling failed for the resize, so the instance is not in ERROR status, which means the user cannot see a fault message with the NoValidHost error either.
The admin can see the traceback in the failed action event list:
$ openstack --os-compute-api-version 2.51 server event show 3ef043ea-e2d7-4565-a401-5c758e149f23 req-11487504-da59-411b-b3b8-267bebe9b0d2 -f json -c events
{
  "events": [
    {
      "finish_time": "2019-11-13T16:18:27.000000",
      "start_time": "2019-11-13T16:18:26.000000",
      "traceback": " File \"/opt/stack/nova/nova/conductor/manager.py\", line 301, in migrate_server\n host_list)\n File \"/opt/stack/nova/nova/conductor/manager.py\", line 367, in _cold_migrate\n raise exception.NoValidHost(reason=msg)\n",
      "event": "cold_migrate",
      "result": "Error"
    },
    {
      "finish_time": "2019-11-13T16:18:27.000000",
      "start_time": "2019-11-13T16:18:26.000000",
      "traceback": " File \"/opt/stack/nova/nova/compute/utils.py\", line 1411, in decorated_function\n return function(self, context, *args, **kwargs)\n File \"/opt/stack/nova/nova/conductor/manager.py\", line 301, in migrate_server\n host_list)\n File \"/opt/stack/nova/nova/conductor/manager.py\", line 367, in _cold_migrate\n raise exception.NoValidHost(reason=msg)\n",
      "event": "conductor_migrate_server",
      "result": "Error"
    }
  ]
}
So when the admin gets the support ticket they can at least tell that scheduling failed and then dig into why.
My idea is to store the exception *type* with the action event, similar to the instance fault message recorded for non-NovaExceptions [3], which is shown to the non-admin owner of the server if the server status is ERROR or DELETED [4].
We could record the exc_val to get a prettier message like "No valid host was found." but that could leak details in the error message that we don't want non-admins to see [5].
With what I'm thinking, the non-admin owner of the server could see something like this for a failed event:
{
  "finish_time": "2019-11-13T16:18:27.000000",
  "start_time": "2019-11-13T16:18:26.000000",
  "event": "cold_migrate",
  "result": "Error",
  "details": "NoValidHost"
}
That's pretty simple, doesn't leak details, and at least indicates to the user that maybe they can retry the resize with another flavor or something. It's just an example.
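To make the idea a bit more concrete, here is a rough, self-contained Python sketch. It is not nova code: ActionEventRecorder, event_view, the local NoValidHost class and the host name in the message are all made up for illustration, and the "details" key is just the field name proposed above. It assumes the event is finished from a context manager's __exit__, so the exception type is at hand, and it stores only the class name rather than the exc_val.

```python
import traceback


class ActionEventRecorder:
    """Hypothetical stand-in for whatever records action events."""

    def __init__(self, events, name):
        self.events = events
        self.name = name

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        event = {"event": self.name, "result": "Success"}
        if exc_type is not None:
            event["result"] = "Error"
            # Full traceback, meant for admins only.
            event["traceback"] = "".join(traceback.format_tb(exc_tb))
            # Only the exception class name is stored; exc_val is
            # deliberately ignored because its message could leak details.
            event["details"] = exc_type.__name__
        self.events.append(event)
        return False  # never swallow the exception


def event_view(event, is_admin):
    """Return the event as the caller would see it (assumed policy)."""
    hidden = set() if is_admin else {"traceback"}
    return {k: v for k, v in event.items() if k not in hidden}


class NoValidHost(Exception):
    """Local placeholder for the real exception class."""


events = []
try:
    with ActionEventRecorder(events, "cold_migrate"):
        raise NoValidHost("No valid host was found. secret-host-1 rejected")
except NoValidHost:
    pass

owner_view = event_view(events[0], is_admin=False)
admin_view = event_view(events[0], is_admin=True)
print(owner_view)
```

The owner gets the event name, the result and the exception class name, while the traceback stays admin-only and the exception message is never stored with the event.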
This would require a microversion, so before writing a spec I wanted to get general feelings about this on the mailing list. I accept that it might not really be worth the effort, so that's good feedback if it's how you feel (I'll only cry a little).
[1] https://review.opendev.org/#/c/693937/2/nova/objects/instance_action.py
[2] http://paste.openstack.org/show/786054/
[3] https://github.com/openstack/nova/blob/20.0.0/nova/compute/utils.py#L101
[4] https://github.com/openstack/nova/blob/20.0.0/nova/api/openstack/compute/vie...
[5] https://bugs.launchpad.net/nova/+bug/1851587