Re: [nova][dev][ops] server status when compute host is down

23 May 2019

      On 5/23/2019 9:05 AM, Dan Smith wrote:
...
...
Question: do people think we should make the server status field
  reflect UNKNOWN as well, if the 'host_status' is UNKNOWN? And if so,
  should it be controlled by policy or not?
Do we have other things that change *value* depending on policy? I was
thinking that was one of the situations the policy people (i.e. Matt)
have avoided in the past.
Also, AFAIK, our documentation specifies (and existing behavior is) to
only return UNKNOWN in the case where we return a partial instance
because we couldn't look up the rest of the details from the cell. This
would break that relationship, and I'm not sure how people would know
that they shouldn't expect a full instance record, other than to poke it
with a stick to see if it contains certain properties.
...
+1 to doing this with a policy. I would prefer giving the
ability/choice to the operators to opt-out of it if they want to.
In general, I think we should try to avoid leaking things about the
infrastructure to regular users. In the case of a cell being down, we
couldn't really fake it because we don't have much of the information
available to us. I agree that a host being down is not that different
from a cell being down from the perspective of a user, but I also think
that allowing operators to opt-in to such a disclosure would be better,
although as above, I start to worry about the degrees of freedom in the
response.
My biggest concern, which came out during the host status discussion, is
that we should *not* say the instance is "down" just because the compute
service is unreachable. Saying it's in "unknown" state is better.
I'd like to hear from some more operators about whether they would
opt-in to this unknown-state behavior for compute host
down-age. Specifically, whether they want customer instances to show as
"unknown" state while they're doing an upgrade that otherwise wouldn't
impact the instance's health.
--Dan
Agree with Dan that I'd like some operator input on this thread before 
we consider making a change in behavior.

Having UNKNOWN mean either a down cell or a down compute service would 
also be confusing, as Dan mentions above, because vm_state being 
UNKNOWN is new as of Stein and applies only to the down-cell case.

Setting the 'nova list --fields' issue aside, we already have a 
workaround for this today, right? If I'm an operator and want to expose 
this information to my users, I configure nova's policy to have:

"os_compute_api:servers:show:host_status": "rule:admin_or_owner"

And then the user, with the proper microversion, can see the host status 
if the cloud allows it.
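
To make the user side of that workaround concrete, here is a minimal sketch of the request a client might build and how it could read the result. It assumes that host_status first appears in compute API microversion 2.16 and that the field is simply absent from the response when policy does not allow it. The endpoint, server ID, and token are hypothetical placeholders, and nothing is actually sent.

```python
def build_show_server_request(compute_endpoint, server_id, token,
                              microversion="2.16"):
    """Return (url, headers) for GET /servers/{server_id}.

    Assumption: 2.16 is the lowest microversion that returns host_status.
    All argument values are placeholders supplied by the caller.
    """
    url = f"{compute_endpoint.rstrip('/')}/servers/{server_id}"
    headers = {
        "X-Auth-Token": token,
        "Accept": "application/json",
        # Header form used to request a compute microversion.
        "X-OpenStack-Nova-API-Version": microversion,
    }
    return url, headers


def host_status_of(show_response_body):
    """Return host_status from a decoded show-server response, or None.

    None means the key was not present, for example because the cloud's
    policy does not expose it to this user.
    """
    return show_response_body.get("server", {}).get("host_status")
```

A client that gets None back cannot tell a restrictive policy apart from a microversion that is too low, so it should treat the value as "not disclosed" rather than as healthy.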

As an aside, I now realize we have had a nasty performance regression 
since Stein [1] in how the host_status field is populated when listing 
servers with details. The code used to rely on this method [2] to cache 
the host status per host while iterating over a list of instances, but 
now the view builder [3] fetches it again for every instance. Granted, 
with the default policy this only affects admins, but if I'm an admin 
listing 1000 servers across all tenants using "nova list --all-tenants" 
(which is going to use a microversion high enough to hit this), it could 
be a noticeable slowdown compared to before Stein. I'll open a bug.
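
The difference between the two lookup patterns can be illustrated with a small sketch. This is not nova's code: Instance, FakeHostAPI, and both functions are hypothetical stand-ins, and the fake API only counts how often it is asked for a host's status.

```python
from collections import namedtuple

# Hypothetical minimal instance record: just an ID and the host it runs on.
Instance = namedtuple("Instance", ["uuid", "host"])


class FakeHostAPI:
    """Stand-in for a host status lookup; counts how often it is called."""

    def __init__(self, statuses):
        self._statuses = statuses
        self.lookups = 0

    def get_host_status(self, host):
        self.lookups += 1
        return self._statuses.get(host, "UNKNOWN")


def statuses_per_instance(api, instances):
    """Look up the host status once for every instance."""
    return {inst.uuid: api.get_host_status(inst.host) for inst in instances}


def statuses_cached_per_host(api, instances):
    """Look up each distinct host once and reuse the result."""
    cache = {}
    result = {}
    for inst in instances:
        if inst.host not in cache:
            cache[inst.host] = api.get_host_status(inst.host)
        result[inst.uuid] = cache[inst.host]
    return result
```

With 1000 instances spread over 10 hosts, the first function performs 1000 lookups and the second performs 10, while both return the same mapping.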

[1] https://review.opendev.org/#/c/584590/
[2] https://github.com/openstack/nova/blob/c7e9e667426a6d88d396a59cb40d30763a326...
[3] https://github.com/openstack/nova/blob/c7e9e667426a6d88d396a59cb40d30763a326...

-- 

Thanks,

Matt

Matt Riedemann