On 5/23/2019 9:05 AM, Dan Smith wrote:
Question: do people think we should make the server status field reflect UNKNOWN as well, if the 'host_status' is UNKNOWN? And if so, should it be controlled by policy or no?
Do we have other things that change *value* depending on policy? I was thinking that was one of the situations the policy people (i.e. Matt) have avoided in the past.
Also, AFAIK, our documentation specifies (and existing behavior is) to only return UNKNOWN in the case where we return a partial instance because we couldn't look up the rest of the details from the cell. This would break that relationship, and I'm not sure how people would know that they shouldn't expect a full instance record, other than to poke it with a stick to see if it contains certain properties.
+1 to doing this with a policy. I would prefer giving the ability/choice to the operators to opt-out of it if they want to.
In general, I think we should try to avoid leaking things about the infrastructure to regular users. In the case of a cell being down, we couldn't really fake it because we don't have much of the information available to us. I agree that a host being down is not that different from a cell being down from the perspective of a user, but I also think that allowing operators to opt-in to such a disclosure would be better, although as above, I start to worry about the degrees of freedom in the response.
My biggest concern, which came out during the host status discussion, is that we should *not* say the instance is "down" just because the compute service is unreachable. Saying it's in "unknown" state is better.
I'd like to hear from some more operators about whether they would opt-in to this unknown-state behavior for compute host down-age. Specifically, whether they want customer instances to show as "unknown" state while they're doing an upgrade that otherwise wouldn't impact the instance's health.
--Dan
Agree with Dan that I'd like some operator input on this thread before we consider making a change in behavior. Making the UNKNOWN status mean either a down cell or a down compute service is also confusing, as Dan mentions above, because vm_state being UNKNOWN is only new as of Stein and is only for the down cell case.
With the 'nova list --fields' thing aside, we already have a workaround for this today, right? If I'm an operator and want to expose this information to my users, I configure nova's policy to have:
"os_compute_api:servers:show:host_status": "rule:admin_or_owner"
And then the user, with the proper microversion, can see the host status if the cloud allows it.
As an aside, I now realize we have a nasty performance regression since Stein [1] when listing servers with details concerning this host_status field. The code used to rely on this method [2] to cache the host status information per host when iterating over a list of instances, but now it fetches it per host per instance in the view builder [3]. Granted, by default policy this would only affect performance for an admin, but if I'm an admin listing 1000 servers across all tenants using "nova list --all-tenants" (which is going to use a microversion high enough to hit this), it could be a noticeable slowdown compared to before Stein. I'll open a bug.
[1] https://review.opendev.org/#/c/584590/
[2] https://github.com/openstack/nova/blob/c7e9e667426a6d88d396a59cb40d30763a326...
[3] https://github.com/openstack/nova/blob/c7e9e667426a6d88d396a59cb40d30763a326...
--
Thanks,
Matt
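To illustrate the caching difference described above, here is a minimal sketch. It is not nova's actual code: `Instance`, `FakeHostAPI`, `get_host_status`, and both helper functions are hypothetical names invented for this example, and the status strings are only placeholders. It just contrasts one lookup per instance with one lookup per distinct host.

```python
from collections import namedtuple

# Hypothetical, simplified instance record (not nova's Instance object).
Instance = namedtuple("Instance", ["uuid", "host"])


class FakeHostAPI:
    """Hypothetical stand-in for an expensive per-host status lookup."""

    def __init__(self, statuses):
        self._statuses = statuses
        self.calls = 0  # counts how many lookups were made

    def get_host_status(self, host):
        self.calls += 1
        return self._statuses.get(host, "UNKNOWN")


def statuses_per_instance(api, instances):
    # One lookup per instance, even when many instances share a host.
    return {inst.uuid: api.get_host_status(inst.host) for inst in instances}


def statuses_cached_per_host(api, instances):
    # One lookup per distinct host; later instances reuse the cached value.
    cache = {}
    result = {}
    for inst in instances:
        if inst.host not in cache:
            cache[inst.host] = api.get_host_status(inst.host)
        result[inst.uuid] = cache[inst.host]
    return result


if __name__ == "__main__":
    hosts = {"host-%d" % i: "UP" for i in range(10)}
    instances = [Instance("uuid-%d" % i, "host-%d" % (i % 10)) for i in range(1000)]

    slow = FakeHostAPI(hosts)
    statuses_per_instance(slow, instances)
    fast = FakeHostAPI(hosts)
    statuses_cached_per_host(fast, instances)
    print(slow.calls, fast.calls)
```

With 1000 instances spread over 10 hosts, the first helper makes 1000 lookups and the second makes 10, while both return the same mapping. How expensive each lookup is in a real deployment is a separate question this sketch does not answer.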