[nova][dev][ops] server status when compute host is down

melanie witt

23 May 2019

4:58 a.m.

Hey all,

I'm looking for feedback around whether we can improve how we show server status in server list and server show when the compute host it resides on is down.

When a compute host goes down while a server on it was previously running, the server status continues to show as ACTIVE in a server list. This is because the power state and status are adjusted by a periodic task run by nova-compute, so if nova-compute is down, it cannot update those states.

So, for an end user, when they do a server list, they see their server as ACTIVE when it's actually powered off.

We have another field called 'host_status' available since API microversion 2.16 [1] which is controlled by policy and defaults to admin, which is capable of showing the server status as UNKNOWN if the field is specified, for example:

nova list --fields id,name,status,task_state,power_state,networks,host_status

This is cool, but it is only available to admin by default, and it requires that the end user adds the field to their CLI command in the --fields option.

Question: do people think we should make the server status field reflect UNKNOWN as well, if the 'host_status' is UNKNOWN? And if so, should it be controlled by policy or no?

Normally, we do not expose compute host details to non-admin in the API by default, but I noticed recently that our "down cells" support will show server status as UNKNOWN if a server is in a down cell [2]. So I wondered if it would be considered OK to show UNKNOWN if a host is down as well, without defaulting it to admin-only.

I would really appreciate if people could share their opinion here and if consensus is in support, I will move forward with proposing a change accordingly.

Cheers,
-melanie

[1] https://docs.openstack.org/nova/latest/reference/api-microversion-history.ht...
[2] https://github.com/openstack/nova/blob/66a77f2fb75bbb9daebdca1cad0255ecafe41...

Matthew Booth

23 May

1:11 p.m.

On Thu, 23 May 2019 at 03:02, melanie witt <melwittt@gmail.com> wrote:

...

Hey all,

I'm looking for feedback around whether we can improve how we show server status in server list and server show when the compute host it resides on is down.

When a compute host goes down while a server on it was previously running, the server status continues to show as ACTIVE in a server list. This is because the power state and status is adjusted by a periodic task run by nova-compute, so if nova-compute is down, it cannot update those states.

So, for an end user, when they do a server list, they see their server as ACTIVE when it's actually powered off.

We have another field called 'host_status' available since API microversion 2.16 [1] which is controlled by policy and defaults to admin, which is capable of showing the server status as UNKNOWN if the field is specified, for example:

nova list --fields id,name,status,task_state,power_state,networks,host_status

This is cool, but it is only available to admin by default, and it requires that the end user adds the field to their CLI command in the --fields option.

Question: do people think we should make the server status field reflect UNKNOWN as well, if the 'host_status' is UNKNOWN? And if so, should it be controlled by policy or no?

Normally, we do not expose compute host details to non-admin in the API by default, but I noticed recently that our "down cells" support will show server status as UNKNOWN if a server is in a down cell [2]. So I wondered if it would be considered OK to show UNKNOWN if a host is down as well, without defaulting it to admin-only.

+1 from me. This seems to have confused users in the past and honest is better than potentially wrong, imho.

I can't think of a reason why this information 'leak' would cause any problems. Can anybody else?

Matt

...

I would really appreciate if people could share their opinion here and if consensus is in support, I will move forward with proposing a change accordingly.

Cheers, -melanie

[1] https://docs.openstack.org/nova/latest/reference/api-microversion-history.ht... [2] https://github.com/openstack/nova/blob/66a77f2fb75bbb9daebdca1cad0255ecafe41...

-- Matthew Booth Red Hat OpenStack Engineer, Compute DFG Phone: +442070094448 (UK)

Eric Fried

1:36 p.m.

+1 from me too.

...

I can't think of a reason why this information 'leak' would cause any problems. Can anybody else?

Me neither. But if controlled by policy, the paranoid admin can decide. efried

iain.macdonnell＠oracle.com

8:26 p.m.

On 5/23/19 3:11 AM, Matthew Booth wrote:

...

On Thu, 23 May 2019 at 03:02, melanie witt <melwittt@gmail.com> wrote:

...
Hey all,

I'm looking for feedback around whether we can improve how we show server status in server list and server show when the compute host it resides on is down.

When a compute host goes down while a server on it was previously running, the server status continues to show as ACTIVE in a server list. This is because the power state and status is adjusted by a periodic task run by nova-compute, so if nova-compute is down, it cannot update those states.

So, for an end user, when they do a server list, they see their server as ACTIVE when it's actually powered off.

We have another field called 'host_status' available since API microversion 2.16 [1] which is controlled by policy and defaults to admin, which is capable of showing the server status as UNKNOWN if the field is specified, for example:

nova list --fields id,name,status,task_state,power_state,networks,host_status

This is cool, but it is only available to admin by default, and it requires that the end user adds the field to their CLI command in the --fields option.

Question: do people think we should make the server status field reflect UNKNOWN as well, if the 'host_status' is UNKNOWN? And if so, should it be controlled by policy or no?

Normally, we do not expose compute host details to non-admin in the API by default, but I noticed recently that our "down cells" support will show server status as UNKNOWN if a server is in a down cell [2]. So I wondered if it would be considered OK to show UNKNOWN if a host is down as well, without defaulting it to admin-only.

+1 from me. This seems to have confused users in the past and honest is better than potentially wrong, imho. I can't think of a reason why this information 'leak' would cause any problems. Can anybody else?

Agreed. I don't think that a server status of "UNKNOWN" really constitutes "exposing compute host details". It's not sharing anything about *why* the server status is unknown - it's just not pretending that the last known status is still valid, when that may or may not actually be true. Or is the proposal to expose host_status where it would not normally be visible?

It seems that the down-host scenario is basically the same as down-cell, as far as being able to ascertain server status, so it seems to make sense to use the same indicator.

~iain

Balázs Gibizer

2:26 p.m.

On Thu, May 23, 2019 at 3:58 AM, melanie witt <melwittt@gmail.com> wrote:

...

Question: do people think we should make the server status field reflect UNKNOWN as well, if the 'host_status' is UNKNOWN? And if so, should it be controlled by policy or no?

Works for me. Cheers, gibi

Surya Seetharaman

2:39 p.m.

On Thu, May 23, 2019 at 3:59 AM melanie witt <melwittt@gmail.com> wrote:

...

Hey all,

Question: do people think we should make the server status field reflect UNKNOWN as well, if the 'host_status' is UNKNOWN? And if so, should it be controlled by policy or no?

+1 to doing this with a policy. I would prefer giving the ability/choice to the operators to opt-out of it if they want to. ----------- Regards, Surya.

Dan Smith

5:05 p.m.

...

Question: do people think we should make the server status field reflect UNKNOWN as well, if the 'host_status' is UNKNOWN? And if so, should it be controlled by policy or no?

Do we have other things that change *value* depending on policy? I was thinking that was one of the situations the policy people (i.e. Matt) have avoided in the past.

Also, AFAIK, our documentation specifies (and existing behavior is) to only return UNKNOWN in the case where we return a partial instance because we couldn't look up the rest of the details from the cell. This would break that relationship, and I'm not sure how people would know that they shouldn't expect a full instance record, other than to poke it with a stick to see if it contains certain properties.

...

+1 to doing this with a policy. I would prefer giving the ability/choice to the operators to opt-out of it if they want to.

In general, I think we should try to avoid leaking things about the infrastructure to regular users. In the case of a cell being down, we couldn't really fake it because we don't have much of the information available to us. I agree that a host being down is not that different from a cell being down from the perspective of a user, but I also think that allowing operators to opt-in to such a disclosure would be better, although as above, I start to worry about the degrees of freedom in the response.

My biggest concern, which came out during the host status discussion, is that we should *not* say the instance is "down" just because the compute service is unreachable. Saying it's in "unknown" state is better.

I'd like to hear from some more operators about whether they would opt-in to this unknown-state behavior for compute host down-age. Specifically, whether they want customer instances to show as "unknown" state while they're doing an upgrade that otherwise wouldn't impact the instance's health.

--Dan

Matt Riedemann

7:50 p.m.

On 5/23/2019 9:05 AM, Dan Smith wrote:

...

...
Question: do people think we should make the server status field reflect UNKNOWN as well, if the 'host_status' is UNKNOWN? And if so, should it be controlled by policy or no?

Do we have other things that change *value* depending on policy? I was thinking that was one of the situations the policy people (i.e. Matt) have avoided in the past.

Also, AFAIK, our documentation specifies (and existing behavior is) to only return UNKNOWN in the case where we return a partial instance because we couldn't look up the rest of the details from the cell. This would break that relationship, and I'm not sure how people would know that they shouldn't expect a full instance record, other than to poke it with a stick to see if it contains certain properties.

...
+1 to doing this with a policy. I would prefer giving the ability/choice to the operators to opt-out of it if they want to.

In general, I think we should try to avoid leaking things about the infrastructure to regular users. In the case of a cell being down, we couldn't really fake it because we don't have much of the information available to us. I agree that a host being down is not that different from a cell being down from the perspective of a user, but I also think that allowing operators to opt-in to such a disclosure would be better, although as above, I start to worry about the degrees of freedom in the response.

My biggest concern, which came out during the host status discussion, is that we should *not* say the instance is "down" just because the compute service is unreachable. Saying it's in "unknown" state is better.

I'd like to hear from some more operators about whether they would opt-in to this unknown-state behavior for compute host down-age. Specifically, whether they want customer instances to show as "unknown" state while they're doing an upgrade that otherwise wouldn't impact the instance's health.

--Dan

Agree with Dan that I'd like some operator input on this thread before we consider making a change in behavior. Changing the UNKNOWN status based on a down cell vs. the compute service being down is also confusing, as Dan mentions above, because vm_state being UNKNOWN is only new as of Stein and is only for the down cell case.

With the 'nova list --fields' thing aside, we already have a workaround for this today, right? If I'm an operator and want to expose this information to my users, I configure nova's policy to have:

"os_compute_api:servers:show:host_status": "rule:admin_or_owner"

And then the user, with the proper microversion, can see the host status if the cloud allows it.

As an aside, I now realize we have a nasty performance regression since Stein [1] when listing servers with details concerning this host_status field. The code used to rely on this method [2] to cache the host status information per host when iterating over a list of instances but now it fetches it per host per instance in the view builder [3]. Granted by default policy this would only affect performance for an admin, but if I'm an admin listing 1000 servers across all tenants using "nova list --all-tenants" (which is going to use a microversion high enough to hit this) it could be a noticeable slow down compared to before Stein. I'll open a bug.

[1] https://review.opendev.org/#/c/584590/
[2] https://github.com/openstack/nova/blob/c7e9e667426a6d88d396a59cb40d30763a326...
[3] https://github.com/openstack/nova/blob/c7e9e667426a6d88d396a59cb40d30763a326...

-- Thanks, Matt
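To make the performance point concrete, here is a minimal sketch of the difference between looking up host status once per instance and once per host. It is not nova's code: the function names, the (uuid, host) tuples, and the counting lookup are all hypothetical stand-ins for whatever the view builder actually calls.

```python
from typing import Callable, Dict, List, Tuple

Instance = Tuple[str, str]  # (instance uuid, compute host name); illustrative shape only


def host_status_uncached(instances: List[Instance],
                         lookup: Callable[[str], str]) -> List[Tuple[str, str]]:
    """One lookup per instance, even when many instances share a host."""
    return [(uuid, lookup(host)) for uuid, host in instances]


def host_status_cached(instances: List[Instance],
                       lookup: Callable[[str], str]) -> List[Tuple[str, str]]:
    """One lookup per distinct host; later instances reuse the cached value."""
    cache: Dict[str, str] = {}
    result = []
    for uuid, host in instances:
        if host not in cache:
            cache[host] = lookup(host)
        result.append((uuid, cache[host]))
    return result


def make_counting_lookup(statuses: Dict[str, str]):
    """Hypothetical lookup that records every call so the cost can be compared."""
    calls: List[str] = []

    def lookup(host: str) -> str:
        calls.append(host)
        return statuses.get(host, "UNKNOWN")

    return lookup, calls


if __name__ == "__main__":
    instances = [(f"uuid-{i}", f"host-{i % 10}") for i in range(1000)]
    lookup, calls = make_counting_lookup({"host-0": "UP"})
    host_status_uncached(instances, lookup)
    print("uncached lookups:", len(calls))
    lookup, calls = make_counting_lookup({"host-0": "UP"})
    host_status_cached(instances, lookup)
    print("cached lookups:", len(calls))
```

With 1000 instances spread over 10 hosts, the uncached variant performs one lookup per instance while the cached variant performs one per host, which is the kind of gap described above.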

Matt Riedemann

8 p.m.

On 5/23/2019 11:50 AM, Matt Riedemann wrote:

...

As an aside, I now realize we have a nasty performance regression since Stein [1] when listing servers with details concerning this host_status field. The code used to rely on this method [2] to cache the host status information per host when iterating over a list of instances but now it fetches it per host per instance in the view builder [3]. Granted by default policy this would only affect performance for an admin, but if I'm an admin listing 1000 servers across all tenants using "nova list --all-tenants" (which is going to use a microversion high enough to hit this) it could be a noticeable slow down compared to before Stein. I'll open a bug.

[1] https://review.opendev.org/#/c/584590/ [2] https://github.com/openstack/nova/blob/c7e9e667426a6d88d396a59cb40d30763a326...

[3] https://github.com/openstack/nova/blob/c7e9e667426a6d88d396a59cb40d30763a326...

https://bugs.launchpad.net/nova/+bug/1830260 -- Thanks, Matt

Matt Riedemann

9:32 p.m.

On 5/22/2019 8:58 PM, melanie witt wrote:

...

So, for an end user, when they do a server list, they see their server as ACTIVE when it's actually powered off.

Well, it might be powered off, we don't know. If nova-compute is down the guest could still be running if the hypervisor is running.

...

We have another field called 'host_status' available since API microversion 2.16 [1] which is controlled by policy and defaults to admin, which is capable of showing the server status as UNKNOWN if the field is specified, for example:

nova list --fields id,name,status,task_state,power_state,networks,host_status

This is cool, but it is only available to admin by default, and it requires that the end user adds the field to their CLI command in the --fields option.

As I said elsewhere in this thread, if you're proposing to add a new policy rule to change the 'status' field based on host_status, why not just tell people to open up the policy rule we already have for the host_status field so non-admins can see it in their server details? This sounds like an education problem more than a technical problem to me.

Also, --fields is one thing on one interface to the API. Microversions are opt-in on purpose to avoid backward incompatible and behavior changes to the client, so if the client has a need to know this information, they can opt into getting it via the host_status field by using the 2.16 microversion or higher. That's the case for any microversion that adds new fields like the embedded instance.flavor details in 2.47 - we didn't just say "let's add a new policy rule to expose those details".

...

Question: do people think we should make the server status field reflect UNKNOWN as well, if the 'host_status' is UNKNOWN? And if so, should it be controlled by policy or no?

I'm going to vote no given we have a way to determine this already, as noted above.

...

Normally, we do not expose compute host details to non-admin in the API by default, but I noticed recently that our "down cells" support will show server status as UNKNOWN if a server is in a down cell [2]. So I wondered if it would be considered OK to show UNKNOWN if a host is down as well, without defaulting it to admin-only.

The down-cell UNKNOWN stuff is also opt-in behavior using the 2.69 microversion. I would likely only get behind changing the behavior of the 'status' field based on the compute service status in a new microversion, and then we have to talk about whether or not the response should mirror the down-cell case where we return partial results. That all sounds like a lot more work than just educating people about the host_status field and the existing policy rule to expose it. -- Thanks, Matt
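As an illustration of the opt-in described above, the following sketch shows how client-side tooling might request a microversion and then read host_status defensively. The helper names and the token value are hypothetical; the only things taken from the thread are that host_status needs microversion 2.16 or higher and that policy may hide it from the caller.

```python
from typing import Dict, Optional, Tuple


def parse_microversion(version: str) -> Tuple[int, ...]:
    """Turn '2.16' into (2, 16) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))


def exposes_host_status(version: str) -> bool:
    """host_status is available from microversion 2.16 onward (per the thread)."""
    return parse_microversion(version) >= (2, 16)


def compute_api_headers(token: str, microversion: str = "2.16") -> Dict[str, str]:
    """Headers for a compute API request that opts in to a given microversion."""
    return {
        "X-Auth-Token": token,
        "X-OpenStack-Nova-API-Version": microversion,
    }


def host_status_of(server: Dict[str, object]) -> Optional[str]:
    """Return host_status if the response includes it, else None.

    The field can be absent because the request used an older microversion
    or because policy does not allow the caller to see it.
    """
    value = server.get("host_status")
    return value if isinstance(value, str) else None
```

Comparing versions as integer tuples matters here: a plain string comparison would rank "2.9" above "2.16".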

Dan Smith

9:47 p.m.

...

As I said elsewhere in this thread, if you're proposing to add a new policy rule to change the 'status' field based on host_status, why not just tell people to open up the policy rule we already have for the host_status field so non-admins can see it in their server details? This sounds like an education problem more than a technical problem to me.

Yeah, I'm much more in favor of this, unsurprisingly. It also avoids the case where a script is polling for an instance's state, and if it becomes anything other than ACTIVE, it takes action or wakes someone up. If you've just taken the compute service down for an upgrade (or rabbit took a dump) you don't end up freaking out because "the instance has changed state" which is what that looks like from the outside. If you _want_ to take action based on the host's state, then you look at that attribute (if allowed) and make decisions thusly.
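A rough sketch of the polling scenario, assuming a script that receives server records as plain dicts. The function name and the three labels are invented for illustration; the point is only that 'status' and 'host_status' are examined separately, so a compute service outage does not look like an instance state change.

```python
from typing import Dict


def classify(server: Dict[str, str]) -> str:
    """Classify one polled server record (hypothetical labels).

    'status' drives the instance-level decision; 'host_status' is consulted
    only if the response includes it (policy and microversion permitting).
    """
    if server.get("status") != "ACTIVE":
        # The instance itself reports a different state: act on it.
        return "instance-state-changed"
    if server.get("host_status") == "UNKNOWN":
        # Instance still reported ACTIVE, but the host cannot be reached.
        return "host-unreachable"
    return "ok"
```

Under the behavior discussed in the original proposal, the second case would instead surface as a changed 'status', and a script that pages someone on anything other than ACTIVE would fire during a routine compute service restart.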

...

Also, --fields is one thing on one interface to the API. Microversions are opt-in on purpose to avoid backward incompatible and behavior changes to the client, so if the client has a need to know this information, they can opt into getting it via the host_status field by using the 2.16 microversion or higher. That's the case for any microversion that adds new fields like the embedded instance.flavor details in 2.47 - we didn't just say "let's add a new policy rule to expose those details".

Clearly we couldn't return the UNKNOWN state if the request was from before whatever microversion we enable this in.

...

The down-cell UNKNOWN stuff is also opt-in behavior using the 2.69 microversion. I would likely only get behind changing the behavior of the 'status' field based on the compute service status in a new microversion, and then we have to talk about whether or not the response should mirror the down-cell case where we return partial results. That all sounds like a lot more work than just educating people about the host_status field and the existing policy rule to expose it.

I actually think if we're going to do this, we *should* make compute-down mirror cell-down in terms of what we return. I think that's unfortunate, mind you, but otherwise we'd be effectively re-writing what we said in the down-cell microversion, going from "If it's UNKNOWN, expect the instance to look like the minimal version" to "Well, that depends...". It would mean that something using the later microversion would no longer be able to check for UNKNOWN to determine if there's a full instance to look at, and instead would have to poke for keys. --Dan
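The two client-side checks being contrasted here can be sketched as follows. This is only an illustration: the key names passed as expected_keys are placeholders chosen for the example, not the actual set of attributes omitted from a down-cell response.

```python
from typing import Dict, Iterable


def is_full_record_by_status(server: Dict[str, object]) -> bool:
    """Down-cell contract as described: UNKNOWN implies a minimal record."""
    return server.get("status") != "UNKNOWN"


def is_full_record_by_keys(server: Dict[str, object],
                           expected_keys: Iterable[str]) -> bool:
    """'Poking for keys': the record is full only if every expected key exists."""
    return all(key in server for key in expected_keys)


# Placeholder keys for the example; not the real list of omitted attributes.
EXPECTED = ("name", "addresses")

# A minimal record, as a down cell might produce.
partial = {"id": "abc", "status": "UNKNOWN"}

# A hypothetical full record whose status is UNKNOWN only because the
# compute service is unreachable (the case the proposal would introduce).
full_but_unknown = {"id": "def", "status": "UNKNOWN",
                    "name": "vm1", "addresses": {}}
```

For the second record the status-based check says "not full" while the key-based check says "full", which is the ambiguity raised above if compute-down did not mirror cell-down.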

Matt Riedemann

9:53 p.m.

On 5/23/2019 1:47 PM, Dan Smith wrote:

...

It also avoids the case where a script is polling for an instance's state, and if it becomes anything other than ACTIVE, it takes action or wakes someone up. If you've just taken the compute service down for an upgrade (or rabbit took a dump) you don't end up freaking out because "the instance has changed state" which is what that looks like from the outside. If you _want_ to take action based on the host's state, then you look at that attribute (if allowed) and make decisions thusly.

This raises another concern - if the UNKNOWN status is not baked into the instance.vm_state itself, then what do you do about notifications that nova is sending out? Would those also be checking the host status and changing the instance status in the notification payload to UNKNOWN?

Anyway, it's stuff like this that requires a lot more thought than just deciding on a whim that we'd like some behavior change in the API (note that neither the original email nor any of the people agreeing with it in this thread said anything about a new microversion). Rather than deal with all of these side effects, just explain to people that need this information how to configure their cloud to expose it and how to write their client side tooling to get it.

-- Thanks, Matt

iain.macdonnell＠oracle.com

9:56 p.m.

On 5/23/19 11:32 AM, Matt Riedemann wrote:

...

As I said elsewhere in this thread, if you're proposing to add a new policy rule to change the 'status' field based on host_status, why not just tell people to open up the policy rule we already have for the host_status field so non-admins can see it in their server details? This sounds like an education problem more than a technical problem to me.

Because *that* implies revealing infrastructure status details to end-users, which is probably not desirable in a lot of cases.

Isn't this as simple as not lying to the user about the *server* status when it cannot be ascertained for any reason? In that case, the user should be given (only) that information, but not any "dirty laundry" about what caused it....

Even if the admin doesn't care about revealing infrastructure status, the end-user shouldn't have to know that server_status can't be trusted, and that they have to check other fields to figure out if it's reliable or not at any given time.

~iain

melanie witt

11:08 p.m.

On Thu, 23 May 2019 11:56:34 -0700, Iain Macdonnell <iain.macdonnell@oracle.com> wrote:

...

On 5/23/19 11:32 AM, Matt Riedemann wrote:

...
As I said elsewhere in this thread, if you're proposing to add a new policy rule to change the 'status' field based on host_status, why not just tell people to open up the policy rule we already have for the host_status field so non-admins can see it in their server details? This sounds like an education problem more than a technical problem to me.

Because *that* implies revealing infrastructure status details to end-users, which is probably not desirable in a lot of cases.

This is a good point. If an operator were to enable 'host_status' via policy, end users would also get to see host_status UP and DOWN, which is typically not desired by cloud admins. There's currently no option for exposing only UNKNOWN, as a small but helpful bit of info for end users.

...

Isn't this as simple as not lying to the user about the *server* status when it cannot be ascertained for any reason? In that case, the user should be given (only) that information, but not any "dirty laundry" about what caused it....

Even if the admin doesn't care about revealing infrastructure status, the end-user shouldn't have to know that server_status can't be trusted, and that they have to check other fields to figure out if it's reliable or not at any given time.

And yes, I was thinking about it more simply, and the replies on this thread have led me to think that if we could show the cosmetic-only status of UNKNOWN for nova-compute communication interruptions, similar to what we do for down cells, we would not put a policy control on it (since UNKNOWN is not leaking infra details). And not make any changes to notifications etc, just a cosmetic-only UNKNOWN status implemented at the REST API layer if host_status is UNKNOWN. I was thinking maybe we'd leave server status alone if host_status is UP or DOWN since its status should be reflected in those cases as-is.

Assuming we could move forward without a policy control on it, I think the only remaining concern would be the collision of UNKNOWN status with down cells where for down cells, some server attributes are not available. Personally, this doesn't seem like a major problem to me since UNKNOWN implies an uncertain state, in general. But maybe I'm wrong. How important is the difference?

Finally, it sounds like the consensus is that if we do decide to make this change, we would need a new microversion to account for server status being able to be UNKNOWN if host_status is UNKNOWN.

-melanie
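The cosmetic-only idea summarized above can be sketched as a small pure function. Everything here is hypothetical: the function name, the tuple form of the microversion, and in particular the minimum microversion, which the caller must supply because the thread does not name one.

```python
from typing import Tuple

Microversion = Tuple[int, int]


def displayed_status(stored_status: str,
                     host_status: str,
                     requested: Microversion,
                     minimum: Microversion) -> str:
    """Return the status to show in the API response (illustrative only).

    - Requests older than `minimum` see the stored status unchanged,
      since the new behavior would be opt-in via microversion.
    - If host_status is UNKNOWN, show UNKNOWN instead of the stored status.
    - If host_status is UP or DOWN (or anything else), leave status alone.

    Nothing is written back: the stored state and any notifications keep
    the original value, which is what "cosmetic-only" means here.
    """
    if requested < minimum:
        return stored_status
    if host_status == "UNKNOWN":
        return "UNKNOWN"
    return stored_status
```

For example, with a placeholder minimum of (2, 99), a stored ACTIVE status with host_status UNKNOWN is shown as UNKNOWN to a (2, 99) request and as ACTIVE to a (2, 16) request.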

melanie witt

19 Jun

1:40 a.m.

On 5/23/19 1:08 PM, melanie witt wrote:

...

On Thu, 23 May 2019 11:56:34 -0700, Iain Macdonnell <iain.macdonnell@oracle.com> wrote:

...
On 5/23/19 11:32 AM, Matt Riedemann wrote:

...
As I said elsewhere in this thread, if you're proposing to add a new policy rule to change the 'status' field based on host_status, why not just tell people to open up the policy rule we already have for the host_status field so non-admins can see it in their server details? This sounds like an education problem more than a technical problem to me.

Because *that* implies revealing infrastructure status details to end-users, which is probably not desirable in a lot of cases.

This is a good point. If an operator were to enable 'host_status' via policy, end users would also get to see host_status UP and DOWN, which is typically not desired by cloud admins. There's currently no option for exposing only UNKNOWN, as a small but helpful bit of info for end users.

...
Isn't this as simple as not lying to the user about the *server* status when it cannot be ascertained for any reason? In that case, the user should be given (only) that information, but not any "dirty laundry" about what caused it....

Even if the admin doesn't care about revealing infrastructure status, the end-user shouldn't have to know that server_status can't be trusted, and that they have to check other fields to figure out if it's reliable or not at any given time.

And yes, I was thinking about it more simply, and the replies on this thread have led me to think that if we could show the cosmetic-only status of UNKNOWN for nova-compute communication interruptions, similar to what we do for down cells, we would not put a policy control on it (since UNKNOWN is not leaking infra details). And not make any changes to notifications etc, just a cosmetic-only UNKNOWN status implemented at the REST API layer if host_status is UNKNOWN. I was thinking maybe we'd leave server status alone if host_status is UP or DOWN since its status should be reflected in those cases as-is.

Assuming we could move forward without a policy control on it, I think the only remaining concern would be the collision of UNKNOWN status with down cells where for down cells, some server attributes are not available. Personally, this doesn't seem like a major problem to me since UNKNOWN implies an uncertain state, in general. But maybe I'm wrong. How important is the difference?

Finally, it sounds like the consensus is that if we do decide to make this change, we would need a new microversion to account for server status being able to be UNKNOWN if host_status is UNKNOWN.

FYI, I've proposed a spec here: https://review.opendev.org/666181

-melanie

14 comments

8 participants

participants (8)

Balázs Gibizer
Dan Smith
Eric Fried
iain.macdonnell＠oracle.com
Matt Riedemann
Matthew Booth
melanie witt
Surya Seetharaman