[nova] Validation for requested host/node on server create
It seems we've come to an impasse on this change [1] because of a concern about where to validate the requested host and/or hypervisor_hostname. The change is currently validating in the API to provide a fast fail 400 response to the user if the host and/or node don't exist. The concern is that the lookup for the compute node will be done again in the scheduler [2]. Also, if the host is not provided, then to validate the node we have to iterate the cells looking for the given compute node (we could use placement though, more on that below).

I've added this to the nova meeting agenda for tomorrow but wanted to try to enumerate what I see as the options before the meeting so we don't have to recap all of this during the meeting.

The options as I see them are:

1. Omit the validation in the API and let the scheduler do the validation.

Pros: no performance impact in the API when creating server(s)

Cons: if the host/node does not exist, the user will get a 202 response and eventually a NoValidHost error, which is not a great user experience, although it is what happens today with the availability_zone forced host/node behavior we already have [3] so maybe it's acceptable.

2. Only validate the host in the API, since we can look up the HostMapping in the API DB. If the user also provided a node then we'd just throw that on the RequestSpec and let the scheduler code validate it.

Pros: basic validation for the simple and probably most widely used case, since for non-baremetal instances the host and node are going to be the same

Cons: we could still have a late failure in the scheduler with a NoValidHost error; does not cover the case where only a node (no host) is specified

3. Validate both the host and node in the API. This can be broken down:

a) If only host is specified, do #2 above.
b) If only node is specified, iterate the cells looking for the node (or query a resource provider with that name in placement, which would avoid down-cell issues).
c) If both host and node are specified, get the HostMapping and from that look up the ComputeNode in the given cell (per the HostMapping).

(A sketch of this flow appears after this message.)

Pros: fail fast behavior in the API if the host and/or node do not exist

Cons: performance hit in the API to validate the host/node, and redundancy with the scheduler to find the ComputeNode to get its uuid for the in_tree filtering on GET /allocation_candidates.

Note that if we do find the ComputeNode in the API, we could also (later?) make a change to the Destination object to add a node_uuid field so we can pass that through on the RequestSpec from API->conductor->scheduler, and that should remove the need for the duplicate query in the scheduler code for the in_tree logic.

I'm personally in favor of option 3 since we know that users hate NoValidHost errors and we have ways to mitigate the performance overhead of that validation.

Note that this isn't necessarily something that has to happen in the same change that introduces the host/hypervisor_hostname parameters to the API. If we do the validation in the API I'd probably split the validation logic into its own patch to make it easier to test and review on its own.

[1] https://review.opendev.org/#/c/645520/
[2] https://github.com/openstack/nova/blob/2e85453879533af0b4d0e1178797d26f026a9...
[3] https://docs.openstack.org/nova/latest/admin/availability-zones.html

--

Thanks,

Matt
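[Editor's note: for illustration, a minimal sketch of what the option 3 checks could look like in the API. This is not the patch under review; it assumes nova's HostMapping/ComputeNode object helpers (get_by_host, get_by_host_and_nodename) and the nova.context.target_cell context manager.]

# A rough sketch, not the code under review.
from webob import exc

from nova import context as nova_context
from nova import exception
from nova import objects


def validate_requested_host_node(ctxt, host=None, node=None):
    """Fail fast with a 400 if the requested host/node doesn't exist."""
    if host:
        # Cases (a)/(c): the HostMapping in the API DB proves the host
        # exists and tells us which cell it lives in.
        try:
            hm = objects.HostMapping.get_by_host(ctxt, host)
        except exception.HostMappingNotFound:
            raise exc.HTTPBadRequest(
                explanation='Compute host %s not found.' % host)
        if node:
            # Case (c): look up the ComputeNode in the mapped cell only.
            with nova_context.target_cell(ctxt, hm.cell_mapping) as cctxt:
                try:
                    objects.ComputeNode.get_by_host_and_nodename(
                        cctxt, host, node)
                except exception.ComputeHostNotFound:
                    raise exc.HTTPBadRequest(
                        explanation='Compute node %s not found.' % node)
    elif node:
        # Case (b): with no host we'd either iterate all cells looking
        # for the node, or ask placement for a resource provider with
        # that name (a sketch of the placement check appears later in
        # the thread).
        raise NotImplementedError()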
On Wed, 22 May 2019 17:13:48 -0500, Matt Riedemann <mriedemos@gmail.com> wrote:
It seems we've come to an impasse on this change [1] because of a concern about where to validate the requested host and/or hypervisor_hostname.
The change is currently validating in the API to provide a fast fail 400 response to the user if the host and/or node don't exist. The concern is that the lookup for the compute node will be done again in the scheduler [2]. Also, if the host is not provided, then to validate the node we have to iterate the cells looking for the given compute node (we could use placement though, more on that below).
I've added this to the nova meeting agenda for tomorrow but wanted to try to enumerate what I see as the options before the meeting so we don't have to recap all of this during the meeting.
The options as I see them are:
1. Omit the validation in the API and let the scheduler do the validation.
Pros: no performance impact in the API when creating server(s)
Cons: if the host/node does not exist, the user will get a 202 response and eventually a NoValidHost error which is not a great user experience, although it is what happens today with the availability_zone forced host/node behavior we already have [3] so maybe it's acceptable.
2. Only validate host in the API since we can look up the HostMapping in the API DB. If the user also provided a node then we'd just throw that on the RequestSpec and let the scheduler code validate it.
Pros: basic validation for the simple and probably most widely used case since for non-baremetal instances the host and node are going to be the same
Cons: we could still have a late failure in the scheduler with a NoValidHost error; does not cover the case where only a node (no host) is specified
3. Validate both the host and node in the API. This can be broken down:
a) If only host is specified, do #2 above.
b) If only node is specified, iterate the cells looking for the node (or query a resource provider with that name in placement, which would avoid down-cell issues).
c) If both host and node are specified, get the HostMapping and from that look up the ComputeNode in the given cell (per the HostMapping).
Pros: fail fast behavior in the API if the host and/or node do not exist
Cons: performance hit in the API to validate the host/node and redundancy with the scheduler to find the ComputeNode to get its uuid for the in_tree filtering on GET /allocation_candidates.
Note that if we do find the ComputeNode in the API, we could also (later?) make a change to the Destination object to add a node_uuid field so we can pass that through on the RequestSpec from API->conductor->scheduler and that should remove the need for the duplicate query in the scheduler code for the in_tree logic.
I'm personally in favor of option 3 since we know that users hate NoValidHost errors and we have ways to mitigate the performance overhead of that validation.
Count me in the option 3 boat too, for the same reasons. Rather avoid NoValidHost and there's mitigation we can do for the perf issue. -melanie
Note that this isn't necessarily something that has to happen in the same change that introduces the host/hypervisor_hostname parameters to the API. If we do the validation in the API I'd probably split the validation logic into its own patch to make it easier to test and review on its own.
[1] https://review.opendev.org/#/c/645520/
[2] https://github.com/openstack/nova/blob/2e85453879533af0b4d0e1178797d26f026a9...
[3] https://docs.openstack.org/nova/latest/admin/availability-zones.html
On Thu, May 23, 2019 at 12:16 AM Matt Riedemann <mriedemos@gmail.com> wrote:
1. Omit the validation in the API and let the scheduler do the validation.
Pros: no performance impact in the API when creating server(s)
Cons: if the host/node does not exist, the user will get a 202 response and eventually a NoValidHost error which is not a great user experience, although it is what happens today with the availability_zone forced host/node behavior we already have [3] so maybe it's acceptable.
What I had in mind when suggesting this was to return a Host/NodeNotFound exception from the host_manager [1] if the host or node doesn't exist, instead of conflating that with the NoValidHost exception (which is usually associated with host capacity) when it's actually not a NoValidHost. I know that it has already been implemented as a NoValidHost [2] but we could change this.
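[Editor's note: a sketch of the distinction being described here. ComputeHostNotFound is a real nova exception; the node-specific class below is hypothetical, shown only to illustrate the idea.]

from nova import exception


class ComputeNodeNotFound(exception.NovaException):
    """Hypothetical new exception for the missing-node case."""
    msg_fmt = "Compute node %(node)s could not be found."


def check_destination_exists(compute_nodes, host=None, node=None):
    """Distinguish "doesn't exist" from "exists but lacks capacity".

    NoValidHost stays reserved for the capacity case; a host or node
    that simply doesn't exist raises a more specific error instead.
    """
    if host and not any(cn.host == host for cn in compute_nodes):
        raise exception.ComputeHostNotFound(host=host)
    if node and not any(cn.hypervisor_hostname == node
                        for cn in compute_nodes):
        raise ComputeNodeNotFound(node=node)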
3. Validate both the host and node in the API. This can be broken down:
a) If only host is specified, do #2 above.
b) If only node is specified, iterate the cells looking for the node (or query a resource provider with that name in placement, which would avoid down-cell issues).
c) If both host and node are specified, get the HostMapping and from that look up the ComputeNode in the given cell (per the HostMapping).
Pros: fail fast behavior in the API if the host and/or node do not exist
Cons: performance hit in the API to validate the host/node and redundancy with the scheduler to find the ComputeNode to get its uuid for the in_tree filtering on GET /allocation_candidates.
I don't mind if we did this, as long as we don't hit all the cells twice (in the API and the scheduler), which like you said could be avoided by going to placement.
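[Editor's note: a sketch of the placement check, using the placement REST API's name filter on GET /resource_providers; in nova this would go through the scheduler report client rather than a raw keystoneauth adapter, which is assumed here to be preconfigured for the placement service.]

from urllib.parse import quote


def node_exists_in_placement(client, node_name):
    """Return True if a resource provider with this name exists.

    `client` is assumed to be a keystoneauth1 Adapter pointed at
    placement. Placement is a single global service, so one query
    replaces iterating every (possibly down) cell database.
    """
    resp = client.get('/resource_providers?name=' + quote(node_name),
                      raise_exc=False)
    return resp.status_code == 200 and bool(
        resp.json()['resource_providers'])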
Note that if we do find the ComputeNode in the API, we could also (later?) make a change to the Destination object to add a node_uuid field so we can pass that through on the RequestSpec from API->conductor->scheduler and that should remove the need for the duplicate query in the scheduler code for the in_tree logic.
I guess we discussed this in a similar(ly different) situation and decided against it [3].

[1] https://github.com/openstack/nova/blob/c7e9e667426a6d88d396a59cb40d30763a326...
[2] https://github.com/openstack/nova/blob/c7e9e667426a6d88d396a59cb40d30763a326...
[3] https://review.opendev.org/#/c/646029/4/specs/train/approved/use-placement-i...

--

Regards,

Surya.
On 5/23/2019 9:46 AM, Surya Seetharaman wrote:
1. Omit the validation in the API and let the scheduler do the validation.
Pros: no performance impact in the API when creating server(s)
Cons: if the host/node does not exist, the user will get a 202 response and eventually a NoValidHost error which is not a great user experience, although it is what happens today with the availability_zone forced host/node behavior we already have [3] so maybe it's acceptable.
What I had in mind when suggesting this was to return a Host/NodeNotFound exception from the host_manager [1] if the host or node doesn't exist, instead of conflating that with the NoValidHost exception (which is usually associated with host capacity) when it's actually not a NoValidHost. I know that it has already been implemented as a NoValidHost [2] but we could change this.
The point is that by the time we hit this, we've given the user a 202 and eventually scheduling is going to fail. It doesn't matter if it's NoValidHost or HostNotFound or NodeNotFound or MyToiletIsBroken: the server is going to go to ERROR state, and the user has to figure out why from the fault information, which is poor UX IMO.
3. Validate both the host and node in the API. This can be broken down:
a) If only host is specified, do #2 above.
b) If only node is specified, iterate the cells looking for the node (or query a resource provider with that name in placement, which would avoid down-cell issues).
c) If both host and node are specified, get the HostMapping and from that look up the ComputeNode in the given cell (per the HostMapping).
Pros: fail fast behavior in the API if the host and/or node do not exist
Cons: performance hit in the API to validate the host/node and redundancy with the scheduler to find the ComputeNode to get its uuid for the in_tree filtering on GET /allocation_candidates.
I don't mind if we did this, as long as we don't hit all the cells twice (in the API and the scheduler), which like you said could be avoided by going to placement.
Yeah, I think we can more efficiently check for the node using placement (this was Sean Mooney's idea while we were talking about it yesterday in IRC).
Note that if we do find the ComputeNode in the API, we could also (later?) make a change to the Destination object to add a node_uuid field so we can pass that through on the RequestSpec from API->conductor->scheduler and that should remove the need for the duplicate query in the scheduler code for the in_tree logic.
I guess we discussed this in a similar(ly different) situation and decided against it [3].
I'm having a hard time dredging up the context on that conversation, but unless I'm mistaken I think that was about the RequestGroup vs. the Destination object. Because of when and where the RequestGroup stuff happens today, we can't really use that from the API to set in_tree early, which is why the API code is only setting the RequestSpec.requested_destination (Destination object) field with the host/node values.

--

Thanks,

Matt
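[Editor's note: a sketch of the Destination handling being discussed. Destination and RequestSpec.requested_destination are real nova objects/fields; the node_uuid field is the hypothetical future addition Matt mentions, not an existing field.]

from nova import objects


def set_requested_destination(request_spec, host, node, node_uuid=None):
    dest = objects.Destination(host=host, node=node)
    if node_uuid is not None:
        # Hypothetical new field: would let the scheduler build the
        # in_tree filter without re-querying the cell DB for the
        # ComputeNode.
        dest.node_uuid = node_uuid
    request_spec.requested_destination = dest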
On 5/22/2019 5:13 PM, Matt Riedemann wrote:
3. Validate both the host and node in the API. This can be broken down:
a) If only host is specified, do #2 above.
b) If only node is specified, iterate the cells looking for the node (or query a resource provider with that name in placement, which would avoid down-cell issues).
c) If both host and node are specified, get the HostMapping and from that look up the ComputeNode in the given cell (per the HostMapping).
Pros: fail fast behavior in the API if the host and/or node do not exist
Cons: performance hit in the API to validate the host/node and redundancy with the scheduler to find the ComputeNode to get its uuid for the in_tree filtering on GET /allocation_candidates.
Note that if we do find the ComputeNode in the API, we could also (later?) make a change to the Destination object to add a node_uuid field so we can pass that through on the RequestSpec from API->conductor->scheduler and that should remove the need for the duplicate query in the scheduler code for the in_tree logic.
I'm personally in favor of option 3 since we know that users hate NoValidHost errors and we have ways to mitigate the performance overhead of that validation.
Note that this isn't necessarily something that has to happen in the same change that introduces the host/hypervisor_hostname parameters to the API. If we do the validation in the API I'd probably split the validation logic into its own patch to make it easier to test and review on its own.
[1] https://review.opendev.org/#/c/645520/
[2] https://github.com/openstack/nova/blob/2e85453879533af0b4d0e1178797d26f026a9...
[3] https://docs.openstack.org/nova/latest/admin/availability-zones.html
Per the nova meeting today [1] it sounds like we're going to go with option 3 and do the validation in the API: check the HostMapping for the host, check placement for the node; we can optimize away the redundant scheduler calculation for in_tree later. For review and test sanity I ask that the API validation code come in a separate patch in the series.

[1] http://eavesdrop.openstack.org/meetings/nova/2019/nova.2019-05-23-21.00.log....

--

Thanks,

Matt
---- On Fri, 24 May 2019 07:02:15 +0900 Matt Riedemann <mriedemos@gmail.com> wrote ----
On 5/22/2019 5:13 PM, Matt Riedemann wrote:
3. Validate both the host and node in the API. This can be broken down:
a) If only host is specified, do #2 above.
b) If only node is specified, iterate the cells looking for the node (or query a resource provider with that name in placement, which would avoid down-cell issues).
c) If both host and node are specified, get the HostMapping and from that look up the ComputeNode in the given cell (per the HostMapping).
Pros: fail fast behavior in the API if the host and/or node do not exist
Cons: performance hit in the API to validate the host/node and redundancy with the scheduler to find the ComputeNode to get its uuid for the in_tree filtering on GET /allocation_candidates.
Note that if we do find the ComputeNode in the API, we could also (later?) make a change to the Destination object to add a node_uuid field so we can pass that through on the RequestSpec from API->conductor->scheduler and that should remove the need for the duplicate query in the scheduler code for the in_tree logic.
I'm personally in favor of option 3 since we know that users hate NoValidHost errors and we have ways to mitigate the performance overhead of that validation.
Note that this isn't necessarily something that has to happen in the same change that introduces the host/hypervisor_hostname parameters to the API. If we do the validation in the API I'd probably split the validation logic into its own patch to make it easier to test and review on its own.
[1] https://review.opendev.org/#/c/645520/ [2] https://github.com/openstack/nova/blob/2e85453879533af0b4d0e1178797d26f026a9...
[3] https://docs.openstack.org/nova/latest/admin/availability-zones.html
Per the nova meeting today [1] it sounds like we're going to go with option 3 and do the validation in the API: check the HostMapping for the host, check placement for the node; we can optimize away the redundant scheduler calculation for in_tree later. For review and test sanity I ask that the API validation code come in a separate patch in the series.
+1 on option 3. For more optimization, can we skip b) and c) for the non-baremetal case, assuming that if there is a HostMapping then the node will also be valid? -gmann
[1] http://eavesdrop.openstack.org/meetings/nova/2019/nova.2019-05-23-21.00.log....
--
Thanks,
Matt
On 6/6/2019 7:53 AM, gmann@ghanshyammann.com wrote:
+1 on option 3. For more optimization, can we skip b) and c) for the non-baremetal case, assuming that if there is a HostMapping then the node will also be valid?
You won't know it's a baremetal node until you get the ComputeNode object and check the hypervisor_type, and at that point you've already validated that it exists.

--

Thanks,

Matt
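[Editor's note: a minimal sketch of the chicken-and-egg Matt is pointing out.]

def is_baremetal(compute_node):
    # 'ironic' is the hypervisor_type nova reports for baremetal nodes.
    # The check requires the ComputeNode object itself, so by the time
    # we can run it, existence is already validated and there is
    # nothing left to skip.
    return compute_node.hypervisor_type == 'ironic'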