On 5/6/2019 1:44 PM, Eric Fried wrote:
Addendum: There's another implicit trait-based filter that bears mentioning: Excluding disabled compute hosts.
We have code that disables a compute service when "something goes wrong" in various ways. This code should decorate the compute node's resource provider with a COMPUTE_SERVICE_DISABLED trait, and every GET /allocation_candidates request should include ?required=!COMPUTE_SERVICE_DISABLED, so that we don't retrieve allocation candidates for disabled hosts.
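To make the filter concrete, here is a minimal sketch of how such a query string could be assembled. The helper name `allocation_candidates_query` and its parameters are hypothetical, not nova's scheduler client code; the trait name is the one proposed above. As far as I recall, forbidden traits (the `!` prefix) need placement microversion 1.22 or later.

```python
from urllib.parse import urlencode

# Trait name as proposed in this thread.
DISABLED_TRAIT = "COMPUTE_SERVICE_DISABLED"


def allocation_candidates_query(resources, required=(), forbidden=(DISABLED_TRAIT,)):
    """Build a GET /allocation_candidates path (hypothetical helper).

    resources: dict mapping resource class to amount, e.g. {"VCPU": 1}
    required:  traits the provider must have
    forbidden: traits the provider must not have (sent with a '!' prefix)
    """
    res = ",".join(f"{rc}:{amount}" for rc, amount in sorted(resources.items()))
    traits = list(required) + [f"!{trait}" for trait in forbidden]
    params = [("resources", res)]
    if traits:
        params.append(("required", ",".join(traits)))
    # Keep ':', ',' and '!' readable instead of percent-encoding them.
    return "/allocation_candidates?" + urlencode(params, safe=":,!")


print(allocation_candidates_query({"VCPU": 1, "MEMORY_MB": 512}))
```

Defaulting `forbidden` to the disabled trait means every caller gets the implicit filter unless it explicitly opts out, which matches the "every request should include" intent.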
mriedem has started to prototype the code for this [1].
Action: Spec to be written. Code to be polished up. Possibly aspiers to be involved in this bit as well.
efried
Here is the spec [1]. There are noted TODOs and quite a few alternatives listed, mostly alternatives to the proposed design and what's in my PoC.

One thing my PoC didn't cover was the service group API and it automatically reporting a service as up or down. I think that will have to be incorporated into this, but how best to do that without having this 'disabled' trait management everywhere might be tricky.

My PoC tries to make the compute the single place we manage the trait, but that's also problematic if we lose a race with the API to disable a compute before the compute dies, or if MQ drops the call, etc. We might need/want to hook into the update_available_resource periodic to heal / sync the trait if we have an issue like that, or on startup during upgrade, and we likely also need a CLI to sync the trait status manually, at least to aid with the upgrade.

Who knew that managing a status reporting daemon could be complicated (oh right everyone).

[1] https://review.opendev.org/#/c/657884/

--

Thanks,

Matt
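For illustration only, the heal / sync step mentioned above could boil down to something like the sketch below. The function `sync_disabled_trait` is hypothetical and is not taken from the PoC; it assumes the caller already has the provider's current traits and the service's disabled flag, and it leaves the actual placement update to the caller.

```python
# Trait name as proposed in this thread.
DISABLED_TRAIT = "COMPUTE_SERVICE_DISABLED"


def sync_disabled_trait(current_traits, service_disabled):
    """Return (new_traits, changed) so the trait mirrors the service status.

    Hypothetical helper: a periodic task or a CLI could call this and only
    write the traits back to placement when `changed` is True.
    """
    current = set(current_traits)
    desired = set(current)
    if service_disabled:
        desired.add(DISABLED_TRAIT)
    else:
        desired.discard(DISABLED_TRAIT)
    return desired, desired != current


# A disabled service whose provider is missing the trait gets healed.
traits, changed = sync_disabled_trait({"HW_CPU_X86_AVX2"}, service_disabled=True)
print(sorted(traits), changed)
```

Because the function is idempotent, running it from a periodic, on startup, and from a manual CLI would all converge on the same state regardless of which one lost the race.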