Hi all,
We’re running into an issue at two sites with 150-250 Ironic nodes on a single conductor and nova-compute instance: we’ve started to get “no hosts available” errors from the nova scheduler.
We’re using the blazar-nova filter to match on hosts in specifically tagged aggregates. After adding some debug logs, I found that the “host_state” object passed to the filter seems to have out-of-date aggregate information.
Specifically, if I query the system with “openstack aggregate show …” or “openstack allocation candidate list”, I see the correct aggregate for the nodes in question, but the contents of “host_state” reflect a previous state.
This “staleness” does not seem to correct itself over time, but it is resolved by restarting the nova-scheduler process (in practice we restart the Kolla Docker container, which has the same effect). However, the issue returns over the course of a couple of hours.
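The comparison described above, between the aggregates reported by the API and the aggregates seen in “host_state”, can be sketched roughly as follows. This is only an illustration under stated assumptions: find_stale_hosts is a hypothetical helper (not part of nova or blazar-nova), the host and aggregate names are made up, and the two dictionaries stand in for data that would have to be collected from the CLI output and from the filter's debug logs.

```python
# Hypothetical helper for illustration; not part of nova or blazar-nova.
def find_stale_hosts(api_view, scheduler_view):
    """Return {host: (expected, seen)} for hosts whose aggregate sets differ.

    api_view:       host name -> aggregate names as reported by the API
    scheduler_view: host name -> aggregate names logged from host_state
    """
    stale = {}
    for host, expected in api_view.items():
        seen = scheduler_view.get(host, set())
        if set(expected) != set(seen):
            stale[host] = (set(expected), set(seen))
    return stale


# Made-up sample data; real values would come from the CLI and debug logs.
api_view = {"node-1": {"reservation-a"}, "node-2": {"freepool"}}
scheduler_view = {"node-1": {"freepool"}, "node-2": {"freepool"}}

for host, (expected, seen) in sorted(find_stale_hosts(api_view, scheduler_view).items()):
    print(f"{host}: API reports {sorted(expected)}, host_state has {sorted(seen)}")
```

Running a check like this periodically after a scheduler restart might help pin down when the two views start to diverge.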
We haven’t increased the number of nodes, or otherwise changed the hardware, so I’m not sure what could have triggered this issue.
Any advice on further debugging steps would be greatly appreciated. Thank you!
--
Michael Sherman
Infrastructure Lead – Chameleon
Computer Science, University of Chicago
MCS, Argonne National Lab