Hi all,

We’re running into an issue at two sites that run 150-250 Ironic nodes on a single conductor and nova-compute instance: we’ve started to get “no hosts available” errors from the nova scheduler.

We’re using the blazar-nova filter to match on hosts in specifically tagged aggregates. After adding some debug logs, I found that the “host_state” object passed to the filter seems to have out-of-date aggregate information.

Specifically, if I query the system with “openstack aggregate show …” or “openstack allocation candidate list”, I see the correct aggregate for the nodes in question, but the contents of “host_state” reflect an earlier state.
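
As a rough illustration of the comparison described above, the following Python sketch diffs the aggregates held on a “host_state”-like object against the aggregate names reported by the API. It is not the actual blazar-nova filter code: the “host_state” object is replaced by a stand-in, the helper names aggregate_names and diff_aggregates are hypothetical, the host and aggregate names are made up, and it assumes each aggregate exposes a “name” attribute.

```python
from types import SimpleNamespace


def aggregate_names(host_state):
    """Return the sorted aggregate names attached to a host_state-like object."""
    # Assumption: host_state.aggregates is a list of objects with a .name attribute.
    return sorted(agg.name for agg in getattr(host_state, "aggregates", []))


def diff_aggregates(host_state, api_aggregates):
    """Compare scheduler-side aggregates with the names reported by the API."""
    cached = set(aggregate_names(host_state))
    expected = set(api_aggregates)
    return {
        "only_in_scheduler": sorted(cached - expected),
        "only_in_api": sorted(expected - cached),
    }


# Stand-in for the object passed to the filter; all values are illustrative.
stale = SimpleNamespace(
    host="compute-1",
    nodename="node-a",
    aggregates=[SimpleNamespace(name="freepool")],
)

# Names as they might be collected by hand from "openstack aggregate show".
result = diff_aggregates(stale, ["reservation-1234"])
print(result)
```

Logging a diff like this from inside the filter, keyed by node name, would show which nodes have drifted and in which direction.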

This “staleness” does not seem to correct itself over time, but restarting the nova-scheduler process resolves it (in practice we restart the Kolla Docker container, which has the same effect). However, the issue returns within a couple of hours.

We haven’t increased the number of nodes, or otherwise changed the hardware, so I’m not sure what could have triggered this issue.

Any advice on further debugging steps would be greatly appreciated. Thank you!

-- 

Michael Sherman
Infrastructure Lead – Chameleon
Computer Science, University of Chicago
MCS, Argonne National Lab