i would generaly say that it not a race although it is undefined behviaor.

 

Is just calling the “add host to aggregate API” concurrently, undefined behavior?

 

My sequence of operations is:

  1. Add N hosts to an aggregate concurrently
  2. Wait a while (minutes to hours), and verify that “aggregate show” lists the correct hosts in the aggregate
  3. Then attempt to schedule N compute instances, with a filter that checks host membership in the aggregate
  4. Observe scheduling failures

 

To rule out our code as the issue, I was able to reproduce the behavior using devstack on master, using the nova_fake driver with 10 fake compute services and the aggregate_instance_extra_specs filter instead of ironic and the blazar-nova filter.

 

So long as the N “add_host_to_aggregate” calls to nova_api are made in parallel, there’s a decent probability that the host_state aggregate info passed to the filters will not agree with the values in the DB.

 

This doesn’t depend on launching instances quickly after making the changes, the inconsistency does not seem to ever resolve until nova-scheduler is restarted.

 

-Mike