[openstack-dev] [nova] Bug 1781710 killing the check queue

Matt Riedemann mriedemos at gmail.com
Wed Jul 18 16:14:26 UTC 2018

As can be seen from logstash [1], this bug is hurting us pretty badly in
the check queue.

I originally thought I had this fixed with [2], but that turned out to
be only part of the issue.

I think I've identified the problem, but I have failed to write a
regression test that recreates it [3]. I think that's because the bug
is due to the random ordering of which request spec we select to send
to the scheduler during a multi-create request. I tried making that
predictable by sorting the instances by uuid in both conductor and the
scheduler, but that didn't make a difference in my test.

I started with one fix yesterday [4], but that would regress an earlier
fix for resizing servers in an anti-affinity group to the same host. If
we go that route, it will involve changes to how we handle
RequestSpec.num_instances (either not persisting it, or resetting it
during move operations).

After talking with Sean Mooney, we have another fix which is contained
entirely within the scheduler [5], so we wouldn't need to make any
changes to the RequestSpec handling in conductor. It's admittedly a bit
hairy, so I'm asking for some eyes on it. Whichever way we go, we should
get going soon, before we hit the FF and RC1 rush which *always* kills
the gate.

[1] http://status.openstack.org/elastic-recheck/index.html#1781710
[2] https://review.openstack.org/#/c/582976/
[3] https://review.openstack.org/#/c/583339
[4] https://review.openstack.org/#/c/583351
[5] https://review.openstack.org/#/c/583347
