On 11/21/2019 6:04 AM, Sean Mooney wrote:
i think the behavior might change if the max value exceeds the batch size. we group the requests in sets of 10 by default, so if all the vms in one batch go active and later vms in a different batch fail, the first vms will remain active. i can't remember which config option controls that but there is one. it's max concurrent builds or something like that.
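(For reference, the option Sean is reaching for is most likely max_concurrent_builds; a minimal nova.conf sketch, assuming that is the one and showing its usual default:)

    [DEFAULT]
    # limit on how many instance builds a single nova-compute service
    # will run at the same time; builds beyond this on that host wait
    # until a slot frees up
    max_concurrent_builds = 10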
That batch size option is per-compute. What Albert was hitting failed with NoValidHost in the scheduler, so the compute isn't involved.

What you're describing is likely legacy behavior where the scheduler said, "yup, sure, putting 20 instances on a few computes is probably OK", and then the builds raced to do the resource tracker (RT) claim on the compute: some failed late and went to ERROR while others went ACTIVE. That window was closed for vcpu/ram/disk claims in Pike, when the scheduler started using placement to create atomic resource allocation claims.

So if someone can reproduce this issue post-Pike with --max, where some instances go ACTIVE and some go ERROR in the same request, I'd be surprised. Doing that across *concurrent* requests I could understand, since the scheduler could be a bit split-brained there, but placement still would not be.

--

Thanks,

Matt
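(For reference, a rough sketch of the single-request scenario Matt describes; the image, flavor, and server name are placeholders:)

    # one API request asking for up to 20 instances
    openstack server create --image cirros --flavor m1.small \
        --min 1 --max 20 repro-vm

    # post-Pike, because the scheduler claims resources atomically in
    # placement, a request like this is expected to either get host
    # allocations for the instances it schedules or fail with
    # NoValidHost, rather than leave a mix of ACTIVE and ERROR
    # instances from the one request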