<div dir="ltr">Hey Melanie, Sean,<div><br></div><div>Thank you! That should cover most of our use cases. Is there any downside to a "subset_size" that would be larger than the actual number of computes? We have some environments with 4 computes, and others with 100+.</div><div><br></div><div>Laurent</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, May 19, 2020 at 7:33 PM Sean Mooney <<a href="mailto:smooney@redhat.com">smooney@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Tue, 2020-05-19 at 18:23 -0400, Laurent Dumont wrote:<br>
> Hey everyone,<br>
> <br>
> We are seeing a pretty consistent issue with Nova/Scheduler where some<br>
> instance creations are hitting the "max_attempts" limit of the scheduler.<br>
Well, the answer you are not going to like is that Nova is working as expected, and<br>
we expect this to happen when you use multi-create. Placement helps reduce the issue, but<br>
there are some fundamental issues with how we do retries that make this hard to<br>
fix.<br>
<br>
I'm not going to go into the details right now as that's not helpful, but<br>
we have had queries about this from customers in the past, so fortunately I do have some<br>
recommendations I can share: <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1759545#c8" rel="noreferrer" target="_blank">https://bugzilla.redhat.com/show_bug.cgi?id=1759545#c8</a><br>
Well, now that I have made the comment public, I can :)<br>
<br>
> <br>
> Env : Red Hat Queens<br>
> Computes : All the same hardware and specs (even weight throughout)<br>
> Nova : Three nova-schedulers<br>
> <br>
> This can be due to two different factors (from what we've seen) :<br>
> <br>
> - Anti-affinity rules are getting triggered during the creation (two<br>
> claims are done within a few milliseconds on the same compute), which counts<br>
> as a retry (we've seen this when spawning 40+ VMs in a single server group<br>
> with maybe 50-55 computes - or even fewer: 14 instances on 20ish computes).<br>
Yep, the only way to completely avoid this issue on Queens (and, depending on what features you are using, on master)<br>
is to boot the VMs serially, waiting for each VM to spawn.<br>
<br>
> - We've seen another case where MEMORY_MB becomes an issue (we are<br>
> spinning new instances in the same host-aggregate where VMs are already<br>
> running. Only one VM can run per compute but there are no anti-affinity<br>
> groups to force that between the two deployments. The resource<br>
> requirements prevent anything else from getting spun on those).<br>
> - The logs look like the following :<br>
> - Unable to submit allocation for instance<br>
> 659ef90e-33b8-42a9-9c8e-fac87278240d (409 {"errors": [{"status": 409,<br>
> "request_id": "req-429c2734-2f2d-4d2d-82d1-fa4ebe12c991",<br>
> "detail": "There<br>
> was a conflict when trying to complete your request.\n\n Unable<br>
> to allocate<br>
> inventory: Unable to create allocation for 'MEMORY_MB' on<br>
> resource provider<br>
> '35b78f3b-8e59-4f2f-8cad-eaf116b7c1c7'. The requested amount would exceed<br>
> the capacity. ", "title": "Conflict"}]}) / Setting instance to ERROR<br>
> state.: MaxRetriesExceeded: Exceeded maximum number of retries. Exhausted<br>
> all hosts available for retrying build failures for instance<br>
> f6d06cca-e9b5-4199-8220-e3ff2e5c2a41.<br>
In this case you are racing with other instances for that host.<br>
Basically, when doing a multi-create, if any VM fails to boot it will go to the next<br>
host in the alternate host list and try to create an allocation against the first host in that list.<br>
<br>
However, when the alternate host list was created, none of the VMs had been spawned yet.<br>
By the time the retry arrives at the conductor, one of the other VMs could have been scheduled to that host, either<br>
as a first choice or because that other VM retried first and won the race.<br>
<br>
When this happens we then try the next host in the list, where we can race again.<br>
<br>
Since the retries happen at the cell conductor level, without going back to the scheduler, we are not going to check<br>
the current status of the host using the anti-affinity filter or anti-affinity weigher during the retry. So while a host was<br>
valid initially, it can be invalid by the time we try to use it as an alternate.<br>
<br>
The only way to fix that is to have retries not use alternate hosts, and instead have each retry re-run the full scheduling<br>
process so that it can make a decision based on the current state of the hosts, not the old view.<br>
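the race above can be sketched with a small self-contained toy model (a hypothetical illustration only, not Nova's actual code): every request gets its alternate-host list from the same scheduling-time snapshot, and the conductor-side retry walks that stale list first-come-first-served.

```python
# Toy model of the alternate-host retry race (illustration only,
# not Nova's implementation).

def schedule(requests, hosts, alternates=2):
    """Build per-request alternate-host lists from a single snapshot,
    before any VM has actually claimed capacity."""
    plans = {}
    for req in requests:
        # every request sees the same snapshot, so the lists overlap heavily
        plans[req] = hosts[:alternates + 1]
    return plans

def build(plans, capacity):
    """Conductor-side retry loop: walk the stale alternate list,
    claiming capacity first-come-first-served."""
    placed, failed = {}, []
    for req, candidates in plans.items():
        for host in candidates:
            if capacity[host] > 0:   # claim succeeds on this host
                capacity[host] -= 1
                placed[req] = host
                break
        else:
            failed.append(req)       # analogue of MaxRetriesExceeded
    return placed, failed

hosts = ["c1", "c2"]
capacity = {"c1": 1, "c2": 1}        # only one VM fits per compute
plans = schedule(["vm1", "vm2", "vm3"], hosts)
placed, failed = build(plans, capacity)
# vm3 exhausts its (stale) alternates and fails, even though each
# individual scheduling decision looked valid when it was made
```

with three VMs, two hosts, and room for one VM per host, the last VM to claim loses every race and lands in ERROR, which is the shape of the failure in the logs above.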
> - I do believe we are hitting this issue as well :<br>
> <a href="https://bugs.launchpad.net/nova/+bug/1837955" rel="noreferrer" target="_blank">https://bugs.launchpad.net/nova/+bug/1837955</a><br>
> - In all the cases where the Stacks creation failed, one instance was<br>
> left in the Build state for 120 minutes and then finally failed.<br>
> <br>
> From what we can gather, there are a couple of parameters that can be<br>
> tweaked.<br>
> <br>
> 1. host_subset_size (Return X number of hosts instead of 1?)<br>
> 2. randomize_allocation_candidates (Not 100% on this one)<br>
> 3. shuffle_best_same_weighed_hosts (Return a random one of X number of<br>
> computes if they are all equal (instead of the same list for all<br>
> scheduling requests))<br>
> 4. max_attempts (how many times the Scheduler will try to fit the<br>
> instance somewhere)<br>
> <br>
> We've already raised "max_attempts" to 5 from the default of 3 and will<br>
> raise it further. That said, what are the recommendations for the rest of<br>
> the settings? We are not exactly concerned with stacking vs spreading (but<br>
> that's always nice) of the instances but rather making sure deployments<br>
> fail because of real reasons and not just because Nova/Scheduler keeps<br>
> stepping on its own toes.<br>
<a href="https://bugzilla.redhat.com/show_bug.cgi?id=1759545#c8" rel="noreferrer" target="_blank">https://bugzilla.redhat.com/show_bug.cgi?id=1759545#c8</a><br>
has some suggestions<br>
but tl;dr: it should be safe to set max_attempts=10 if you set subset_size=15 and shuffle_best_same_weighed_hosts=true.<br>
That said, I really would not put max_attempts over 10; max_attempts=5 should be more than enough.<br>
subset_size=15 is a little bit arbitrary; the best value will depend on the typical size of your deployments and the<br>
size of your cloud. randomize_allocation_candidates helps if and only if you have limited the number of allocation<br>
candidates returned by Placement to a subset of your cloud's hosts.<br>
<br>
e.g. if you set the Placement allocation candidate limit to 10 for a cloud with 100 hosts, then you should set<br>
randomize_allocation_candidates=true so that you do not get a bias that will pack hosts based on the natural DB order.<br>
The default limit for allocation candidates is 1000, so unless you have more than 1000 hosts or have changed that<br>
limit, you do not need to set this.<br>
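for reference, a sketch of what that tuning could look like in nova.conf (section and option names as they exist in Queens; the values are the rough starting points from above, not tuned recommendations):

```ini
[scheduler]
# how many scheduling attempts (initial + alternates) a build gets
max_attempts = 10

[filter_scheduler]
# pick randomly from the best N hosts instead of always the single best
host_subset_size = 15
# break ties randomly between equally weighed hosts
shuffle_best_same_weighed_hosts = true

[placement]
# only useful if the allocation-candidate limit is below your host count;
# harmless but unnecessary otherwise
randomize_allocation_candidates = true
```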
<br>
> <br>
> Thanks!<br>
<br>
</blockquote></div>