<div dir="ltr">So we ended up increasing both max_attempts and host_subset_size and it fixed our issue. Hooray. I think I saw a KB from Red Hat on that exact issue - but I can't find the link anymore...<div><br></div><div>Thank you to Sean and Melanie! :)</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, May 20, 2020 at 1:55 PM Sean Mooney <<a href="mailto:smooney@redhat.com">smooney@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Wed, 2020-05-20 at 11:32 -0400, Laurent Dumont wrote:<br>

> Hey Melanie, Sean,<br>

> <br>

> Thank you! That should cover most of our use cases. Is there any downside<br>

> to a "subset_size" that would be larger than the actual number of computes?<br>

> We have some env with 4 computes, and others with 100+.<br>

It will basically make the weigher irrelevant.<br>

When you use subset_size we select randomly from the first "subset_size" hosts in the list<br>

of hosts returned, so if subset_size is equal to or larger than the total number of hosts it will just be a random<br>

selection from the hosts that pass the filter/placement query.<br>

<br>

So you want subset_size to be proportionally small (an order of magnitude or two smaller) compared to the number of<br>

available hosts, and proportionally equivalent (within 1 order of magnitude or so) to your typical concurrent instance<br>

multi-create request.<br>

<br>

You want it to be small relative to the cloud so that the weighers remain statistically relevant,<br>

and similar to the size of the multi-create to make the probability of the same host being selected for an instance low.<br>
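
<br>
A rough sketch of that selection, not Nova's actual code: pick_host is a hypothetical helper, and the host list is assumed to be already sorted best-first by the weighers.<br>
With 4 computes, a subset size of 1 always returns the best host, while 100 degenerates into a uniform random pick over all 4.<br>

```python
import random
from collections import Counter


def pick_host(weighed_hosts, subset_size, rng):
    """Pick one host at random from the best `subset_size` weighed hosts.

    `weighed_hosts` is assumed to be sorted best-first (hypothetical input).
    """
    # Clamp so an oversized subset_size simply means "all hosts".
    subset_size = max(1, min(subset_size, len(weighed_hosts)))
    return rng.choice(weighed_hosts[:subset_size])


if __name__ == "__main__":
    rng = random.Random(0)
    hosts = [f"compute-{i}" for i in range(4)]  # best-weighed host first
    for size in (1, 2, 100):
        picks = Counter(pick_host(hosts, size, rng) for _ in range(1000))
        print(size, dict(picks))
```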

<br>

<br>

> <br>

> Laurent<br>

> <br>

> On Tue, May 19, 2020 at 7:33 PM Sean Mooney <<a href="mailto:smooney@redhat.com" target="_blank">smooney@redhat.com</a>> wrote:<br>

> <br>

> > On Tue, 2020-05-19 at 18:23 -0400, Laurent Dumont wrote:<br>

> > > Hey everyone,<br>

> > > <br>

> > > We are seeing a pretty consistent issue with Nova/Scheduler where some<br>

> > > instance creations are hitting the "max_attempts" limit of the scheduler.<br>

> > Well, the answer you are not going to like is that nova is working as expected,<br>

> > and we expect this to happen when you use multi-create. Placement helps reduce<br>

> > the issue, but there are some fundamental issues with how we do retries that<br>

> > make this hard to fix.<br>

> > <br>

> > I'm not going to go into the details right now as it's not helpful, but<br>

> > we have had queries about this from customers in the past, so fortunately I<br>

> > do have some recommendations I can share:<br>

> > <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1759545#c8" rel="noreferrer" target="_blank">https://bugzilla.redhat.com/show_bug.cgi?id=1759545#c8</a><br>

> > Well, now that I have made the comment public I can :)<br>

> > <br>

> > > <br>

> > > Env : Red Hat Queens<br>

> > > Computes : All the same hardware and specs (even weight throughout)<br>

> > > Nova : Three nova-schedulers<br>

> > > <br>

> > > This can be due to two different factors (from what we've seen) :<br>

> > > <br>

> > >    - Anti-affinity rules are getting triggered during the creation (two<br>

> > >    claims are done within a few milliseconds on the same compute) which counts<br>

> > >    as a retry (we've seen this when spawning 40+ VMs in a single server group<br>

> > >    with maybe 50-55 computes - or even fewer, 14 instances on 20ish computes).<br>

> > Yep, the only way to completely avoid this issue on queens (and, depending on<br>

> > what features you are using, on master)<br>

> > is to boot the VMs serially, waiting for each VM to spawn.<br>

> > <br>

> > >    - We've seen another case where MEMORY_MB becomes an issue (we are<br>

> > >    spinning new instances in the same host-aggregate where VMs are already<br>

> > >    running. Only one VM can run per compute but there are no anti-affinity<br>

> > >    groups to force that between the two deployments. The resource<br>

> > >    requirements prevent anything else from getting spun on those).<br>

> > >    - The logs look like the following :<br>

> > >       -  Unable to submit allocation for instance<br>

> > >       659ef90e-33b8-42a9-9c8e-fac87278240d (409 {"errors": [{"status": 409,<br>

> > >       "request_id": "req-429c2734-2f2d-4d2d-82d1-fa4ebe12c991", "detail": "There<br>

> > >       was a conflict when trying to complete your request.\n\n Unable to allocate<br>

> > >       inventory: Unable to create allocation for 'MEMORY_MB' on resource provider<br>

> > >       '35b78f3b-8e59-4f2f-8cad-eaf116b7c1c7'. The requested amount would exceed<br>

> > >       the capacity. ", "title": "Conflict"}]}) / Setting instance to ERROR<br>

> > >       state.: MaxRetriesExceeded: Exceeded maximum number of retries. Exhausted<br>

> > >       all hosts available for retrying build failures for instance<br>

> > >       f6d06cca-e9b5-4199-8220-e3ff2e5c2a41.<br>

> > <br>

> > In this case you are racing with other instances for that host.<br>

> > Basically, when doing a multi-create, if any VM fails to boot it will go to<br>

> > the alternate host list and try to create an allocation against<br>

> > the first host in that list.<br>

> > <br>

> > However, when the alternate host list was created none of the VMs had been<br>

> > spawned yet.<br>

> > By the time the retry arrives at the conductor, one of the other VMs could<br>

> > have been scheduled to that host, either<br>

> > as a first choice or because that other VM retried first and won the race.<br>

> > <br>

> > When this happens we then try the next host in the list, where we can race<br>

> > again.<br>

> > <br>

> > Since the retries happen at the cell conductor level without going back to<br>

> > the scheduler, we are not going to check<br>

> > the current status of the host using the anti-affinity filter or anti-affinity<br>

> > weigher during the retry, so while the host was<br>

> > valid initially it can be invalid by the time we try to use it as an alternate.<br>

> > <br>

> > The only way to fix that is to have retries not use alternate hosts and<br>

> > instead have each retry rerun the full scheduling<br>

> > process so that it can make a decision based on the current state of the<br>

> > server, not the old view.<br>
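
> > <br>
> > The race described above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not how Nova is implemented:<br>
> > every host fits exactly one instance, each instance gets its selected host plus alternates from the same stale "all hosts empty" view,<br>
> > and multi_create is a hypothetical helper that counts the instances that run out of hosts.<br>

```python
import random


def multi_create(num_instances, num_hosts, max_attempts, rng):
    """Return how many instances exhaust their host list.

    Toy model: each host fits one instance, and every instance receives
    its selected host plus alternates up front, all computed from the
    same stale view in which every host still looks empty.
    """
    hosts = list(range(num_hosts))
    attempts = min(max_attempts, num_hosts)
    # Host lists are fixed before any instance has been spawned.
    plans = [rng.sample(hosts, attempts) for _ in range(num_instances)]

    taken = set()
    pending = list(range(num_instances))
    for attempt in range(attempts):
        rng.shuffle(pending)  # arrival order of the claims is arbitrary
        still_pending = []
        for inst in pending:
            host = plans[inst][attempt]
            if host in taken:
                still_pending.append(inst)  # lost the race, try next host
            else:
                taken.add(host)
        pending = still_pending
    return len(pending)  # these would end in MaxRetriesExceeded


if __name__ == "__main__":
    rng = random.Random(42)
    runs = 200
    for max_attempts in (3, 5, 10):
        failed = sum(multi_create(40, 55, max_attempts, rng) for _ in range(runs))
        print(max_attempts, failed / runs)
```

> > In this model, raising max_attempts only gives each instance more stale alternates to race on, which is why booting serially is the only complete fix.<br>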

> > >    - I do believe we are hitting this issue as well :<br>

> > >    <a href="https://bugs.launchpad.net/nova/+bug/1837955" rel="noreferrer" target="_blank">https://bugs.launchpad.net/nova/+bug/1837955</a><br>

> > >       - In all the cases where the Stacks creation failed, one instance was<br>

> > >       left in the Build state for 120 minutes and then finally failed.<br>

> > > <br>

> > > From what we can gather, there are a couple of parameters that can be<br>

> > > tweaked.<br>

> > > <br>

> > >    1. host_subset_size (Return X number of host instead of 1?)<br>

> > >    2. randomize_allocation_candidates (Not 100% on this one)<br>

> > >    3. shuffle_best_same_weighed_hosts (Return a random one of X number of<br>

> > >    computes if they are all equal (instead of the same list for all<br>

> > >    scheduling requests))<br>

> > >    4. max_attempts (how many times the Scheduler will try to fit the<br>

> > >    instance somewhere)<br>

> > > <br>

> > > We've already raised "max_attempts" to 5 from the default of 3 and will<br>

> > > raise it further. That said, what are the recommendations for the rest of<br>

> > > the settings? We are not exactly concerned with stacking vs spreading (but<br>

> > > that's always nice) of the instances but rather making sure deployments<br>

> > > fail because of real reasons and not just because Nova/Scheduler keeps<br>

> > > stepping on its own toes.<br>

> > <br>

> > <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1759545#c8" rel="noreferrer" target="_blank">https://bugzilla.redhat.com/show_bug.cgi?id=1759545#c8</a><br>

> > has some suggestions,<br>

> > but tl;dr it should be safe to set max_attempts=10 if you set<br>

> > subset_size=15 shuffle_best_same_weighed_hosts=true.<br>

> > That said, I really would not put max_attempts over 10; max_attempts=5<br>

> > should be more than enough.<br>

> > subset_size=15 is a little bit arbitrary. The best value will depend on the<br>

> > typical size of your deployment and the<br>

> > size of your cloud. randomize_allocation_candidates helps if and only if<br>

> > you have limited the number of allocation<br>

> > candidates returned by placement to a subset of your cloud hosts.<br>

> > <br>

> > e.g. if you set the placement allocation candidate limit to 10 for a<br>

> > cloud with 100 hosts then you should set<br>

> > randomize_allocation_candidates=true so that you do not get a bias that<br>

> > will pack hosts based on the natural db order.<br>

> > The default limit for allocation candidates is 1000, so unless you have more<br>

> > than 1000 hosts or have changed that limit you<br>

> > do not need to set this.<br>
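
> > <br>
> > For reference, a small sketch that renders those values as a nova.conf-style fragment. The option group names ([scheduler], [filter_scheduler]) are an assumption about where these options live,<br>
> > host_subset_size is the full name of what this thread calls subset_size, and render_nova_conf is a hypothetical helper. Check the configuration reference for your release before copying anything.<br>

```python
import configparser
import io

# Values taken from the recommendation in this thread. The group names are
# an assumption about where these options live in nova.conf.
RECOMMENDED = {
    "scheduler": {"max_attempts": "5"},  # the thread suggests not going past 10
    "filter_scheduler": {
        "host_subset_size": "15",
        "shuffle_best_same_weighed_hosts": "true",
    },
}


def render_nova_conf(settings):
    """Render a nested dict of settings as an ini-style fragment."""
    parser = configparser.ConfigParser()
    parser.read_dict(settings)
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(render_nova_conf(RECOMMENDED))
```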

> > <br>

> > > <br>

> > > Thanks!<br>

> > <br>

> > <br>

<br>

</blockquote></div>