<div dir="ltr">Hey Melanie, Sean,<div><br></div><div>Thank you! That should cover most of our use cases. Is there any downside to a "subset_size" that would be larger than the actual number of computes? We have some environments with 4 computes, and others with 100+.</div><div><br></div><div>Laurent</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, May 19, 2020 at 7:33 PM Sean Mooney <<a href="mailto:smooney@redhat.com">smooney@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Tue, 2020-05-19 at 18:23 -0400, Laurent Dumont wrote:<br>
> Hey everyone,<br>
> <br>
> We are seeing a pretty consistent issue with Nova/Scheduler where some<br>
> instance creations are hitting the "max_attempts" limit of the scheduler.<br>
Well, the answer you are not going to like is that Nova is working as expected, and<br>
we expect this to happen when you use multi-create. Placement helps reduce the issue, but<br>
there are some fundamental issues with how we do retries that make this hard to<br>
fix.<br>
<br>
I'm not going to go into the details right now as that's not helpful, but<br>
we have had queries about this from customers in the past, so fortunately I do have some<br>
recommendations I can share: <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1759545#c8" rel="noreferrer" target="_blank">https://bugzilla.redhat.com/show_bug.cgi?id=1759545#c8</a><br>
Well, now that I have made the comment public, I can :)<br>
<br>
> <br>
> Env : Red Hat Queens<br>
> Computes : All the same hardware and specs (even weight throughout)<br>
> Nova : Three nova-schedulers<br>
> <br>
> This can be due to two different factors (from what we've seen) :<br>
> <br>
> - Anti-affinity rules are getting triggered during the creation (two<br>
> claims are done within a few milliseconds on the same compute), which counts<br>
> as a retry (we've seen this when spawning 40+ VMs in a single server group<br>
> with maybe 50-55 computes - or even fewer: 14 instances on 20ish computes).<br>
Yep, the only way to completely avoid this issue on Queens (and, depending on what features you are using, on master)<br>
is to boot the VMs serially, waiting for each VM to spawn.<br>
<br>
> - We've seen another case where MEMORY_MB becomes an issue (we are<br>
> spinning new instances in the same host-aggregate where VMs are already<br>
> running. Only one VM can run per compute but there are no anti-affinity<br>
> groups to force that between the two deployments. The resource<br>
> requirements prevent anything else from getting spun on those).<br>
> - The logs look like the following :<br>
> - Unable to submit allocation for instance<br>
> 659ef90e-33b8-42a9-9c8e-fac87278240d (409 {"errors": [{"status": 409,<br>
> "request_id": "req-429c2734-2f2d-4d2d-82d1-fa4ebe12c991",<br>
> "detail": "There<br>
> was a conflict when trying to complete your request.\n\n Unable<br>
> to allocate<br>
> inventory: Unable to create allocation for 'MEMORY_MB' on<br>
> resource provider<br>
> '35b78f3b-8e59-4f2f-8cad-eaf116b7c1c7'. The requested amount would exceed<br>
> the capacity. ", "title": "Conflict"}]}) / Setting instance to ERROR<br>
> state.: MaxRetriesExceeded: Exceeded maximum number of retries. Exhausted<br>
> all hosts available for retrying build failures for instance<br>
> f6d06cca-e9b5-4199-8220-e3ff2e5c2a41.<br>
In this case you are racing with other instances for that host.<br>
Basically, when doing a multi-create, if any VM fails to boot it will go to the next<br>
host in the alternate host list and try to create an allocation against the first host in that list.<br>
<br>
However, when the alternate host list was created, none of the VMs had been spawned yet.<br>
By the time the retry arrives at the conductor, one of the other VMs could have been scheduled to that host, either<br>
as a first choice or because that other VM retried first and won the race.<br>
<br>
When this happens we then try the next host in the list, where we can race again.<br>
<br>
Since the retries happen at the cell conductor level, without going back to the scheduler, we are not going to check<br>
the current status of the host using the anti-affinity filter or anti-affinity weigher during the retry. So while a host was<br>
valid initially, it can be invalid by the time we try to use it as an alternate.<br>
<br>
The only way to fix that is to have retries not use alternate hosts, and instead have each retry re-run the full scheduling<br>
process so that it can make a decision based on the current state of the hosts, not the old view.<br>
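the race above can be sketched with a small self-contained toy model (a hypothetical illustration only, not Nova's actual code): every request gets its alternate-host list from the same scheduling-time snapshot, and the conductor-side retry walks that stale list first-come-first-served.

```python
# Toy model of the alternate-host retry race (illustration only,
# not Nova's implementation).

def schedule(requests, hosts, alternates=2):
    """Build per-request alternate-host lists from a single snapshot,
    before any VM has actually claimed capacity."""
    plans = {}
    for req in requests:
        # every request sees the same snapshot, so the lists overlap heavily
        plans[req] = hosts[:alternates + 1]
    return plans

def build(plans, capacity):
    """Conductor-side retry loop: walk the stale alternate list,
    claiming capacity first-come-first-served."""
    placed, failed = {}, []
    for req, candidates in plans.items():
        for host in candidates:
            if capacity[host] > 0:   # claim succeeds on this host
                capacity[host] -= 1
                placed[req] = host
                break
        else:
            failed.append(req)       # analogue of MaxRetriesExceeded
    return placed, failed

hosts = ["c1", "c2"]
capacity = {"c1": 1, "c2": 1}        # only one VM fits per compute
plans = schedule(["vm1", "vm2", "vm3"], hosts)
placed, failed = build(plans, capacity)
# vm3 exhausts its (stale) alternates and fails, even though each
# individual scheduling decision looked valid when it was made
```

with three VMs, two hosts, and room for one VM per host, the last VM to claim loses every race and lands in ERROR, which is the shape of the failure in the logs above.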
> - I do believe we are hitting this issue as well :<br>
> <a href="https://bugs.launchpad.net/nova/+bug/1837955" rel="noreferrer" target="_blank">https://bugs.launchpad.net/nova/+bug/1837955</a><br>
> - In all the cases where the Stacks creation failed, one instance was<br>
> left in the Build state for 120 minutes and then finally failed.<br>
> <br>
> From what we can gather, there are a couple of parameters that can be<br>
> tweaked.<br>
> <br>
> 1. host_subset_size (Return X number of hosts instead of 1?)<br>
> 2. randomize_allocation_candidates (Not 100% on this one)<br>
> 3. shuffle_best_same_weighed_hosts (Return a random one of X number of<br>
> computes if they are all equal (instead of the same list for all<br>
> scheduling requests))<br>
> 4. max_attempts (how many times the Scheduler will try to fit the<br>
> instance somewhere)<br>
> <br>
> We've already raised "max_attempts" to 5 from the default of 3 and will<br>
> raise it further. That said, what are the recommendations for the rest of<br>
> the settings? We are not exactly concerned with stacking vs spreading (but<br>
> that's always nice) of the instances but rather making sure deployments<br>
> fail because of real reasons and not just because Nova/Scheduler keeps<br>
> stepping on its own toes.<br>
<a href="https://bugzilla.redhat.com/show_bug.cgi?id=1759545#c8" rel="noreferrer" target="_blank">https://bugzilla.redhat.com/show_bug.cgi?id=1759545#c8</a><br>
has some suggestions<br>
but tl;dr: it should be safe to set max_attempts=10 if you set subset_size=15 and shuffle_best_same_weighed_hosts=true.<br>
That said, I really would not put max_attempts over 10; max_attempts=5 should be more than enough.<br>
subset_size=15 is a little bit arbitrary; the best value will depend on the typical size of your deployments and the<br>
size of your cloud. randomize_allocation_candidates helps if and only if you have limited the number of allocation<br>
candidates returned by Placement to a subset of your cloud's hosts.<br>
<br>
e.g. if you set the Placement allocation candidate limit to 10 for a cloud with 100 hosts, then you should set<br>
randomize_allocation_candidates=true so that you do not get a bias that will pack hosts based on the natural DB order.<br>
The default limit for allocation candidates is 1000, so unless you have more than 1000 hosts or have changed that<br>
limit, you do not need to set this.<br>
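for reference, a sketch of what that tuning could look like in nova.conf (section and option names as they exist in Queens; the values are the rough starting points from above, not tuned recommendations):

```ini
[scheduler]
# how many scheduling attempts (initial + alternates) a build gets
max_attempts = 10

[filter_scheduler]
# pick randomly from the best N hosts instead of always the single best
host_subset_size = 15
# break ties randomly between equally weighed hosts
shuffle_best_same_weighed_hosts = true

[placement]
# only useful if the allocation-candidate limit is below your host count;
# harmless but unnecessary otherwise
randomize_allocation_candidates = true
```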
<br>
> <br>
> Thanks!<br>
<br>
</blockquote></div>