[Nova][Scheduler] Reducing race conditions and re-scheduling during creation of multiple high-resource instances or instances with anti-affinity.

Laurent Dumont laurentfdumont at gmail.com
Wed May 20 15:32:03 UTC 2020


Hey Melanie, Sean,

Thank you! That should cover most of our use cases. Is there any downside
to a "subset_size" that would be larger than the actual number of computes?
We have some environments with 4 computes, and others with 100+.

Laurent

On Tue, May 19, 2020 at 7:33 PM Sean Mooney <smooney at redhat.com> wrote:

> On Tue, 2020-05-19 at 18:23 -0400, Laurent Dumont wrote:
> > Hey everyone,
> >
> > We are seeing a pretty consistent issue with Nova/Scheduler where some
> > instance creations are hitting the "max_attempts" limit of the
> > scheduler.
> Well, the answer you are not going to like is that nova is working as
> expected, and we expect this to happen when you use multi-create.
> Placement helps reduce the issue, but there are some fundamental issues
> with how we do retries that make this hard to fix.
>
> I'm not going to go into the details right now as it's not helpful, but
> we have had queries about this from customers in the past, so fortunately
> I do have some recommendations I can share:
> https://bugzilla.redhat.com/show_bug.cgi?id=1759545#c8
> (well, now that I have made the comment public, I can :))
>
> >
> > Env : Red Hat Queens
> > Computes : All the same hardware and specs (even weight throughout)
> > Nova : Three nova-schedulers
> >
> > This can be due to two different factors (from what we've seen):
> >
> >    - Anti-affinity rules are getting triggered during the creation (two
> >    claims are done within a few milliseconds on the same compute), which
> >    counts as a retry (we've seen this when spawning 40+ VMs in a single
> >    server group with maybe 50-55 computes - or even fewer, 14 instances
> >    on 20ish computes).
> Yep, the only way to completely avoid this issue on Queens (and,
> depending on what features you are using, on master) is to boot the VMs
> serially, waiting for each VM to spawn.
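>
> As a minimal sketch of what I mean, using openstacksdk (the IDs and the
> "mycloud" clouds.yaml entry below are placeholders, so adapt to your
> own environment):
>
>     import openstack
>
>     # Placeholders for your own image/flavor/network and the
>     # anti-affinity server group.
>     IMAGE_ID = "..."
>     FLAVOR_ID = "..."
>     NETWORK_ID = "..."
>     GROUP_ID = "..."
>
>     conn = openstack.connect(cloud="mycloud")
>
>     for i in range(14):
>         server = conn.compute.create_server(
>             name="vm-%02d" % i,
>             image_id=IMAGE_ID,
>             flavor_id=FLAVOR_ID,
>             networks=[{"uuid": NETWORK_ID}],
>             scheduler_hints={"group": GROUP_ID},
>         )
>         # Block until this VM is ACTIVE before requesting the next
>         # one, so each scheduling decision sees the claims made by
>         # all the previous instances.
>         conn.compute.wait_for_server(server, status="ACTIVE", wait=600)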
>
> >    - We've seen another case where MEMORY_MB becomes an issue (we are
> >    spinning up new instances in the same host-aggregate where VMs are
> >    already running. Only one VM can run per compute, but there are no
> >    anti-affinity groups to force that between the two deployments. The
> >    resource requirements prevent anything else from getting spun up on
> >    those).
> >    - The logs look like the following:
> >       - Unable to submit allocation for instance
> >       659ef90e-33b8-42a9-9c8e-fac87278240d (409 {"errors": [{"status":
> >       409, "request_id": "req-429c2734-2f2d-4d2d-82d1-fa4ebe12c991",
> >       "detail": "There was a conflict when trying to complete your
> >       request.\n\n Unable to allocate inventory: Unable to create
> >       allocation for 'MEMORY_MB' on resource provider
> >       '35b78f3b-8e59-4f2f-8cad-eaf116b7c1c7'. The requested amount would
> >       exceed the capacity. ", "title": "Conflict"}]}) / Setting instance
> >       to ERROR state.: MaxRetriesExceeded: Exceeded maximum number of
> >       retries. Exhausted all hosts available for retrying build failures
> >       for instance f6d06cca-e9b5-4199-8220-e3ff2e5c2a41.
> In this case you are racing with the other instances for that host.
> Basically, when doing a multi-create, if any VM fails to boot it will go
> to the next host in its alternate host list and try to create an
> allocation against it.
>
> However, when the alternate host list was created, none of the VMs had
> been spawned yet. By the time the retry arrives at the conductor, one of
> the other VMs could have been scheduled to that host, either as a first
> choice or because that other VM retried first and won the race.
>
> When this happens we then try the next host in the list, where we can
> race again.
>
> Since the retries happen at the cell conductor level, without going back
> to the scheduler, we do not re-check the current state of the host with
> the anti-affinity filter or anti-affinity weigher during the retry. So
> while the alternate host was valid initially, it can be invalid by the
> time we try to use it.
>
> The only way to fix that is to have retries not use alternate hosts, and
> instead have each retry re-run the full scheduling process so that it
> can make a decision based on the current state of the hosts, not the old
> view.
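>
> To make that concrete, here is a rough sketch of the shape of the
> problem (illustrative pseudocode, not nova's actual conductor code):
>
>     def build_instance(instance, alternates):
>         # "alternates" was computed once, at schedule time; nothing
>         # below ever consults the scheduler again, so every claim can
>         # race with the other instances being created in parallel.
>         for host in alternates:
>             if claim_resources(instance, host):  # placement call, may 409
>                 return spawn(instance, host)
>             # Stale alternate: another instance won the race; try next.
>         raise MaxRetriesExceeded(instance)
>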
> >    - I do believe we are hitting this issue as well:
> >    https://bugs.launchpad.net/nova/+bug/1837955
> >       - In all the cases where the stack creation failed, one instance
> >       was left in the BUILD state for 120 minutes and then finally
> >       failed.
> >
> > From what we can gather, there are a couple of parameters that can be
> > tweaked:
> >
> >    1. host_subset_size (return X hosts instead of 1?)
> >    2. randomize_allocation_candidates (not 100% sure on this one)
> >    3. shuffle_best_same_weighed_hosts (return a random one of the X
> >    computes if they are all equal, instead of the same list for all
> >    scheduling requests)
> >    4. max_attempts (how many times the scheduler will try to fit the
> >    instance somewhere)
> >
> > We've already raised "max_attempts" to 5 from the default of 3 and will
> > raise it further. That said, what are the recommendations for the rest
> > of the settings? We are not exactly concerned with stacking vs
> > spreading of the instances (but that's always nice), but rather with
> > making sure deployments fail for real reasons and not just because
> > Nova/Scheduler keeps stepping on its own toes.
> https://bugzilla.redhat.com/show_bug.cgi?id=1759545#c8 has some
> suggestions, but tl;dr it should be safe to set max_attempts=10 if you
> set subset_size=15 and shuffle_best_same_weighed_hosts=true.
> That said, I really would not put max_attempts over 10; max_attempts=5
> should be more than enough.
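>
> For reference, those options live in nova.conf on your scheduler nodes,
> something like the below (section names from memory, so double-check
> them against the Queens config reference):
>
>     [scheduler]
>     max_attempts = 10
>
>     [filter_scheduler]
>     host_subset_size = 15
>     shuffle_best_same_weighed_hosts = true
>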
> subset_size=15 is a little bit arbitrary; the best value will depend on
> the typical size of your deployments and the size of your cloud.
> randomize_allocation_candidates helps if and only if you have limited
> the number of allocation candidates returned by placement to a subset of
> your cloud hosts.
>
> E.g. if you set the placement allocation candidate limit to 10 for a
> cloud with 100 hosts, then you should set
> randomize_allocation_candidates=true so that you do not get a bias that
> will pack hosts based on the natural DB order.
> The default limit for allocation candidates is 1000, so unless you have
> more than 1000 hosts or have changed that limit, you do not need to set
> this.
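>
> If it helps, I believe (again from memory, so verify for your release)
> the relevant knobs on Queens are:
>
>     # nova.conf on the scheduler: cap on the number of allocation
>     # candidates requested from placement (default 1000).
>     [scheduler]
>     max_placement_results = 10
>
>     # placement service config: shuffle equally-ordered candidates.
>     [placement]
>     randomize_allocation_candidates = true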
>
> >
> > Thanks!
>
>

