[Nova][Scheduler] Reducing race conditions and re-scheduling during creation of multiple high-resource instances or instances with anti-affinity.

Laurent Dumont laurentfdumont at gmail.com
Mon May 25 16:47:27 UTC 2020


So we ended up increasing both max_attempts and host_subset_size and it
fixed our issue. Hooray. I think I saw a KB from Red Hat on that exact
issue - but I can't find the link anymore...
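
For anyone finding this thread later, the change boiled down to bumping
two nova.conf options. The values below are illustrative rather than our
exact ones, and the section names are from memory, so double-check them
against the config reference for your release:

  [scheduler]
  # allow more scheduling/claim attempts before the build goes to ERROR
  max_attempts = 5

  [filter_scheduler]
  # pick randomly among the top N weighed hosts instead of always the best one
  host_subset_size = 10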

Thank you to Sean and Melanie! :)

On Wed, May 20, 2020 at 1:55 PM Sean Mooney <smooney at redhat.com> wrote:

> On Wed, 2020-05-20 at 11:32 -0400, Laurent Dumont wrote:
> > Hey Melanie, Sean,
> >
> > Thank you! That should cover most of our use cases. Is there any
> > downside to a "subset_size" that would be larger than the actual
> > number of computes? We have some envs with 4 computes, and others
> > with 100+.
> It will basically make the weigher irrelevant.
> When you use subset_size we select randomly from the first
> "subset_size" hosts in the list of hosts returned, so if subset_size
> is equal to or larger than the total number of hosts it will just be
> a random selection from the hosts that pass the filter/placement query.
>
> So you want subset_size to be proportionally small (an order of
> magnitude or two smaller) compared to the number of available hosts,
> and proportionally equivalent (within one order of magnitude or so)
> to your typical concurrent multi-create request.
>
> You want it small relative to the cloud so that the weigher remains
> statistically relevant, and similar to the size of the multi-create
> so that the probability of the same host being selected for two
> instances in the batch is low.
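>
> As a purely illustrative example: on a cloud with roughly 100 computes
> where a typical multi-create is around 10 instances, something along
> these lines would fit that guidance (verify the section name for your
> release):
>
>   [filter_scheduler]
>   # about the size of a typical multi-create batch, and still an order
>   # of magnitude below the ~100 available hosts
>   host_subset_size = 10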
>
>
> >
> > Laurent
> >
> > On Tue, May 19, 2020 at 7:33 PM Sean Mooney <smooney at redhat.com> wrote:
> >
> > > On Tue, 2020-05-19 at 18:23 -0400, Laurent Dumont wrote:
> > > > Hey everyone,
> > > >
> > > > We are seeing a pretty consistent issue with Nova/Scheduler where
> > > > some instance creations are hitting the "max_attempts" limit of
> > > > the scheduler.
> > > Well, the answer you are not going to like is that nova is working
> > > as expected, and we expect this to happen when you use multi-create.
> > > Placement helps reduce the issue, but there are some fundamental
> > > issues with how we do retries that make this hard to fix.
> > >
> > > I'm not going to go into the details right now as it's not helpful,
> > > but we have had queries about this from customers in the past, so
> > > fortunately I do have some recommendations I can share:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1759545#c8
> > > Well, now that I have made that comment public, I can. :)
> > >
> > > >
> > > > Env: Red Hat Queens
> > > > Computes: All the same hardware and specs (even weight throughout)
> > > > Nova: Three nova-schedulers
> > > >
> > > > This can be due to two different factors (from what we've seen):
> > > >
> > > >    - Anti-affinity rules are getting triggered during the creation
> > > >    (two claims are done within a few milliseconds on the same
> > > >    compute), which counts as a retry (we've seen this when spawning
> > > >    40+ VMs in a single server group with maybe 50-55 computes - or
> > > >    even fewer: 14 instances on 20-ish computes).
> > > Yep, the only way to completely avoid this issue on queens (and,
> > > depending on what features you are using, on master) is to boot the
> > > VMs serially, waiting for each VM to spawn.
> > >
> > > >    - We've seen another case where MEMORY_MB becomes an issue (we
> > > >    are spinning new instances in the same host-aggregate where VMs
> > > >    are already running. Only one VM can run per compute, but there
> > > >    are no anti-affinity groups to force that between the two
> > > >    deployments. The resource requirements prevent anything else
> > > >    from getting spun on those).
> > > >    - The logs look like the following:
> > > >       - Unable to submit allocation for instance
> > > >       659ef90e-33b8-42a9-9c8e-fac87278240d (409 {"errors":
> > > >       [{"status": 409, "request_id":
> > > >       "req-429c2734-2f2d-4d2d-82d1-fa4ebe12c991", "detail": "There
> > > >       was a conflict when trying to complete your request.\n\n
> > > >       Unable to allocate inventory: Unable to create allocation for
> > > >       'MEMORY_MB' on resource provider
> > > >       '35b78f3b-8e59-4f2f-8cad-eaf116b7c1c7'. The requested amount
> > > >       would exceed the capacity. ", "title": "Conflict"}]}) /
> > > >       Setting instance to ERROR state.: MaxRetriesExceeded:
> > > >       Exceeded maximum number of retries. Exhausted all hosts
> > > >       available for retrying build failures for instance
> > > >       f6d06cca-e9b5-4199-8220-e3ff2e5c2a41.
> > >
> > > In this case you are racing with the other instances for that host.
> > > Basically, when doing a multi-create, if any VM fails to boot it
> > > will go to the next host in the alternate host list and try to
> > > create an allocation against the first host in that list.
> > >
> > > However, when the alternate host list was created, none of the VMs
> > > had been spawned yet. By the time the retry arrives at the
> > > conductor, one of the other VMs could have been scheduled to that
> > > host, either as a first choice or because that other VM retried
> > > first and won the race.
> > >
> > > When this happens we then try the next host in the list, where we
> > > can race again.
> > >
> > > Since the retries happen at the cell conductor level without going
> > > back to the scheduler, we do not re-check the current state of the
> > > host with the anti-affinity filter or anti-affinity weigher during
> > > the retry, so while the alternate host was valid initially, it can
> > > be invalid by the time we try to use it.
> > >
> > > The only way to fix that is to have retries not use alternate hosts
> > > and instead have each retry go through the full scheduling process,
> > > so that it can make a decision based on the current state of the
> > > hosts, not the old view.
> > > >    - I do believe we are hitting this issue as well:
> > > >    https://bugs.launchpad.net/nova/+bug/1837955
> > > >       - In all the cases where the Stacks creation failed, one
> > > >       instance was left in the BUILD state for 120 minutes and
> > > >       then finally failed.
> > > >
> > > > From what we can gather, there are a couple of parameters that can
> > > > be tweaked:
> > > >
> > > >    1. host_subset_size (return X number of hosts instead of 1?)
> > > >    2. randomize_allocation_candidates (not 100% sure on this one)
> > > >    3. shuffle_best_same_weighed_hosts (pick a random compute among
> > > >    equally weighed computes, instead of the same one from the list
> > > >    for every scheduling request)
> > > >    4. max_attempts (how many times the scheduler will try to fit
> > > >    the instance somewhere)
> > > >
> > > > We've already raised "max_attempts" to 5 from the default of 3 and
> > > > will raise it further. That said, what are the recommendations for
> > > > the rest of the settings? We are not exactly concerned with
> > > > stacking vs spreading of the instances (but that's always nice),
> > > > but rather with making sure deployments fail for real reasons and
> > > > not just because Nova/Scheduler keeps stepping on its own toes.
> > >
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1759545#c8
> > > has some suggestions, but tl;dr it should be safe to set
> > > max_attempts=10 if you also set subset_size=15 and
> > > shuffle_best_same_weighed_hosts=true. That said, I really would not
> > > put max_attempts over 10; max_attempts=5 should be more than enough.
> > > subset_size=15 is a little bit arbitrary: the best value will depend
> > > on the typical size of your deployments and the size of your cloud.
> > > randomize_allocation_candidates helps if and only if you have
> > > limited the number of allocation candidates returned by placement to
> > > a subset of your cloud hosts.
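> > >
> > > As a rough sketch (values illustrative, section names from memory,
> > > so double-check them against the Queens config reference), that
> > > combination would look something like this in nova.conf:
> > >
> > >   [scheduler]
> > >   # how many scheduling/claim attempts before the build goes to ERROR
> > >   max_attempts = 10
> > >
> > >   [filter_scheduler]
> > >   # pick randomly among the top 15 weighed hosts rather than the single best
> > >   host_subset_size = 15
> > >   # break ties randomly between hosts with identical weights
> > >   shuffle_best_same_weighed_hosts = true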
> > >
> > > E.g. if you set the placement allocation candidate limit to 10 for
> > > a cloud with 100 hosts, then you should set
> > > randomize_allocation_candidates=true so that you do not get a bias
> > > that packs hosts based on the natural DB order.
> > > The default limit for allocation candidates is 1000, so unless you
> > > have more than 1000 hosts or have changed that limit, you do not
> > > need to set this.
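> > >
> > > For illustration only (option names from memory, worth verifying
> > > for your release): the allocation candidate limit is the
> > > scheduler's max_placement_results option and the randomization is a
> > > placement-side option, so that example would look roughly like:
> > >
> > >   [scheduler]
> > >   # only ask placement for 10 allocation candidates per request
> > >   max_placement_results = 10
> > >
> > >   [placement]
> > >   # shuffle the returned candidates so the limit does not pack
> > >   # hosts in natural DB order
> > >   randomize_allocation_candidates = true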
> > >
> > > >
> > > > Thanks!
> > >
> > >
>
>