[Nova][Scheduler] Reducing race conditions and re-scheduling during creation of multiple high-resource instances or instances with anti-affinity
Hey everyone,

We are seeing a pretty consistent issue with Nova/Scheduler where some instance creations are hitting the "max_attempts" limit of the scheduler.

Env: Red Hat Queens
Computes: all the same hardware and specs (even weights throughout)
Nova: three nova-schedulers

From what we've seen, this can be due to two different factors:

- Anti-affinity rules are getting triggered during the creation (two claims are done within a few milliseconds on the same compute), which counts as a retry. We've seen this when spawning 40+ VMs in a single server group on maybe 50-55 computes, or even with as few as 14 instances on 20-ish computes.
- We've seen another case where MEMORY_MB becomes an issue: we are spinning up new instances in the same host aggregate where VMs are already running. Only one VM can run per compute, but there are no anti-affinity groups to force that between the two deployments; the resource requirements prevent anything else from getting spun up on those hosts.

The logs look like the following:

Unable to submit allocation for instance 659ef90e-33b8-42a9-9c8e-fac87278240d (409 {"errors": [{"status": 409, "request_id": "req-429c2734-2f2d-4d2d-82d1-fa4ebe12c991", "detail": "There was a conflict when trying to complete your request.\n\n Unable to allocate inventory: Unable to create allocation for 'MEMORY_MB' on resource provider '35b78f3b-8e59-4f2f-8cad-eaf116b7c1c7'. The requested amount would exceed the capacity. ", "title": "Conflict"}]}) / Setting instance to ERROR state.: MaxRetriesExceeded: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance f6d06cca-e9b5-4199-8220-e3ff2e5c2a41.

I believe we are also hitting this issue: https://bugs.launchpad.net/nova/+bug/1837955

In all the cases where the stack creation failed, one instance was left in the BUILD state for 120 minutes and then finally failed.
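For what it's worth, you can eyeball the capacity and usage of the provider from that 409 with the osc-placement plugin. Something along these lines (untested; command names per the osc-placement docs):

  # Inspect the resource provider that rejected the MEMORY_MB allocation
  # (UUID taken from the 409 payload above).
  RP=35b78f3b-8e59-4f2f-8cad-eaf116b7c1c7
  openstack resource provider inventory list "$RP"   # totals, reserved, allocation ratios
  openstack resource provider usage show "$RP"       # what is currently allocated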
From what we can gather, there are a couple of parameters that can be tweaked:
1. host_subset_size (return X hosts instead of 1?)
2. randomize_allocation_candidates (not 100% sure on this one)
3. shuffle_best_same_weighed_hosts (return a random pick among X computes when they are all weighed equally, instead of the same list for every scheduling request?)
4. max_attempts (how many times the scheduler will try to fit the instance somewhere)

We've already raised "max_attempts" to 5 from the default of 3 and will raise it further if needed. That said, what are the recommendations for the rest of the settings? We are not exactly concerned with stacking vs. spreading of the instances (though that's always nice), but rather with making sure deployments fail for real reasons and not just because Nova/Scheduler keeps stepping on its own toes.

Thanks!
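P.S. For reference, here is my reading of where those four options live in a Queens nova.conf, with their defaults (section names taken from the config reference; corrections welcome):

  [scheduler]
  # default: 3 (we are at 5 now)
  max_attempts = 5

  [filter_scheduler]
  # default: 1, i.e. always pick the single best-weighed host
  host_subset_size = 1
  # default: false
  shuffle_best_same_weighed_hosts = false

  [placement]
  # default: false
  randomize_allocation_candidates = false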
On 5/19/20 15:23, Laurent Dumont wrote:
> 1. host_subset_size (return X hosts instead of 1?)
> 2. randomize_allocation_candidates (not 100% sure on this one)
> 3. shuffle_best_same_weighed_hosts (return a random pick among X computes when they are all weighed equally?)
> 4. max_attempts (how many times the scheduler will try to fit the instance somewhere)
This is something I've written in the past related to the anti-affinity piece of what you're describing, which might be of help: https://bugzilla.redhat.com/show_bug.cgi?id=1780380#c4

Option (2) in your list only helps if you have > 1000 hosts in your deployment and you want to make sure resource provider candidates beyond the same first 1000 are regularly made available for scheduling (by randomizing before returning the top 1000). The placement API limits the maximum number of returned allocation candidates to 1000 for performance reasons.

Option (3) in your list only helps if you have lots of hosts being weighed equally and you need some randomization per exact weight to help prevent collisions. This usually comes up with requests for certain NUMA topologies, where many hosts end up weighted equally.

Hope this helps, -melanie
On 5/19/20 16:10, melanie witt wrote:
> Option (2) in your list only helps if you have > 1000 hosts in your deployment and you want to make sure resource provider candidates beyond the same first 1000 are regularly made available for scheduling (by randomizing before returning the top 1000). The placement API limits the maximum number of returned allocation candidates to 1000 for performance reasons.
And for reference, here is where the limit of 1000 results comes from; it is configurable: https://docs.openstack.org/nova/queens/configuration/config.html#scheduler.m...
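To make that concrete, here is roughly how the pair of settings would look in a Queens-era nova.conf. The option name is my reading of that (truncated) doc link, so verify it against the page itself:

  [scheduler]
  # Cap on the number of allocation candidates the scheduler requests
  # from placement -- the "1000 results" limit discussed above.
  max_placement_results = 1000

  [placement]
  # Only has an effect when the limit above actually truncates the
  # candidate list (i.e. you have more hosts than the limit).
  randomize_allocation_candidates = true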
Hope this helps, -melanie
On Tue, 2020-05-19 at 18:23 -0400, Laurent Dumont wrote:
> We are seeing a pretty consistent issue with Nova/Scheduler where some instance creations are hitting the "max_attempts" limit of the scheduler.

Well, the answer you are not going to like is that Nova is working as expected, and we expect this to happen when you use multi-create. Placement helps reduce the issue, but there are some fundamental issues with how we do retries that make this hard to fix. I'm not going to go into the details right now as they are not helpful here, but we have had queries about this from customers in the past, so fortunately I do have some recommendations I can share: https://bugzilla.redhat.com/show_bug.cgi?id=1759545#c8 (well, now that I have made the comment public, I can :)
> - Anti-affinity rules are getting triggered during the creation (two claims are done within a few milliseconds on the same compute), which counts as a retry. We've seen this when spawning 40+ VMs in a single server group on maybe 50-55 computes, or even 14 instances on 20-ish computes.
Yep. The only way to completely avoid this issue on Queens (and, depending on what features you are using, on master) is to boot the VMs serially, waiting for each VM to spawn.
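For example, something along these lines (illustrative names, untested):

  # Boot the group one instance at a time so each scheduling decision
  # sees the claims made by the previous one.
  GROUP=$(openstack server group create --policy anti-affinity grp -f value -c id)
  for i in $(seq 1 14); do
      openstack server create --flavor myflavor --image myimage \
          --network mynet --hint group="$GROUP" --wait "vm-$i"
  done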
> - We've seen another case where MEMORY_MB becomes an issue [...] Unable to allocate inventory: Unable to create allocation for 'MEMORY_MB' on resource provider '35b78f3b-8e59-4f2f-8cad-eaf116b7c1c7'. The requested amount would exceed the capacity.

In this case you are racing with the other instances for that host. Basically, when doing a multi-create, if any VM fails to boot it will go to the next host in its alternate host list and try to create an allocation against the first host in that list. However, when the alternate host list was created, none of the VMs had been spawned yet. By the time the retry arrives at the conductor, one of the other VMs could have been scheduled to that host, either as a first choice or because that other VM retried first and won the race. When that happens we try the next host in the list, where we can race again. Since the retries happen at the cell-conductor level, without going back to the scheduler, we do not re-check the current state of the host against the anti-affinity filter or weigher during the retry; so while the alternate host was valid initially, it can be invalid by the time we try to use it. The only way to fix that would be for retries to not use alternate hosts and instead re-run the full scheduling process, so each retry can make a decision based on the current state of the cloud rather than the old view.

> We've already raised "max_attempts" to 5 from the default of 3 and will raise it further. That said, what are the recommendations for the rest of the settings?

https://bugzilla.redhat.com/show_bug.cgi?id=1759545#c8 has some suggestions, but tl;dr: it should be safe to set max_attempts=10 if you also set subset_size=15 and shuffle_best_same_weighed_hosts=true. That said, I really would not put max_attempts over 10; max_attempts=5 should be more than enough, and subset_size=15 is a little bit arbitrary: the best value depends on the typical size of your deployments and the size of your cloud.

randomize_allocation_candidates helps if and only if you have limited the number of allocation candidates returned by placement to a subset of your cloud's hosts. E.g., if you set the placement allocation-candidate limit to 10 for a cloud with 100 hosts, then you should set randomize_allocation_candidates=true so that you do not get a bias that packs hosts based on the natural DB order. The default limit for allocation candidates is 1000, so unless you have more than 1000 hosts, or have changed that limit, you do not need to set it.
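For reference, that maps to roughly the following in nova.conf ("subset_size" is the host_subset_size option; section names per the Queens docs, so double-check them):

  [scheduler]
  # Build retries; 5 is normally plenty, do not go above 10.
  max_attempts = 10

  [filter_scheduler]
  # Pick randomly among the top N weighed hosts instead of always the best.
  host_subset_size = 15
  # Shuffle hosts that tie on weight so identical requests fan out.
  shuffle_best_same_weighed_hosts = true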
Hey Melanie, Sean,

Thank you! That should cover most of our use cases. Is there any downside to a "subset_size" that is larger than the actual number of computes? We have some envs with 4 computes, and others with 100+.

Laurent
On Wed, 2020-05-20 at 11:32 -0400, Laurent Dumont wrote:
> Is there any downside to a "subset_size" that is larger than the actual number of computes? We have some envs with 4 computes, and others with 100+.

It will basically make the weighers irrelevant. When you use subset_size, we select randomly from the first "subset_size" hosts in the returned list, so if subset_size is equal to or larger than the total number of hosts, it is just a random selection from the hosts that passed the filters/placement query.
So you want subset_size to be proportionally small compared to the number of available hosts (an order of magnitude or two smaller), and proportionally equivalent (within an order of magnitude or so) to your typical concurrent multi-create request. You want it small relative to the cloud so that the weighers remain statistically relevant, and similar to the size of the multi-create so that the probability of the same host being selected for two instances stays low.
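To put rough numbers on that: with ~100 computes and multi-creates of 14-40 instances, a subset_size somewhere around 10-15 satisfies both rules of thumb; with only 4 computes, anything above 2 or 3 already degenerates into pure random placement, so there is not much to tune there.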
So we ended up increasing both max_attempts and host_subset_size, and it fixed our issue. Hooray. I think I saw a KB article from Red Hat on that exact issue, but I can't find the link anymore...

Thank you to Sean and Melanie! :)