<html><head><meta http-equiv="Content-Type" content="text/html charset=windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><br><div><div>On May 21, 2013, at 3:27 PM, "Day, Phil" &lt;<a href="mailto:philip.day@hp.com">philip.day@hp.com</a>&gt; wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div lang="EN-GB" link="blue" vlink="purple" style="font-family: Helvetica; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: -webkit-auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; "><div class="WordSection1" style="page: WordSection1; "><div style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; ">Hi Folks,<o:p></o:p></div><div style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; "><o:p> </o:p></div><div style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; ">I wondered if anyone else has managed to run multiple filter-schedulers concurrently under a high load?<o:p></o:p></div><div style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; "><o:p> </o:p></div><div style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; ">I’d thought that the race conditions we had in the past (where multiple schedulers pick the same host) had been eliminated through the reworking of the resource tracker / retry mechanism, but whilst it is much better I still see the odd case where a request gets rejected multiple times (and eventually fails) because on each successive host it fails to get the resources the scheduler thought were there.<o:p></o:p></div><div style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; "><o:p> 
</o:p></div><div style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; ">I guess on reflection it’s implicit in any solution which relies on a fail / retry approach to cover the race condition that under a large load the number of retries for any specific request is effectively unlimited, and so no value of max_retries is ever going to be quite big enough. Before I do some more head scratching about how (or if) to make this more robust under load, I thought I’d ask whether others are also trying to run more than one active scheduler.<o:p></o:p></div></div></div></blockquote><br></div><div><br></div><div>Yeah, multiple schedulers are a problem (heck, even a single one is under load :). There's a config item that may help you:</div><div><br></div><div>scheduler_host_subset_size: it defaults to '1', but if you set it higher than 1, the scheduler picks at random from the top 'x' hosts. This can help reduce races by introducing a bit of randomization.</div><div><br></div><div>Also, I feel like when we have conductor managing the retries, things can get a little better. Perhaps we can bump the retries, I dunno. Are you finding your computes kicking the messages back to the scheduler quickly, i.e., is nova-compute detecting quickly that an instance doesn't fit? The resource tracker is supposed to be good about that. If that is working well, you can probably safely bump the # of retries now… and be sure to use the conf item above.</div><div><br></div><div>- Chris</div><div><br></div><div><br></div><div><br></div><div><br></div><br></body></html>