<div dir="ltr">On 10 October 2015 at 23:47, Clint Byrum <span dir="ltr"><<a href="mailto:clint@fewbar.com" target="_blank">clint@fewbar.com</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">> Per before, my suggestion was that every scheduler tries to maintain a copy<br>

> of the cloud's state in memory (in much the same way, per the previous<br>

> example, as every router on the internet tries to make a route table out of<br>

> what it learns from BGP).  They don't have to be perfect.  They don't have<br>

> to be in sync.  As long as there's some variability in the decision making,<br>

> they don't have to update when another scheduler schedules something (and<br>

> you can make the compute node send an immediate update when a new VM is<br>

> run, anyway).  They all stand a good chance of scheduling VMs well<br>

> simultaneously.<br>

><br>

<br>

</span>I'm quite in favor of eventual consistency and retries. Even if we had<br>

a system of perfect updating of all state records everywhere, it would<br>

break sometimes and I'd still want to not trust any record of state as<br>

being correct for the entire distributed system. However, there is an<br>

efficiency win gained by staying _close_ to correct. It is actually a<br>

function of the expected entropy. The more concurrent schedulers, the<br>

more entropy there will be to deal with.<br></blockquote><div><br></div><div>... and the fewer the servers in total, the larger the entropy as a proportion of the whole system (if that's a thing; it's a long time since I did physical chemistry).  But consider the use cases:<br><br></div><div>1. I have a small cloud, and I run two schedulers for redundancy.  There's a good possibility that, when the cloud is loaded, the schedulers occasionally make poor decisions.  We'd certainly have to consider how likely that is.<br><br></div><div>2. I have a large cloud, and I run 20 schedulers for redundancy.  There's a good chance that a scheduler is out of date on its information.  But there could be several hundred hosts willing to satisfy a scheduling request, and even among the hosts it has stale information for, there's a low chance that any given one is so close to its threshold that it won't run the VM in question, so the odds are good that it will pick a host that's happy to satisfy the request.<br><br><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">

> But to be fair, we're throwing made up numbers around at this point.  Maybe<br>

> it's time to work out how to test this for scale in a harness - which is<br>

> the bit of work we all really need to do this properly, or there's no proof<br>

> we've actually helped - and leave people to code their ideas up?<br>

<br>

</span>I'm working on adding meters for rates and amounts of messages and<br>

queries that the system does right now for performance purposes. Rally<br>

though, is the place where I'd go to ask "how fast can we schedule things<br>

right now?".<br></blockquote><br></div><div class="gmail_quote">My only concern is that this means testing a real cloud at scale, and I haven't got any more firstborn to sell for hardware, so I wonder if we can fake up a compute node in our test harness.<br>-- <br></div><div class="gmail_quote">Ian.<br></div></div></div>
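<div>P.S. To make the "fake up a compute node" idea concrete, here is a rough sketch of the kind of toy harness I have in mind. It is not based on any existing Nova or Rally code: FakeComputeNode, Scheduler, simulate and all the parameters (capacity of 16, one-unit VMs, a periodic refresh) are made-up names and numbers for illustration only. Each scheduler keeps its own possibly stale copy of free capacity, picks at random among hosts it believes can fit the VM, and the fake node has the final say.<br></div>

```python
import random


class FakeComputeNode:
    """Hypothetical in-memory stand-in for a compute node (no hypervisor)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.used = 0

    def free(self):
        return self.capacity - self.used

    def claim(self, size):
        """The node is the source of truth: accept only if the VM really fits."""
        if self.free() >= size:
            self.used += size
            return True
        return False


class Scheduler:
    """Keeps its own, possibly out-of-date, view of every node's free capacity."""

    def __init__(self, nodes, rng):
        self.nodes = nodes
        self.rng = rng
        self.view = {}
        self.refresh()

    def refresh(self):
        """Stands in for a periodic state update from the compute nodes."""
        self.view = {n.name: n.free() for n in self.nodes}

    def schedule(self, size):
        candidates = [n for n in self.nodes if self.view[n.name] >= size]
        if not candidates:
            return "no_host"
        # Random choice supplies the variability between schedulers.
        node = self.rng.choice(candidates)
        if node.claim(size):
            self.view[node.name] -= size
            return "placed"
        # Our information was stale; learn the truth for this one node.
        self.view[node.name] = node.free()
        return "stale"


def simulate(num_nodes, num_schedulers, requests, refresh_every, seed=0):
    """Send one-unit VM requests to randomly chosen schedulers; count outcomes."""
    rng = random.Random(seed)
    nodes = [FakeComputeNode(f"node{i}", capacity=16) for i in range(num_nodes)]
    schedulers = [Scheduler(nodes, rng) for _ in range(num_schedulers)]
    counts = {"placed": 0, "stale": 0, "no_host": 0}
    for i in range(requests):
        if i % refresh_every == 0:
            for s in schedulers:
                s.refresh()
        counts[rng.choice(schedulers).schedule(size=1)] += 1
    return counts


if __name__ == "__main__":
    # Made-up sizes for the two use cases: a small loaded cloud, a large one.
    print("small:", simulate(num_nodes=4, num_schedulers=2,
                             requests=80, refresh_every=20))
    print("large:", simulate(num_nodes=300, num_schedulers=20,
                             requests=2000, refresh_every=200))
```

<div>The "stale" count is the number of retries caused by out-of-date information, which is the number we would want to compare between the small-cloud and large-cloud cases. A real harness would need realistic request sizes, update latencies and filters, none of which this toy attempts to model.<br></div>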