<div class="gmail_quote"><div>I have implemented a (single-node) constraint-based / rules-based scheduler that attempts to find a "good" solution to potentially conflicting rules.  I used it to implement eday's "openstack:location=machine1.rack1.room1.dfw" type pragma that we discussed in the past.  I think this could be helpful for what you're describing here, so I encourage you to check it out:</div>

<div><a href="https://code.launchpad.net/~justin-fathomdb/nova/constraint-scheduler">https://code.launchpad.net/~justin-fathomdb/nova/constraint-scheduler</a></div><div><br></div><div>(I broke the unit tests while implementing the directed-location constraint in the derived branch; I'm going to fix that today.)</div>

<div><br></div><div>As for the distributed scheduling approach, I like it.  I'd like to focus first on the conceptual approach:</div><div><ul><li>A scheduler receives an "allocation" request.</li><li>It evaluates it against all local providers, giving each one a "score"</li>

<li>It collects the responses from recursively sending the request to any child schedulers</li><li>It aggregates all these responses and selects the highest scoring node</li><li>It sends a "go-with-allocate" to the appropriate child scheduler (or does it locally)</li>

<li>If the selected node is no longer available, we start again</li></ul></div><div><br></div><div>Now, there are several optimizations:</div><div><ul><li>We can use a "clever" solver such that we don't need to evaluate against every local node (this is in my branch)</li>

<li>We may only return the top N solutions to minimize the data we pass around</li><li>We can try to limit the number of child schedulers to whom we forward the request:</li><ul><li>We may use zones information to rule out a child entirely</li>

<li>We may use other static information to rule out a child (e.g. a particular child might not be "HIPAA compliant")</li><li>We may choose to do this heuristically, for example, sending to a first child, and then considering whether we have found a 'good enough' response to stop polling children</li>

<li>We may choose to forward to the "most likely candidate" child schedulers first, to make the previous optimization more effective</li></ul><li>We may use a threshold search, i.e. send a first request with a "perfect matches only" criterion, and then repeat with gradually more relaxed criteria</li>

</ul></div><div>All standard CS stuff.  However, we should not let the existence of the optimizations get in the way of building a "brute force" implementation first, where every request is fully evaluated on every provider in the entire scheduler tree.  For Cactus-size deployments, this will still be more than fast enough, and we should be able to get it merged in time.  When we launch an instance we're probably going to be copying a gigabyte of data around, so these optimizations really aren't too important in that light.  This also structures our development - the optimizations can be implemented in separate manageable patches.</div>
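<div><br></div><div>To make the "brute force" version concrete, here is a minimal Python sketch of the conceptual steps listed above.  It is not code from my branch: the Provider and Scheduler classes, the free_slots field, and the score callback are all hypothetical names made up for illustration, and child schedulers are plain in-process objects rather than remote services.</div>

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Provider:
    """A local host that can accept allocations (hypothetical model)."""
    name: str
    free_slots: int

    def allocate(self) -> bool:
        # Availability may have changed since scoring, so this can fail.
        if self.free_slots > 0:
            self.free_slots -= 1
            return True
        return False


# A score callback returns a float, or None if the provider cannot be used.
ScoreFn = Callable[[Provider], Optional[float]]


@dataclass
class Scheduler:
    name: str
    providers: List[Provider] = field(default_factory=list)
    children: List["Scheduler"] = field(default_factory=list)

    def evaluate(self, score: ScoreFn) -> List[Tuple[float, "Scheduler", Provider]]:
        """Score every provider in this subtree (no pruning at all)."""
        results = []
        for provider in self.providers:
            value = score(provider)
            if value is not None:
                results.append((value, self, provider))
        for child in self.children:
            # Recursively collect the responses of the child schedulers.
            results.extend(child.evaluate(score))
        return results

    def commit(self, provider: Provider) -> bool:
        """Stand-in for the "go-with-allocate" message."""
        return provider.allocate()

    def schedule(self, score: ScoreFn, max_attempts: int = 3) -> Optional[Provider]:
        for _ in range(max_attempts):
            candidates = self.evaluate(score)
            if not candidates:
                return None
            _, owner, provider = max(candidates, key=lambda c: c[0])
            if owner.commit(provider):
                return provider
            # The selected node is no longer available: start again.
        return None
```

<div>For example, with a root scheduler holding provider "a" (1 free slot) and one child scheduler holding provider "b" (3 free slots), scoring by free slots sends the first request to "b".  The max_attempts bound is my own addition, so that a scoring function working from stale data cannot loop forever.</div>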

<div><br></div><div>This also helps me think about what data the parent schedulers need about their children.  In the "brute force" implementation schedulers only need the list of their children.  If we want to start to do filtering of requests, schedulers need appropriate static metadata (child zone information, HIPAA compliance).  Dynamic information (e.g. real-time availability) may be used to intelligently order the child requests, but it shouldn't matter if this dynamic information is out of date, because we're already bailing out when something is 'good enough' and not really looking for the 'optimal node'.  Out-of-date dynamic information will reduce efficiency (I may poll the wrong child first) but should not affect correctness.  But for Cactus, I just need the list of my children.</div>
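<div><br></div><div>A rough sketch of how static and dynamic information play different roles, again in Python with hypothetical names (Child, tags, hinted_free, pick_child are not from any existing code).  Static tags are used as a hard filter; the possibly stale hinted_free value only decides the polling order; and polling stops at the first 'good enough' answer.</div>

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Child:
    name: str
    tags: Set[str]        # static metadata, e.g. {"hipaa", "zone:dfw"}
    best_score: float     # what the child would actually answer when polled
    hinted_free: int = 0  # dynamic hint cached by the parent; may be stale

    def poll(self) -> float:
        return self.best_score


def pick_child(
    children: List[Child], required_tags: Set[str], good_enough: float
) -> Tuple[Optional[str], Optional[float], List[str]]:
    """Return (chosen child, its score, names of the children polled)."""
    # Static metadata rules children out entirely.
    eligible = [c for c in children if required_tags.issubset(c.tags)]
    # Dynamic hints only decide who gets asked first.
    eligible.sort(key=lambda c: c.hinted_free, reverse=True)

    polled: List[str] = []
    best: Optional[Tuple[str, float]] = None
    for child in eligible:
        score = child.poll()
        polled.append(child.name)
        if best is None or score > best[1]:
            best = (child.name, score)
        if score >= good_enough:
            break  # stop polling; we are not looking for the optimum
    if best is None:
        return None, None, polled
    return best[0], best[1], polled
```

<div>If a stale hint ranks a nearly full child first, that child is polled, gives a poor score, and the loop moves on to the next one.  The request still lands on a child that meets the threshold; the only cost is the extra poll.</div>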

<div><br></div><div>What I would really like to see is the ability to use the scheduler to combine clouds not under the same control.  For example, a private cloud could burst onto one or more public clouds; all under the control of a local scheduler.  This needs a few things:</div>

<div><ol><li>The schedulers should communicate with each other over HTTP, and can't really use the message queues because of the tight coupling needed</li><li>The public API interface should expose the same HTTP interface, so that it can be used as a child scheduler</li>

<li>We obviously can't rely on a centralized database</li></ol></div><div>(I don't really understand where the need (or desire) for a centralized database comes from?)</div><div><br></div><div>Of course, we won't get to my multi-cloud dream in Cactus, because we have to discuss it and not just implement it.  Nonetheless, I see this approach as (1) similar to what you're suggesting, (2) simplifying the coding work, and (3) taking us to a great place.</div>
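<div><br></div><div>To illustrate the first two requirements in the numbered list above, here is a small Python sketch in which a local child and a remote cloud are interchangeable from the parent's point of view.  Everything in it is an assumption for the sake of illustration: the "/evaluate" path, the JSON shapes, and the class names are invented, and the HTTP call is injected as a plain function so that no real network or API is implied.</div>

```python
import json
from typing import Callable, Dict, List, Optional, Protocol


class ChildScheduler(Protocol):
    """Anything a parent scheduler can forward a request to."""

    def evaluate(self, request: Dict) -> List[Dict]: ...


class LocalChild:
    def __init__(self, hosts: Dict[str, int]):
        self.hosts = hosts  # host name -> free memory in MB (hypothetical)

    def evaluate(self, request: Dict) -> List[Dict]:
        need = request["memory_mb"]
        return [
            {"host": host, "score": free - need}
            for host, free in sorted(self.hosts.items())
            if free >= need
        ]


class HttpChild:
    """A child reached over HTTP, e.g. another cloud's public API."""

    def __init__(self, base_url: str, post: Callable[[str, str], str]):
        self.base_url = base_url
        self.post = post  # post(url, json_body) returns the response body

    def evaluate(self, request: Dict) -> List[Dict]:
        # "/evaluate" is a made-up path, not an existing endpoint.
        body = self.post(self.base_url + "/evaluate", json.dumps(request))
        return json.loads(body)


def best_offer(children: List[ChildScheduler], request: Dict) -> Optional[Dict]:
    """Ask every child, local or remote, and keep the highest score."""
    offers = [offer for child in children for offer in child.evaluate(request)]
    return max(offers, key=lambda o: o["score"]) if offers else None
```

<div>The parent only ever calls evaluate(), so it holds no shared database handle and no message-queue connection to the remote cloud.</div>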

<div><br></div><div><br></div><div>Justin </div><div><br></div><div><br></div></div>