[openstack-dev] Scheduler proposal

Boris Pavlovic boris at pavlovic.me
Sun Oct 11 07:02:39 UTC 2015


To everybody,

I'm just curious why we need such complexity.


Let's take a look from the other side:
1) Information about all hosts (even in the case of 100k hosts) will be less
than 1 GB
2) Servers that run the scheduler service usually have at least 64 GB of RAM
on board
3) math.log(100000, 2) < 17 (a binary search per rule is at most ~17
comparisons)
4) We have fewer than 20 rules for scheduling
5) Information about hosts is updated every 60 seconds (no updates means the
host is dead)
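
A quick back-of-the-envelope check in Python (the ~10 KB per-host record size
here is an assumption, not a measured figure):

import math

hosts = 100000
bytes_per_host = 10 * 1024                   # assumed size of one host's stats
print(hosts * bytes_per_host / 1024.0 ** 3)  # ~0.95 GiB -> fits in RAM easily

print(math.log(hosts, 2))                    # ~16.6 comparisons per binary search
# with < 20 rules, one find_host() call is on the order of a few hundred
# comparisons, i.e. microseconds of pure Python work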


Based on this information:
1) We can store everything in the RAM of a single server
2) We can use Python
3) Information about hosts is temporary data and shouldn't be kept in
persistent storage


The simplest architecture to cover this:
1) A single RPC service that has two methods: find_host(rules) and
update_host(host, data)
2) Store information about hosts in a dict (host_name -> data)
3) Create a binary tree for each rule and update it on each host update
4) Make an algorithm that uses the binary trees to find a host based on the
rules (a minimal sketch of steps 1-4 follows this list)
5) Each service (compute node, volume node, Neutron, ...) sends updates about
the hosts that it manages (cross-service scheduling)
6) Make an algorithm that syncs host stats in memory between the different
schedulers
7) ...
8) PROFIT!
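
Here is a minimal sketch of steps 1-4, just to make them concrete. It is pure
stdlib; a sorted list plus bisect stands in for the per-rule binary tree, and
the RPC layer, cross-service updates and scheduler-to-scheduler sync are left
out. Apart from find_host/update_host, all names are illustrative assumptions:

import bisect
import time

STALE_AFTER = 60          # seconds without an update -> host considered dead


class InMemoryScheduler(object):

    def __init__(self, rules):
        # rules: the attributes we schedule on, e.g. {"free_ram_mb", "free_disk_gb"}
        self.hosts = {}                           # host_name -> {"data": ..., "ts": ...}
        self.indexes = {r: [] for r in rules}     # rule -> sorted [(value, host_name)]

    def update_host(self, host_name, data):
        # called by compute/volume/network services roughly every 60 seconds
        old = self.hosts.get(host_name)
        self.hosts[host_name] = {"data": data, "ts": time.time()}
        for rule, index in self.indexes.items():
            if old is not None:
                index.remove((old["data"][rule], host_name))  # O(n); a real tree avoids this
            bisect.insort(index, (data[rule], host_name))

    def find_host(self, requested):
        # requested: {rule: minimum value}, e.g. {"free_ram_mb": 2048}
        candidates = None
        for rule, minimum in requested.items():
            index = self.indexes[rule]
            pos = bisect.bisect_left(index, (minimum, ""))    # binary search per rule
            matching = {host for _, host in index[pos:]}
            candidates = matching if candidates is None else candidates & matching
            if not candidates:
                return None
        now = time.time()
        for host in candidates:
            if now - self.hosts[host]["ts"] < STALE_AFTER:    # skip dead hosts
                return host
        return None


# example usage
s = InMemoryScheduler({"free_ram_mb", "free_disk_gb"})
s.update_host("node-1", {"free_ram_mb": 4096, "free_disk_gb": 100})
s.update_host("node-2", {"free_ram_mb": 512, "free_disk_gb": 10})
print(s.find_host({"free_ram_mb": 2048, "free_disk_gb": 20}))  # -> node-1

Note that list.remove() in update_host is O(n); a real balanced tree (or
something like sortedcontainers.SortedList) would keep updates at O(log n),
which matters at 100k hosts.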

It's:
1) Simple to manage
2) Simple to understand
3) Simple to calculate scalability limits for
4) Simple to integrate into the current OpenStack architecture


As a future bonus, we can implement scheduler-per-AZ functionality, so that
each scheduler stores information only about its own AZ, and each AZ can have
its own RabbitMQ servers, for example, which would give us horizontal
scalability in terms of AZs.
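
Illustratively (building on the sketch above; the AZ names and the routing
helper are purely hypothetical, not an existing API):

schedulers = {
    "az-1": InMemoryScheduler({"free_ram_mb", "free_disk_gb"}),
    "az-2": InMemoryScheduler({"free_ram_mb", "free_disk_gb"}),
}

def update_host_in_az(az, host_name, data):
    # a compute node in az-1 only ever talks to the az-1 scheduler,
    # potentially over that AZ's own RabbitMQ cluster
    schedulers[az].update_host(host_name, data)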


So do we really need Cassandra, Mongo, ... and other web-scale solutions for
such a simple task?


Best regards,
Boris Pavlovic

On Sat, Oct 10, 2015 at 11:19 PM, Clint Byrum <clint at fewbar.com> wrote:

> Excerpts from Chris Friesen's message of 2015-10-09 23:16:43 -0700:
> > On 10/09/2015 07:29 PM, Clint Byrum wrote:
> >
> > > Even if you figured out how to make the in-memory scheduler crazy fast,
> > > there's still value in concurrency for other reasons. No matter how
> > > fast you make the scheduler, you'll be a slave to the response time of
> > > a single scheduling request. If you take 1ms to schedule each node
> > > (including just reading the request and pushing out your scheduling
> > > result!) you will never achieve greater than 1000/s. 1ms is way lower
> > > than it's going to take just to shove a tiny message into RabbitMQ or
> > > even 0mq. So I'm pretty sure this is o-k for small clouds, but would be
> > > a disaster for a large, busy cloud.
> > >
> > > If, however, you can have 20 schedulers that all take 10ms on average,
> > > and have the occasional lock contention for a resource counter
> resulting
> > > in 100ms, now you're at 2000/s minus the lock contention rate. This
> > > strategy would scale better with the number of compute nodes, since
> > > more nodes means more distinct locks, so you can scale out the number
> > > of running servers separate from the number of scheduling requests.
> >
> > As far as I can see, moving to an in-memory scheduler is essentially
> orthogonal
> > to allowing multiple schedulers to run concurrently.  We can do both.
> >
>
> Agreed, and I want to make sure we continue to be able to run concurrent
> schedulers.
>
> Going in memory won't reduce contention for the same resources. So it
> will definitely schedule faster, but it may also serialize with concurrent
> schedulers sooner, and thus turn into a situation where scaling out more
> nodes means the same, or even less throughput.
>
> Keep in mind, I actually think we give our users _WAY_ too much power
> over our clouds, and I actually think we should simply have flavor based
> scheduling and let compute nodes grab node reservation requests directly
> out of flavor based queues based on their own current observation of
> their ability to service it.
>
> But I understand that there are quite a few clouds now that have been
> given shiny dynamic scheduling tools and now we have to engineer for
> those.
>