[openstack-dev] A simple way to improve nova scheduler

Joe Gordon joe.gordon0 at gmail.com
Fri Jul 19 23:14:10 UTC 2013


On Fri, Jul 19, 2013 at 3:13 PM, Sandy Walsh <sandy.walsh at rackspace.com> wrote:

>
>
> On 07/19/2013 05:36 PM, Boris Pavlovic wrote:
> > Sandy,
> >
> > I don't think that we have such problems here,
> > because the scheduler doesn't poll compute_nodes.
> > The situation is different: compute_nodes notify the scheduler about
> > their state (instead of updating their state in the DB).
> >
> > So, for example, if the scheduler sends a request to a compute_node, the
> > compute_node is able to make an rpc call to the schedulers immediately
> > (not after 60 sec).
> >
> > So there are almost no races.
>
> There are races that occur between the eventlet request threads. This is
> why the scheduler has been switched to single threaded and we can only
> run one scheduler.
>
> This problem may have been eliminated with the work that Chris Behrens
> and Brian Elliott were doing, but I'm not sure.
>


Speaking of Chris Behrens: "Relying on anything but the DB for current
memory free, etc, is just too laggy… so we need to stick with it, IMO."
http://lists.openstack.org/pipermail/openstack-dev/2013-June/010485.html

Although there is some elegance to the proposal here, I have some concerns.

If just using RPC broadcasts from compute to schedulers to keep track of
things, we get two issues:

* How do you bring a new scheduler up in an existing deployment and make it
get the full state of the system?
* Broadcasting RPC updates from compute nodes to the scheduler means every
scheduler has to process the same RPC message. And if a deployment hits
the point where the number of compute updates is consuming 99 percent of
the scheduler's time, just adding another scheduler won't fix anything, as it
will get bombarded too.
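
To make the second point concrete, here is a rough back-of-the-envelope
sketch (my own toy model, not nova code). It assumes each compute node sends
one update per 60-second periodic interval, the default mentioned in the
original proposal; the function names are hypothetical, and real update rates
would depend on how often host state actually changes.

```python
# Hypothetical model: none of these names come from nova.

def updates_per_second(compute_nodes, interval_sec=60.0):
    """Total updates emitted per second if every node reports once per interval."""
    return compute_nodes / interval_sec


def per_scheduler_load(compute_nodes, schedulers, fanout, interval_sec=60.0):
    """Messages each scheduler must process per second.

    fanout=True:  every scheduler receives every update (broadcast).
    fanout=False: updates are split evenly across schedulers (shared queue).
    """
    total = updates_per_second(compute_nodes, interval_sec)
    return total if fanout else total / schedulers


if __name__ == "__main__":
    for n in (1, 2, 4):
        broadcast = per_scheduler_load(10000, n, fanout=True)
        shared = per_scheduler_load(10000, n, fanout=False)
        print(f"{n} scheduler(s): broadcast={broadcast:.1f}/s shared={shared:.1f}/s")
```

Under broadcast the per-scheduler figure stays flat no matter how many
schedulers run, which is the concern: extra schedulers add capacity for
placement requests but not for digesting updates.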

Also, OpenStack is already deeply invested in using the central DB model for
the state of the 'world', and while I am not against changing that, I think
we should evaluate that switch in a larger context.



>
> But certainly, the old approach of having the compute node broadcast
> status every N seconds is not suitable and was eliminated a long time ago.
>
> >
> >
> > Best regards,
> > Boris Pavlovic
> >
> > Mirantis Inc.
> >
> >
> >
> > On Sat, Jul 20, 2013 at 12:23 AM, Sandy Walsh <sandy.walsh at rackspace.com> wrote:
> >
> >
> >
> >     On 07/19/2013 05:01 PM, Boris Pavlovic wrote:
> >     > Sandy,
> >     >
> >     > Hm, I don't know that algorithm. But our approach doesn't have
> >     > exponential exchange.
> >     > I don't think that in a 10k node cloud we will have problems with 150
> >     > RPC calls/sec. Even in 100k we will have only 1.5k RPC calls/sec.
> >     > Moreover, compute nodes already update their state in the DB through
> >     > conductor, which produces the same number of RPC calls.
> >     >
> >     > So I don't see any explosion here.
> >
> >     Sorry, I was commenting on Soren's suggestion from way back
> >     (essentially listening on a separate exchange for each unique flavor
> >     ... so no scheduler was needed at all). It was a great idea, but fell
> >     apart rather quickly.
> >
> >     The existing approach the scheduler takes is expensive (asking the db
> >     for state of all hosts) and polling the compute nodes might be do-able,
> >     but you're still going to have latency problems waiting for the
> >     responses (the states are invalid nearly immediately, especially if a
> >     fill-first scheduling algorithm is used). We ran into this problem
> >     before in an earlier scheduler implementation. The round-tripping kills.
> >
> >     We have a lot of really great information on Host state in the form of
> >     notifications right now. I think having a service (or notification
> >     driver) listening for these and keeping the HostState incrementally
> >     updated (and reported back to all of the schedulers via the fanout
> >     queue) would be a better approach.
> >
> >     -S
> >
> >
> >     >
> >     > Best regards,
> >     > Boris Pavlovic
> >     >
> >     > Mirantis Inc.
> >     >
> >     >
> >     > On Fri, Jul 19, 2013 at 11:47 PM, Sandy Walsh <sandy.walsh at rackspace.com> wrote:
> >     >
> >     >
> >     >
> >     >     On 07/19/2013 04:25 PM, Brian Schott wrote:
> >     >     > I think Soren suggested this way back in Cactus to use MQ for compute
> >     >     > node state rather than database and it was a good idea then.
> >     >
> >     >     The problem with that approach was the number of queues went exponential
> >     >     as soon as you went beyond simple flavors. Add Capabilities or other
> >     >     criteria and you get an explosion of exchanges to listen to.
> >     >
> >     >
> >     >
> >     >     > On Jul 19, 2013, at 10:52 AM, Boris Pavlovic <boris at pavlovic.me> wrote:
> >     >     >
> >     >     >> Hi all,
> >     >     >>
> >     >     >>
> >     >     >> At Mirantis, Alexey Ovtchinnikov and I are working on nova scheduler
> >     >     >> improvements.
> >     >     >>
> >     >     >> As far as we can see, the scheduler currently has two major issues:
> >     >     >>
> >     >     >> 1) Scalability. Factors that contribute to bad scalability are these:
> >     >     >> *) Each compute node every periodic task interval (60 sec by default)
> >     >     >> updates resources state in DB.
> >     >     >> *) On every boot request scheduler has to fetch information about all
> >     >     >> compute nodes from DB.
> >     >     >>
> >     >     >> 2) Flexibility. Flexibility suffers due to problems with:
> >     >     >> *) Adding new complex resources (such as big lists of complex objects,
> >     >     >> e.g. as required by PCI Passthrough
> >     >     >> https://review.openstack.org/#/c/34644/5/nova/db/sqlalchemy/models.py)
> >     >     >> *) Using different sources of data in the scheduler, for example from
> >     >     >> cinder or ceilometer
> >     >     >> (as required by Volume Affinity Filter
> >     >     >> https://review.openstack.org/#/c/29343/)
> >     >     >>
> >     >     >>
> >     >     >> We found a simple way to mitigate these issues by avoiding DB usage
> >     >     >> for host state storage.
> >     >     >>
> >     >     >> A more detailed discussion of the problem and one possible
> >     >     >> solution can be found here:
> >     >     >>
> >     >     >> https://docs.google.com/document/d/1_DRv7it_mwalEZzLy5WO92TJcummpmWL4NWsWf0UWiQ/edit#
> >     >     >>
> >     >     >>
> >     >     >> Best regards,
> >     >     >> Boris Pavlovic
> >     >     >>
> >     >     >> Mirantis Inc.
> >     >     >>
> >     >     >> _______________________________________________
> >     >     >> OpenStack-dev mailing list
> >     >     >> OpenStack-dev at lists.openstack.org
> >     >     >> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev