[openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

Cheng, Yingxin yingxin.cheng at intel.com
Wed Feb 24 15:06:05 UTC 2016

Sorry for the late reply.

On 22 February 2016 at 18:45, John Garbutt wrote:
> On 21 February 2016 at 13:51, Cheng, Yingxin <yingxin.cheng at intel.com> wrote:
> > On 19 February 2016 at 5:58, John Garbutt wrote:
> >> On 17 February 2016 at 17:52, Clint Byrum <clint at fewbar.com> wrote:
> >> > Excerpts from Cheng, Yingxin's message of 2016-02-14 21:21:28 -0800:
> >> Long term, I see a world where there are multiple scheduler Nova is
> >> able to use, depending on the deployment scenario.
> >
> > Technically, what I've implemented is a new type of scheduler host
> > manager `shared_state_manager.SharedHostManager`[1] with the ability
> > to synchronize host states directly from resource trackers.
> Thats fine. You just get to re-use more code.
> Maybe I should say multiple scheduling strategies, or something like that.
> >> So a big question for me is, does the new scheduler interface work if
> >> you look at slotting in your prototype scheduler?
> >>
> >> Specifically I am thinking about this interface:
> >> https://github.com/openstack/nova/blob/master/nova/scheduler/client/__init__.py
> I am still curious if this interface is OK for your needs?

The interfaces added on the scheduler side are:
1. "notify_schedulers": I can remove this one, because the same message can be sent through "send_commit" instead.
2. "send_commit": this one is required, because there has to be a way to send state updates from a compute node to a specific scheduler.

The interfaces added or changed on the compute side are:
1. The "report_host_state" interface is required. When a scheduler comes up, it must ask each compute node for its latest host state. It is also needed when a scheduler detects that its host state is out of sync and has to ask the compute node for a synced state (this should be rare, caused by network issues or bugs).
2. A new "claim" parameter should be added to the "build_and_run_instance" interface, because the compute node should reply whether a scheduler's claim succeeded. The scheduler can thus track its own claims and be updated immediately by successful claims from other schedulers. The parameter also lets the compute node tell whether a decision came from the "shared-state" scheduler; that's the *tricky* part of supporting both types of schedulers.
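
To make the message flow above concrete, here is a minimal toy sketch in Python. It is not the prototype's code: only the names "send_commit", "report_host_state", "build_and_run_instance" and "claim" come from this thread, while the classes, the sequence-number check and the RAM-only resource model are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    """Hypothetical shape of a scheduler's resource claim."""
    scheduler_id: str
    ram_mb: int


class Scheduler:
    """Toy scheduler that keeps a cached view of each compute node."""

    def __init__(self, name):
        self.name = name
        self.view = {}  # ComputeNode -> {"seq": int, "free_ram_mb": int}

    def sync(self, node):
        # Full resync: used at startup or when the cache is out of sync.
        self.view[node] = node.report_host_state()

    def send_commit(self, node, seq, ram_delta_mb):
        # Incremental update pushed by a compute node (assumed payload).
        cached = self.view.get(node)
        if cached is None or seq != cached["seq"] + 1:
            self.sync(node)  # an update was missed, ask for a full state
            return
        cached["seq"] = seq
        cached["free_ram_mb"] += ram_delta_mb

    def schedule(self, node, ram_mb):
        if self.view[node]["free_ram_mb"] < ram_mb:
            return False
        ok = node.build_and_run_instance(Claim(self.name, ram_mb))
        if not ok:
            self.sync(node)  # our view was stale, refresh it
        return ok


class ComputeNode:
    """Toy stand-in for the resource tracker on one compute host."""

    def __init__(self, free_ram_mb, schedulers):
        self.free_ram_mb = free_ram_mb
        self.seq = 0  # version of the host state
        self.schedulers = schedulers

    def report_host_state(self):
        return {"seq": self.seq, "free_ram_mb": self.free_ram_mb}

    def build_and_run_instance(self, claim=None):
        # claim=None would mean the request came from a scheduler that
        # does not use shared state; that path is not modelled here.
        if claim is None:
            return None
        if claim.ram_mb > self.free_ram_mb:
            return False  # claim rejected, nothing changes
        self.free_ram_mb -= claim.ram_mb
        self.seq += 1
        for sched in self.schedulers:
            sched.send_commit(self, self.seq, -claim.ram_mb)
        return True


a, b = Scheduler("a"), Scheduler("b")
node = ComputeNode(1024, [a, b])
a.sync(node)
b.sync(node)
a.schedule(node, 768)  # b's cached view is updated by the commit as well
```

In this sketch a successful claim from scheduler "a" reaches scheduler "b" through the commit, and a rejected claim triggers a full resync of the losing scheduler's view.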

> Making this work across both types of scheduler might be tricky, but I think it is
> worthwhile.
> >> > This mostly agrees with recent tests I've been doing simulating
> >> > 1000 compute nodes with the fake virt driver.
> >>
> >> Overall this agrees with what I saw in production before moving us to
> >> the caching scheduler driver.
> >>
> >> I would love a nova functional test that does that test. It will help
> >> us compare these different schedulers and find the strengths and weaknesses.
> >
> > I'm also working on implementing the functional tests of nova
> > scheduler, there is a patch showing my latest progress:
> > https://review.openstack.org/#/c/281825/
> >
> > IMO scheduler functional tests are not good at testing real
> > performance of different schedulers, because all of the services are
> > running as green threads instead of real processes. I think the better
> > way to analysis the real performance and the strengths and weaknesses
> > is to start services in different processes with fake virt driver(i.e.
> > Clint Byrum's work) or Jay Pipe's work in emulating different designs.
> Having an option to run multiple process seems OK, if its needed.
> Although starting with a greenlet version that works in the gate seems best.
> Lets try a few things, and see what predicts the results in real environments.


> >> I am really interested how your prototype and the caching scheduler compare?
> >> It looks like single node scheduler will perform in a very similar
> >> way, but multiple schedulers are less likely to race each other,
> >> although there are quite a few races?
> >
> > I think the major weakness of caching scheduler comes from its host
> > state update model, i.e. updating host states from db every
> > `CONF.scheduler_driver_task_period` seconds.
> The trade off is that consecutive scheduler decisions don't race each other, at all.
> Say you have a burst of 1000 instance builds and you want to avoid build failures
> (but accept sub optimal placement, and you are using fill first), thats a very good
> trade off.
> Consider a burst of 1000 deletes, it may take you 60 seconds to notice they are
> all deleted and you have lots more free space, but that doesn't cause build
> failures like excessive races for the same resources will, at least under the usual
> conditions where you are not yet totally full (i.e. non-HPC use cases).
> I was shocked how well the caching_scheduler works in practice. I assumed it
> would be terrible, but when we tried it, it worked well.
> Its a million miles from perfect, but handy for many deployment scenarios.

Yeah, it works well. But you know that : )
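
For readers following the trade-off John describes, here is a rough sketch of the periodic-refresh model. It is an illustration only, not nova's caching scheduler code: the class, the slot-counting resource model and the explicit `now` argument are all assumptions, and the refresh period stands in for `CONF.scheduler_driver_task_period`.

```python
class CachingSchedulerSketch:
    """Toy model: refresh host states periodically, consume from the cache."""

    def __init__(self, db, period_s=60):
        self.db = db              # dict host -> free slots, stands in for the db
        self.period_s = period_s  # plays the role of the task period
        self.cache = {}
        self.last_refresh = None

    def _maybe_refresh(self, now):
        # Reload the whole view only when the period has elapsed.
        if self.last_refresh is None or now - self.last_refresh >= self.period_s:
            self.cache = dict(self.db)
            self.last_refresh = now

    def select_host(self, now):
        self._maybe_refresh(now)
        candidates = [h for h, free in self.cache.items() if free > 0]
        if not candidates:
            return None
        # Fill first: prefer the host with the least free space left.
        host = min(candidates, key=lambda h: (self.cache[h], h))
        # Consume locally so the next decision sees this one immediately.
        self.cache[host] -= 1
        return host
```

Because each decision consumes from the in-memory view, consecutive decisions made by the same scheduler cannot pick a host that an earlier decision already filled. Space freed in the meantime stays invisible until the next refresh.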

> Thanks,
> johnthetubaguy
> PS
> If you need a 1000 node test cluster to play with, its worth applying to use this
> one:
> http://go.rackspace.com/developercloud
> I am happy to recommend these efforts gets some time with that hardware.

Many thanks, sincerely.

Given the current situation, I would prefer to test the prototype first with compute service processes using fake drivers. The scheduler performance results are cleaner that way, there is no need to re-synchronize code across machines after each bug fix, and no need to collect logs from multiple VMs for a quick analysis.

Once the code is stable enough, it will be worthwhile to confirm its real performance improvement on the powerful 1000-node test cluster.

