[openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

Jay Pipes jaypipes at gmail.com
Sun Feb 21 19:56:04 UTC 2016

Yingxin, sorry for the delay in responding to this thread. My comments
are inline below.

On 02/17/2016 12:45 AM, Cheng, Yingxin wrote:
> To better illustrate the differences between the shared-state,
> resource-provider and legacy schedulers, I’ve drawn 3 simplified pictures
> [1] emphasizing the location of the resource view, the location of claim
> and resource consumption, and the resource update/refresh pattern in
> the three kinds of schedulers. Hoping I’m correct in the “resource-provider
> scheduler” part.

No, the diagram is not correct for the resource-provider scheduler.

Problems with your depiction of the resource-provider scheduler:

1) There is no proposed cache at all in the resource-provider scheduler 
so all the arrows for "cache refresh" can be eliminated.

2) Claims of resource amounts are done in a database transaction 
atomically within each scheduler process. Therefore there are no "cache 
updates" arrows going back from compute nodes to the resource-provider 
DB. The only time a compute node would communicate with the 
resource-provider DB (and thus the scheduler at all) would be in the 
case of a *failed* attempt to initialize already-claimed resources.
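
To make point 2 concrete, here is a minimal sketch of what an atomic claim 
inside a single database transaction could look like. This is not the code 
from the spec: the `inventory` table, its columns and the `claim()` helper 
are made up for illustration, sqlite3 stands in for MySQL, and the 
conditional UPDATE is just one common way to make the capacity check and 
the consumption a single atomic step.

```python
import sqlite3


def claim(conn, provider_id, amount):
    """Try to claim `amount` units; return True if the claim succeeded."""
    # `with conn` commits on success and rolls back if an exception is raised.
    with conn:
        cur = conn.execute(
            "UPDATE inventory SET used = used + ? "
            "WHERE provider_id = ? AND used + ? <= total",
            (amount, provider_id, amount),
        )
        # Exactly one row changes only when the capacity check passed.
        return cur.rowcount == 1


# Hypothetical schema: one row per resource provider.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory ("
    "provider_id INTEGER PRIMARY KEY, "
    "total INTEGER NOT NULL, "
    "used INTEGER NOT NULL)"
)
conn.execute("INSERT INTO inventory VALUES (1, 8, 0)")
conn.commit()

first = claim(conn, 1, 6)   # fits within the 8 available units
second = claim(conn, 1, 6)  # would overcommit, so nothing is written
used = conn.execute(
    "SELECT used FROM inventory WHERE provider_id = 1"
).fetchone()[0]
```

Because the check and the write happen in one statement, a second scheduler 
process running the same claim concurrently cannot overcommit the provider, 
and no message from the compute node is needed to record the consumption.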

> A point of view from my analysis in comparing three schedulers (before
> real experiment):
> 1. Performance: The performance bottleneck of the resource-provider and
> legacy schedulers comes from the centralized db and scheduler cache
> refreshing.

You must first prove that there is a bottleneck with the 
resource-provider scheduler.

> It can be alleviated by changing to a stand-alone high
> performance database.

It doesn't need to be high-performance at all. In my benchmarks, a 
small-sized stock MySQL database server is able to fulfill thousands of 
placement queries and claim transactions per minute using completely 
isolated non-shared, non-caching scheduler processes.

> And the cache refreshing is designed to be
> replaced by direct SQL queries according to the resource-provider
> scheduler spec [2].

Yes, this is correct.

> The performance bottleneck of the shared-state scheduler
> may come from the overwhelming update messages; it can also be
> alleviated by changing to a stand-alone distributed message queue and by
> using the “MessagePipe” to merge messages.

In terms of the number of messages used in each design, I see the 
following relationship:

resource-providers < legacy < shared-state-scheduler

Would you agree with that?

The resource-providers proposal actually uses no update messages at all 
(except in the abnormal case of a compute node failing to start the 
resources that had previously been claimed by the scheduler). All 
updates are done in a single database transaction when the claim is made.

The legacy scheduler has each compute node sending an update message
(actually it's a database update in the form of ComputeNode.save() that
is done at the completion of the local nova.compute.claims.Claim()
context manager). In addition to these update messages, the legacy
scheduler has a problem with retries (because the schedulers operate on
non-fresh data when there is more than one scheduler process and two of
them make the same placement decision).
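
The retry problem can be shown with a toy simulation. This is only a 
sketch, not nova code: the host names, the RAM figures and the 
`pick_host()` helper are invented, and each "scheduler" is just a 
dictionary snapshot taken before either placement is applied.

```python
def pick_host(view, needed):
    """Return the host with the most free RAM that fits `needed`, or None."""
    candidates = {host: free for host, free in view.items() if free >= needed}
    if not candidates:
        return None
    # Sorting first makes the tie-break deterministic.
    return max(sorted(candidates), key=candidates.get)


# Authoritative free RAM in MB (hypothetical numbers).
actual = {"node1": 4096, "node2": 2048}

# Two scheduler processes each hold their own, soon-to-be-stale, snapshot.
view_a = dict(actual)
view_b = dict(actual)

retries = 0
for view in (view_a, view_b):
    host = pick_host(view, 3072)
    if actual[host] >= 3072:
        actual[host] -= 3072   # the compute node accepts the claim
    else:
        retries += 1           # the claim fails on the node: reschedule
```

Both snapshots rank node1 first, so the second request lands on a host 
that no longer has room and has to be retried.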

The shared-state scheduler has the most update messages. It
sends an update message to each scheduler in the system every time 
anything at all happens on the compute node, in addition to messages 
involving claims -- sending, confirming and timing them out -- all of 
which affect each scheduler process' state cache.
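
A back-of-the-envelope model of the three descriptions above, just to show 
how the counts scale. Everything here is an assumption for illustration: 
the `update_messages()` helper, the parameter names and the example 
numbers are made up, retries are ignored for the legacy case, and the 
shared-state case counts only two claim messages (send and confirm) per 
claim.

```python
def update_messages(design, claims, node_events, schedulers, failed_claims=0):
    """Rough count of update messages for a given workload (toy model)."""
    if design == "resource-providers":
        # Only the abnormal case of a failed claim produces a message.
        return failed_claims
    if design == "legacy":
        # One ComputeNode.save()-style update per completed claim.
        return claims
    if design == "shared-state":
        # Every node event is sent to every scheduler, plus claim traffic.
        return node_events * schedulers + 2 * claims
    raise ValueError("unknown design: %s" % design)


# Hypothetical workload: 100 claims, 150 node-side events, 4 schedulers.
workload = dict(claims=100, node_events=150, schedulers=4, failed_claims=1)
rp = update_messages("resource-providers", **workload)
legacy = update_messages("legacy", **workload)
shared = update_messages("shared-state", **workload)
```

With these made-up inputs the ordering matches the relationship above; the 
absolute numbers mean nothing beyond that.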

> 2. Final decision accuracy: I think the accuracy of the final decision
> is high in all three schedulers, because until now the consistent
> resource view and the final resource consumption with claims are all in
> the same place. It’s the resource trackers in the shared-state scheduler and
> legacy scheduler, and it’s the resource-provider db in the resource-provider
> scheduler.

Agreed, I don't believe the final decision accuracy will be affected 
much by the three designs. It's the speed by which the decision can be 
reached and the concurrency at which placement decisions can be made 
that are the differing metrics we are measuring.

> 3. Scheduler decision accuracy: IMO the order of accuracy of a single
> schedule decision is resource-provider > shared-state >> legacy
> scheduler. The resource-provider scheduler can get the accurate resource
> view directly from the db. The shared-state scheduler is getting the most
> accurate resource view by constantly collecting updates from resource
> trackers and by tracking the scheduler claims from schedulers to RTs.
> The legacy scheduler’s decision is the worst because it doesn’t track its
> claims and gets its resource views from compute node records, which are not
> that accurate.

I don't see how the shared-state scheduler is getting the most accurate 
resource view. It is only in extreme circumstances that the 
resource-provider scheduler's view of the resources in a system (all of 
which is stored without caching in the database) would differ from the 
"actual" inventory on a compute node.

> 4. Design goal difference:
> The fundamental design goals of the two new schedulers are different. Copying
> my views from [2], I think it is the choice between “the loose
> distributed consistency with retries” and “the strict centralized
> consistency with locks”.

There are a couple of other things that I believe we should be documenting,
considering and measuring with regard to scheduler designs:

a) Debuggability

The ability of a system to be debugged and for requests to that system
to be diagnosed is a critical component to the benefit of a particular
system design. I'm hoping that, by removing a lot of the moving parts
from the legacy filter scheduler design (removing the caching, removing
the Python-side filtering and weighing, removing the interval during
which placement decisions can conflict, removing the cost and frequency
of retry operations), the resource-provider scheduler design will
become simpler for operators to use.

b) Simplicity

Goes to the above point about debuggability, but I've always tried to 
follow the mantra that the best software design is not when you've added 
the last piece to it, but rather when you've removed the last piece from 
it and still have a functioning and performant system. Having a 
scheduler that can tackle the process of tracking resources, deciding on 
placement, and claiming those resources instead of playing an intricate 
dance of keeping state caches valid will, IMHO, lead to a better scheduler.

> As can be seen in the illustrations [1], the main compatibility issue
> between the shared-state and resource-provider schedulers is caused by the
> different location of claim/consumption and the assumed consistent
> resource view. IMO unless the claims are allowed to happen in both
> places (resource tracker and resource-provider db), it seems difficult to
> make the shared-state and resource-provider schedulers work together.

Yes, I don't see the two approaches being particularly compatible for 
the reason you state above.

That said, what we've discussed is having a totally new scheduler 
RESTful API that would do the claims in the scheduler 
(claim_resources()) and leave the existing select_destinations() call 
as-is to allow some deprecation and fallback if everything goes 
terribly, terribly wrong.


> [1]
> https://docs.google.com/document/d/1iNUkmnfEH3bJHDSenmwE4A1Sgm3vnqa6oZJWtTumAkw/edit?usp=sharing
> [2] https://review.openstack.org/#/c/271823/
> Regards,
> -Yingxin
> *From:*Sylvain Bauza [mailto:sbauza at redhat.com]
> *Sent:* Monday, February 15, 2016 9:48 PM
> *To:* OpenStack Development Mailing List (not for usage questions)
> <openstack-dev at lists.openstack.org>
> *Subject:* Re: [openstack-dev] [nova] A prototype implementation towards
> the "shared state scheduler"
> Le 15/02/2016 10:48, Cheng, Yingxin a écrit :
>     Thanks Sylvain,
>     1. The below ideas will be extended to a spec ASAP.
> Nice, looking forward to it then :-)
>     2. Thanks for raising concerns I had not thought of yet; they will
>     be in the spec soon.
>     3. Let me copy my thoughts from another thread about the integration
>     with resource-provider:
>     The idea is that “Only the compute node knows its own final
>     compute-node resource view” or “The accurate resource view only
>     exists at the place where it is actually consumed.” I.e., the
>     incremental updates can only come from the actual “consumption”
>     action, no matter where it is (e.g. compute node, storage service,
>     network service, etc.). Borrowing the terms from resource-provider,
>     a compute node can maintain its accurate version of the
>     “compute-node-inventory” cache, and can send incremental updates
>     because it actually consumes compute resources. Furthermore, a storage
>     service can also maintain an accurate version of the “storage-inventory”
>     cache and send incremental updates if it also consumes storage
>     resources. If there are central services in charge of consuming all
>     the resources, the accurate cache and updates must come from them.
> That is one of the things I'd like to see in your spec, and how you
> could interact with the new model.
> Thanks,
> -Sylvain
>     Regards,
>     -Yingxin
>     *From:*Sylvain Bauza [mailto:sbauza at redhat.com]
>     *Sent:* Monday, February 15, 2016 5:28 PM
>     *To:* OpenStack Development Mailing List (not for usage questions)
>     <openstack-dev at lists.openstack.org>
>     <mailto:openstack-dev at lists.openstack.org>
>     *Subject:* Re: [openstack-dev] [nova] A prototype implementation
>     towards the "shared state scheduler"
>     Le 15/02/2016 06:21, Cheng, Yingxin a écrit :
>         Hi,
>         I’ve uploaded a prototype
>         https://review.openstack.org/#/c/280047/ to verify its design
>         goals of accuracy, performance, reliability and compatibility
>         improvements. It will also be an Austin Summit session if
>         selected:
>         https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316
>         I want to gather opinions about this idea:
>         1. Is it possible for this feature to be accepted in the Newton release?
>     Such a feature requires a spec file to be written
>     http://docs.openstack.org/developer/nova/process.html#how-do-i-get-my-code-merged
>     Ideally, I'd like to see your ideas below written in that spec file,
>     as that would be the best way to discuss the design.
>         2. Suggestions to improve its design and compatibility.
>     I don't want to go into details here (that's rather the goal of the
>     spec), but my biggest concerns when reviewing the spec would be:
>       - how this can meet the OpenStack mission statement (i.e. a
>     ubiquitous solution that would be easy to install and massively
>     scalable)
>       - how this can be integrated with the existing (filters, weighers)
>     to provide a clean and simple path for operators to upgrade
>       - how this can support rolling upgrades (old computes
>     sending updates to a new scheduler)
>       - how we can test it
>       - whether we can make the feature optional for operators
>         3. Possibilities to integrate with resource-provider bp series:
>         I know resource-provider is the major direction of Nova
>         scheduler, and there will be fundamental changes in the future,
>         especially according to the bp
>         https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
>         However, this prototype proposes a much faster and compatible
>         way to make schedule decisions based on scheduler caches. The
>         in-memory decisions are made at the same speed as in the caching
>         scheduler, but the caches are kept consistent with compute nodes
>         as quickly as possible without db refreshing.
>     That's the key point, thanks for noticing our priorities. So, you
>     know that our resource modeling is drastically subject to change in
>     Mitaka and Newton. That is the new game, so I'd love to see how you
>     plan to interact with that.
>     Ideally, I'd appreciate it if Jay Pipes, Chris Dent and you could share
>     your ideas because all of you have great ideas to improve a
>     currently frustrating solution.
>     -Sylvain
>         Here is the detailed design of the mentioned prototype:
>          >>----------------------------
>         Background:
>         The host state cache maintained by the host manager is the scheduler's
>         resource view during schedule decision making. It is updated
>         whenever a request is received[1], and all the compute node
>         records are retrieved from the db every time. There are several
>         problems with this update model, shown in experiments[3]:
>         1. Performance: The scheduler performance is largely affected by
>         db access when retrieving compute node records. The db block time
>         of a single request is 355ms on average in a deployment of 3
>         compute nodes, compared with only 3ms for in-memory
>         decision-making. Imagine there could be up to 1k nodes, or even
>         10k nodes in the future.
>         2. Race conditions: This is not only a parallel-scheduler
>         problem, but also a problem when using only one scheduler. The
>         detailed analysis of the one-scheduler problem is located in the bug
>         analysis[2]. In short, there is a gap between the time the scheduler
>         makes a decision in the host state cache and the time the
>         compute node updates its in-db resource record according to that
>         decision in the resource tracker. Because of this gap, a recent
>         scheduler resource consumption in the cache can be lost and
>         overwritten by compute node data, resulting in cache inconsistency
>         and unexpected retries. In a one-scheduler experiment using a 3-node
>         deployment, 7 retries were recorded out of 31 concurrent schedule
>         requests, resulting in 22.6% extra performance overhead.
>         3. Parallel scheduler support: The design of the filter scheduler
>         leads to an "even worse" performance result when using parallel
>         schedulers. In the same experiment with 4 schedulers on separate
>         machines, the average db block time increases to 697ms per
>         request and there are 16 retries out of 31 schedule requests,
>         i.e. 51.6% extra overhead.
>         Improvements:
>         This prototype solves the issues mentioned above by implementing
>         a new update model for the scheduler host state cache. Instead of
>         refreshing caches from the db, every compute node maintains its
>         accurate version of the host state cache, updated by the resource
>         tracker, and sends incremental updates directly to schedulers.
>         So the scheduler caches are synchronized to the correct state as
>         soon as possible with the lowest overhead. Also, the scheduler will
>         send a resource claim with its decision to the target compute
>         node. The compute node can decide immediately whether the resource
>         claim is successful using its local host state cache and send
>         a response back ASAP. With all the claims tracked from
>         schedulers to compute nodes, no false overwrites will happen,
>         and thus the gaps between the scheduler cache and real compute node
>         states are minimized. The benefits are obvious in the recorded
>         experiments[3] compared with the caching scheduler and filter scheduler:
>         1. There is no db block time during scheduler decision making;
>         the average decision time per request is about 3ms in both
>         single and multiple scheduler scenarios, which is equal to the
>         in-memory decision time of the filter scheduler and caching scheduler.
>         2. Since the scheduler claims are tracked and the "false
>         overwrite" is eliminated, there should be 0 retries in a
>         one-scheduler deployment, as proven in the experiment. Thanks to
>         the quick claim response implementation, there are only 2
>         retries out of 31 requests in the 4-scheduler experiment.
>         3. All the filtering and weighing algorithms are compatible
>         because the data structure of HostState is unchanged. In fact,
>         this prototype even supports the filter scheduler running at the
>         same time (already tested). As for other operations with resource
>         changes, such as migration, resizing or shelving, they make
>         claims in the resource tracker directly and update the compute
>         node host state immediately without major changes.
>         Extra features:
>         More effort has been made to better adjust the implementation to
>         real-world scenarios, such as network issues, services going
>         down unexpectedly, overwhelming messages, etc.:
>         1. The communication between schedulers and compute nodes consists
>         only of casts; there are no RPC calls and thus no blocking during
>         scheduling.
>         2. All updates from nodes to schedulers are labelled with an
>         incremental seed, so any message reordering, loss or duplication
>         due to network issues can be detected by MessageWindow
>         immediately. The inconsistent cache can be detected and
>         refreshed correctly.
>         3. The overwhelming messages are compressed by MessagePipe in
>         its async mode. There is no need to send all the messages one by
>         one in the MQ; they can be merged before being sent to schedulers.
>         4. When a new service is up or recovered, it sends notifications
>         to all known remotes for quick cache synchronization, even
>         before the service record is available in the db. And if a remote
>         service is unexpectedly down according to service group records,
>         no more messages will be sent to it. The ComputeFilter is also
>         removed because of this feature; the scheduler can detect remote
>         compute nodes by itself.
>         5. In fact the claim tracking is not only from schedulers to
>         compute nodes, but also from the compute-node host state to the
>         resource tracker. One reason is that there is still a gap
>         between the time a claim is acknowledged by the compute-node host
>         state and the time the claim succeeds in the resource tracker. It
>         is necessary to track those unhandled claims to keep the host state
>         accurate. The second reason is to separate schedulers from compute
>         nodes and resource trackers. The scheduler only exports the limited
>         interfaces `update_from_compute` and `handle_rt_claim_failure` to
>         the compute service and the RT, so testing and reuse are easier
>         with clear boundaries.
>         TODOs:
>         There are still many features to be implemented. The most
>         important are unit tests and incremental updates to PCI and NUMA
>         resources; all of them are marked inline.
>         References:
>         [1]
>         https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L104
>         [2] https://bugs.launchpad.net/nova/+bug/1341420/comments/24
>         [3] http://paste.openstack.org/show/486929/
>         ----------------------------<<
>         The original commit history of this prototype is located in
>         https://github.com/cyx1231st/nova/commits/shared-scheduler
>         For instructions to install and test this prototype, please
>         refer to the commit message of
>         https://review.openstack.org/#/c/280047/
>         Regards,
>         -Yingxin
>         __________________________________________________________________________
>         OpenStack Development Mailing List (not for usage questions)
>         Unsubscribe:OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>         <mailto:OpenStack-dev-request at lists.openstack.org?subject:unsubscribe>
>         http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
