[openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

Sylvain Bauza sbauza at redhat.com
Mon Feb 15 13:47:41 UTC 2016

On 15/02/2016 10:48, Cheng, Yingxin wrote:
> Thanks Sylvain,
> 1. The below ideas will be extended to a spec ASAP.

Nice, looking forward to it then :-)
> 2. Thanks for raising concerns I had not thought of yet; they will be 
> covered in the spec soon.
> 3. Let me copy my thoughts from another thread about the integration 
> with resource-provider:
> The idea is that “only the compute node knows its own final 
> compute-node resource view”, or “the accurate resource view only exists 
> at the place where it is actually consumed.” I.e., the incremental 
> updates can only come from the actual “consumption” action, no matter 
> where it happens (e.g. compute node, storage service, network service, 
> etc.). Borrowing the terms from resource-provider, a compute node can 
> maintain its own accurate version of the “compute-node-inventory” cache 
> and send incremental updates, because it actually consumes compute 
> resources. Likewise, a storage service can maintain an accurate version 
> of the “storage-inventory” cache and send incremental updates if it 
> consumes storage resources. If there are central services in charge of 
> consuming all the resources, the accurate cache and updates must come 
> from them.

That is one of the things I'd like to see in your spec, and how you 
could interact with the new model.
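
To make the quoted idea concrete, here is a minimal sketch of "the consumer owns the accurate view and emits incremental updates". It is not code from the prototype: the class names `ConsumerInventory` and `SchedulerView`, the resource keys, and the shape of the update dict are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConsumerInventory:
    """Authoritative view held by the service that consumes the resources.

    Hypothetical class, not part of nova or the prototype.
    """
    free: dict      # e.g. {"vcpus": 4, "memory_mb": 8192}
    seq: int = 0    # number of updates emitted so far

    def claim(self, request):
        """Try to consume `request`; return an incremental update or None."""
        if any(self.free.get(k, 0) < v for k, v in request.items()):
            return None  # claim rejected, local state is left unchanged
        for k, v in request.items():
            self.free[k] -= v
        self.seq += 1
        return {"seq": self.seq, "delta": {k: -v for k, v in request.items()}}


@dataclass
class SchedulerView:
    """Scheduler-side copy that is only changed by applying received deltas."""
    free: dict

    def apply(self, update):
        for k, d in update["delta"].items():
            self.free[k] = self.free.get(k, 0) + d


node = ConsumerInventory(free={"vcpus": 4, "memory_mb": 8192})
view = SchedulerView(free=dict(node.free))

update = node.claim({"vcpus": 2, "memory_mb": 2048})
view.apply(update)                      # scheduler catches up from the delta
rejected = node.claim({"vcpus": 8})     # more than is free: no update emitted
```

The same pattern would apply to a storage or network service holding its own inventory; only the party that performs the consumption produces deltas.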

> Regards,
> -Yingxin
> *From:*Sylvain Bauza [mailto:sbauza at redhat.com]
> *Sent:* Monday, February 15, 2016 5:28 PM
> *To:* OpenStack Development Mailing List (not for usage questions) 
> <openstack-dev at lists.openstack.org>
> *Subject:* Re: [openstack-dev] [nova] A prototype implementation 
> towards the "shared state scheduler"
> On 15/02/2016 06:21, Cheng, Yingxin wrote:
>     Hi,
>     I’ve uploaded a prototype https://review.openstack.org/#/c/280047/
>     to demonstrate its design goals of improved accuracy, performance,
>     reliability and compatibility. It will also be an Austin Summit
>     session if selected:
>     https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316
>     I want to gather opinions about this idea:
>     1. Could this feature be accepted in the Newton release?
> Such a feature requires a spec file to be written: 
> http://docs.openstack.org/developer/nova/process.html#how-do-i-get-my-code-merged
> Ideally, I'd like to see your ideas below written in that spec file, as 
> that would be the best place to discuss the design.
>     2. Suggestions to improve its design and compatibility.
> I don't want to go into details here (that is rather the goal of the 
> spec), but my biggest concerns when reviewing the spec would be:
>  - how this can meet the OpenStack mission statement (i.e. a ubiquitous 
> solution that is easy to install and massively scalable)
>  - how this can be integrated with the existing filters and weighers to 
> provide a clean and simple upgrade path for operators
>  - how this can support rolling upgrades (old computes sending 
> updates to a new scheduler)
>  - how we can test it
>  - whether the feature can be optional for operators
>     3. Possibilities to integrate with resource-provider bp series: I
>     know resource-provider is the major direction of Nova scheduler,
>     and there will be fundamental changes in the future, especially
>     according to the bp
>     https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
>     However, this prototype proposes a much faster and more compatible
>     way to make scheduling decisions based on scheduler caches. The
>     in-memory decisions are made at the same speed as with the caching
>     scheduler, but the caches are kept consistent with compute nodes
>     as quickly as possible without db refreshing.
> That's the key point, thanks for noticing our priorities. So, you know 
> that our resource modeling is subject to drastic change in Mitaka 
> and Newton. That is the new game, so I'd love to see how you plan to 
> interact with that.
> Ideally, I'd appreciate it if Jay Pipes, Chris Dent and you could share 
> your ideas, because all of you have great ideas to improve the 
> current, frustrating solution.
> -Sylvain
>     Here is the detailed design of the mentioned prototype:
>     >>----------------------------
>     Background:
>     The host state cache maintained by host manager is the scheduler
>     resource view during schedule decision making. It is updated
>     whenever a request is received[1], and all the compute node
>     records are retrieved from db every time. There are several
>     problems in this update model, proven in experiments[3]:
>     1. Performance: The scheduler performance is largely affected by
>     db access when retrieving compute node records. The db block time
>     of a single request is 355ms on average in a deployment of 3
>     compute nodes, compared with only 3ms for in-memory
>     decision-making. Imagine a deployment with 1k nodes, or even 10k
>     nodes in the future.
>     2. Race conditions: This is not only a parallel-scheduler problem,
>     but also a problem when using only one scheduler. The detailed
>     analysis of the one-scheduler problem is in the bug analysis[2].
>     In short, there is a gap between the moment the scheduler makes a
>     decision in the host state cache and the moment the compute node
>     updates its in-db resource record according to that decision in
>     the resource tracker. Because of this gap, a recent scheduler
>     resource consumption in the cache can be lost and overwritten by
>     compute node data, resulting in cache inconsistency and unexpected
>     retries. In a one-scheduler experiment using a 3-node deployment,
>     7 retries were recorded out of 31 concurrent schedule requests,
>     resulting in 22.6% extra performance overhead.
>     3. Parallel scheduler support: The design of filter scheduler
>     leads to an "even worse" performance result using parallel
>     schedulers. In the same experiment with 4 schedulers on separate
>     machines, the average db block time is increased to 697ms per
>     request and there are 16 retries out of 31 schedule requests,
>     namely 51.6% extra overhead.
>     Improvements:
>     This prototype solves the issues mentioned above by implementing a
>     new update model for the scheduler host state cache. Instead of
>     refreshing caches from db, every compute node maintains its own
>     accurate version of the host state cache, updated by the resource
>     tracker, and sends incremental updates directly to schedulers. So
>     the scheduler caches are synchronized to the correct state as soon
>     as possible with the lowest overhead. Also, the scheduler sends a
>     resource claim with its decision to the target compute node. The
>     compute node can decide immediately whether the resource claim is
>     successful using its local host state cache, and send a response
>     back ASAP. With all claims tracked from schedulers to compute
>     nodes, no false overwrites will happen, and thus the gaps between
>     the scheduler cache and real compute node states are minimized.
>     The benefits are clear in the recorded experiments[3] when
>     compared with the caching scheduler and filter scheduler:
>     1. There is no db block time during scheduler decision making; the
>     average decision time per request is about 3ms in both single and
>     multiple scheduler scenarios, which is equal to the in-memory
>     decision time of the filter scheduler and caching scheduler.
>     2. Since the scheduler claims are tracked and the "false
>     overwrite" is eliminated, there should be 0 retries in a
>     one-scheduler deployment, as proven in the experiment. Thanks to
>     the quick claim response implementation, there are only 2
>     retries out of 31 requests in the 4-scheduler experiment.
>     3. All the filtering and weighing algorithms are compatible
>     because the data structure of HostState is unchanged. In fact,
>     this prototype even supports the filter scheduler running at the
>     same time (already tested). Other operations that change
>     resources, such as migration, resizing or shelving, make claims in
>     the resource tracker directly and update the compute node host
>     state immediately, without major changes.
>     Extra features:
>     More effort has been made to adapt the implementation to
>     real-world scenarios, such as network issues, services going down
>     unexpectedly, and floods of messages:
>     1. Communication between schedulers and compute nodes uses only
>     casts; there are no RPC calls and thus no blocking during
>     scheduling.
>     2. All updates from nodes to schedulers are labelled with an
>     incremental seed, so any message reordering, loss or duplication
>     due to network issues can be detected by MessageWindow
>     immediately. The inconsistent cache can be detected and refreshed
>     correctly.
>     3. Floods of messages are compressed by MessagePipe in its
>     async mode. There is no need to send all the messages one by one
>     over the MQ; they can be merged before being sent to schedulers.
>     4. When a new service is up or recovered, it sends notifications
>     to all known remotes for quick cache synchronization, even before
>     the service record is available in db. And if a remote service is
>     unexpectedly down according to service group records, no more
>     messages will be sent to it. The ComputeFilter is also removed
>     because of this feature; the scheduler can detect remote compute
>     nodes by itself.
>     5. In fact, claims are tracked not only from schedulers to
>     compute nodes, but also from the compute-node host state to the
>     resource tracker. One reason is that there is still a gap between
>     the moment a claim is acknowledged by the compute-node host state
>     and the moment the claim succeeds in the resource tracker. It is
>     necessary to track those unhandled claims to keep the host state
>     accurate. The second reason is to separate schedulers from compute
>     nodes and resource trackers. The scheduler only exports the
>     limited interfaces `update_from_compute` and
>     `handle_rt_claim_failure` to the compute service and the RT, so
>     testing and reuse are easier, with clear boundaries.
>     TODOs:
>     There are still many features to be implemented. The most
>     important are unit tests and incremental updates to PCI and NUMA
>     resources; all of them are marked inline.
>     References:
>     [1]
>     https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L104
>     [2] https://bugs.launchpad.net/nova/+bug/1341420/comments/24
>     [3] http://paste.openstack.org/show/486929/
>     ----------------------------<<
>     The original commit history of this prototype is located in
>     https://github.com/cyx1231st/nova/commits/shared-scheduler
>     For instructions to install and test this prototype, please refer
>     to the commit message of https://review.openstack.org/#/c/280047/
>     Regards,
>     -Yingxin
>     __________________________________________________________________________
>     OpenStack Development Mailing List (not for usage questions)
>     Unsubscribe:OpenStack-dev-request at lists.openstack.org?subject:unsubscribe  <mailto:OpenStack-dev-request at lists.openstack.org?subject:unsubscribe>
>     http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
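
For readers following point 2 under "Extra features" in the quoted design: the detection of reordered, lost or duplicated updates could work roughly as sketched below. This is not the prototype's `MessageWindow`, whose implementation is not shown in this thread. `SeqWindow`, its method names, and the assumption that the "incremental seed" is a counter increasing by one per update are all hypothetical.

```python
class SeqWindow:
    """Classify incoming updates by sequence number (hypothetical helper)."""

    def __init__(self, start=0):
        # Next sequence number expected from the remote compute node.
        self.expected = start + 1

    def check(self, seq):
        """Return 'ok', 'duplicate' or 'gap' for an incoming sequence number."""
        if seq == self.expected:
            self.expected += 1
            return "ok"             # in order: safe to apply the update
        if seq < self.expected:
            return "duplicate"      # already applied or arrived late: ignore
        return "gap"                # an update was lost or reordered

    def reset(self, seq):
        """Resynchronize after a full cache refresh taken at sequence `seq`."""
        self.expected = seq + 1


window = SeqWindow()
results = [window.check(s) for s in (1, 2, 2, 5)]
window.reset(5)                     # e.g. after refreshing the whole cache
after_refresh = window.check(6)
```

On a `gap` result the receiver cannot trust its cache, so in this sketch it would request a full refresh and then call `reset` with the sequence number of the refreshed state.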
