[openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"
Sylvain Bauza
sbauza at redhat.com
Mon Feb 15 13:47:41 UTC 2016
On 15/02/2016 10:48, Cheng, Yingxin wrote:
>
> Thanks Sylvain,
>
> 1. The below ideas will be extended to a spec ASAP.
>
Nice, looking forward to it then :-)
>
> 2. Thanks for raising concerns I hadn't thought of yet; they will be
> addressed in the spec soon.
>
> 3. Let me copy my thoughts from another thread about the integration
> with resource-provider:
>
> The idea is that “only the compute node knows its own final compute-node
> resource view”, or in other words “the accurate resource view only exists
> at the place where the resource is actually consumed.” I.e., incremental
> updates can only come from the actual “consumption” action, no matter
> where it happens (e.g. compute node, storage service, network service,
> etc.). Borrowing the terms from resource-provider, each compute node can
> maintain an accurate version of its “compute-node-inventory” cache and
> can send incremental updates because it actually consumes compute
> resources; similarly, a storage service can maintain an accurate version
> of its “storage-inventory” cache and send incremental updates if it is
> the one consuming storage resources. If there are central services in
> charge of consuming all the resources, then the accurate cache and
> updates must come from them.
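>
> As a rough illustration (all names here are hypothetical, not the actual
> resource-provider API), the consumer of a resource would be the only
> party emitting inventory updates for it:
>
>     class ComputeNodeInventory(object):
>         def __init__(self, node_uuid, notifier):
>             self.node_uuid = node_uuid
>             self.notifier = notifier   # e.g. a cast proxy to the schedulers
>             self.seq = 0               # incremental seed for each update
>
>         def consume(self, vcpus, ram_mb, disk_gb):
>             """Called locally when a claim actually succeeds."""
>             self.seq += 1
>             delta = {'node': self.node_uuid, 'seq': self.seq,
>                      'vcpus': -vcpus, 'ram_mb': -ram_mb,
>                      'disk_gb': -disk_gb}
>             # Only the consumer sends the update; schedulers (or any
>             # central cache) merely apply it to their cached view.
>             self.notifier.send_inventory_update(delta)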
>
That is one of the things I'd like to see in your spec, along with how you
could interact with the new model.
Thanks,
-Sylvain
> Regards,
>
> -Yingxin
>
> *From:* Sylvain Bauza [mailto:sbauza at redhat.com]
> *Sent:* Monday, February 15, 2016 5:28 PM
> *To:* OpenStack Development Mailing List (not for usage questions)
> <openstack-dev at lists.openstack.org>
> *Subject:* Re: [openstack-dev] [nova] A prototype implementation
> towards the "shared state scheduler"
>
> On 15/02/2016 06:21, Cheng, Yingxin wrote:
>
> Hi,
>
> I’ve uploaded a prototype, https://review.openstack.org/#/c/280047/,
> to demonstrate its design goals of accuracy, performance, reliability
> and compatibility improvements. It will also be an Austin Summit
> session if elected:
> https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316
>
>
> I want to gather opinions about this idea:
>
> 1. Is it possible for this feature to be accepted in the Newton release?
>
>
> Such a feature requires a spec file to be written:
> http://docs.openstack.org/developer/nova/process.html#how-do-i-get-my-code-merged
>
> Ideally, I'd like to see the ideas below written in that spec file, as
> that would be the best place to discuss the design.
>
>
>
> 2. Suggestions to improve its design and compatibility.
>
>
> I don't want to go into details here (that's rather the goal of the
> spec), but my biggest concerns when reviewing the spec would be:
> - how this can meet the OpenStack mission statement (i.e. a ubiquitous
> solution that is easy to install and massively scalable)
> - how this can be integrated with the existing pieces (filters, weighers)
> to provide a clean and simple upgrade path for operators
> - how this can support rolling upgrades (old computes sending
> updates to a new scheduler)
> - how we can test it
> - whether the feature can be optional for operators
>
>
>
> 3. Possibilities to integrate with the resource-provider bp series: I
> know resource-provider is the major direction of the Nova scheduler,
> and there will be fundamental changes in the future, especially
> according to the bp
> https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
> However, this prototype proposes a much faster and still compatible way
> to make scheduling decisions based on scheduler caches. The
> in-memory decisions are made at the same speed as in the caching
> scheduler, but the caches are kept consistent with the compute nodes
> as quickly as possible without refreshing from the db.
>
>
> That's the key point, thanks for noticing our priorities. So, you know
> that our resource modeling is drastically subject to change in Mitaka
> and Newton. That is the new game, so I'd love to see how you plan to
> interact with that.
> Ideally, I'd appreciate it if Jay Pipes, Chris Dent and you could share
> your ideas, because all of you have great ideas for improving a
> currently frustrating solution.
>
> -Sylvain
>
>
>
> Here is the detailed design of the mentioned prototype:
>
> >>----------------------------
>
> Background:
>
> The host state cache maintained by the host manager is the scheduler's
> resource view during schedule decision making. It is updated
> whenever a request is received[1], and all the compute node
> records are retrieved from the db every time. There are several
> problems with this update model, proven in experiments[3]:
>
> 1. Performance: Scheduler performance is largely affected by
> db access when retrieving compute node records. The db block time of
> a single request is 355ms on average in a deployment of 3
> compute nodes, compared with only 3ms for the in-memory
> decision-making. Imagine a deployment with 1k or even 10k
> nodes in the future.
>
> 2. Race conditions: This is not only a parallel-scheduler problem,
> but also a problem even with only one scheduler. A detailed analysis
> of the one-scheduler problem is in the bug analysis[2]. In short,
> there is a gap between the moment the scheduler makes a decision in
> the host state cache and the moment the compute node updates its
> in-db resource record according to that decision in the resource
> tracker. Because of this gap, a recent scheduler resource consumption
> in the cache can be lost and overwritten by compute node data,
> resulting in cache inconsistency and unexpected retries. In a
> one-scheduler experiment using a 3-node deployment, 7 retries out of
> 31 concurrent schedule requests were recorded, resulting in 22.6%
> extra performance overhead.
>
> 3. Parallel scheduler support: The design of the filter scheduler
> leads to an even worse performance result with parallel
> schedulers. In the same experiment with 4 schedulers on separate
> machines, the average db block time increases to 697ms per
> request and there are 16 retries out of 31 schedule requests,
> i.e. 51.6% extra overhead.
>
> Improvements:
>
> This prototype solves the issues mentioned above by implementing a
> new update model for the scheduler host state cache. Instead of
> refreshing caches from the db, every compute node maintains its own
> accurate version of the host state cache, updated by the resource
> tracker, and sends incremental updates directly to the schedulers. So
> the scheduler caches are synchronized to the correct state as soon
> as possible with the lowest overhead. Also, the scheduler sends a
> resource claim along with its decision to the target compute node.
> The compute node can decide immediately whether the claim is
> successful based on its local host state cache and sends a response
> back ASAP (a minimal sketch of this claim flow follows the list of
> benefits below). With all claims tracked from schedulers to compute
> nodes, no false overwrites can happen, and thus the gap between the
> scheduler cache and the real compute node state is minimized. The
> benefits are obvious in the recorded experiments[3], compared with
> the caching scheduler and the filter scheduler:
>
> 1. There is no db block time during scheduler decision making; the
> average decision time per request is about 3ms in both the single-
> and multiple-scheduler scenarios, which equals the in-memory
> decision time of the filter scheduler and the caching scheduler.
>
> 2. Since the scheduler claims are tracked and the "false
> overwrite" is eliminated, there should be 0 retries in a
> one-scheduler deployment, as proven in the experiment. Thanks to
> the quick claim-response implementation, there are only 2
> retries out of 31 requests in the 4-scheduler experiment.
>
> 3. All the filtering and weighing algorithms are compatible
> because the data structure of HostState is unchanged. In fact,
> this prototype even supports the filter scheduler running at the
> same time (already tested). Other operations that change resources,
> such as migration, resizing or shelving, make claims in the
> resource tracker directly and update the compute node host state
> immediately, so they need no major changes.
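>
> As promised above, here is a minimal, self-contained sketch of the
> claim flow. All names are hypothetical simplifications, not the
> prototype's actual classes, and the messaging between the two sides
> (casts in the prototype) is reduced to plain method calls:
>
>     import uuid
>
>     class HostState(object):
>         """Cached resource view of one compute node."""
>         def __init__(self, host, vcpus, ram_mb):
>             self.host = host
>             self.free_vcpus = vcpus
>             self.free_ram_mb = ram_mb
>
>         def fits(self, claim):
>             return (claim['vcpus'] <= self.free_vcpus and
>                     claim['ram_mb'] <= self.free_ram_mb)
>
>         def consume(self, claim):
>             self.free_vcpus -= claim['vcpus']
>             self.free_ram_mb -= claim['ram_mb']
>
>     class Scheduler(object):
>         def __init__(self, cache):
>             # {host: HostState}, kept fresh by incremental updates
>             # sent from the compute nodes, never refreshed from db.
>             self.cache = cache
>
>         def schedule(self, vcpus, ram_mb):
>             claim = {'claim_id': uuid.uuid4().hex,
>                      'vcpus': vcpus, 'ram_mb': ram_mb}
>             for state in self.cache.values():   # filtering/weighing elided
>                 if state.fits(claim):
>                     state.consume(claim)        # tentative local consume
>                     # In the prototype the claim is then *cast* to the
>                     # chosen host; nothing blocks here.
>                     return state.host, claim
>             raise Exception('NoValidHost')
>
>     class ComputeNode(object):
>         def __init__(self, state):
>             self.state = state                  # authoritative local view
>
>         def handle_claim(self, claim):
>             """Accept or reject immediately from the local cache."""
>             if not self.state.fits(claim):
>                 return False    # a failure is cast back, scheduler retries
>             self.state.consume(claim)
>             return True         # an incremental update is then cast
>                                 # to all schedulers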
>
> Extra features:
>
> More effort was made to adapt the implementation to real-world
> scenarios, such as network issues, services going down unexpectedly
> and overwhelming message volume, etc.:
>
> 1. The communication between schedulers and compute nodes uses only
> casts; there are no RPC calls and thus no blocking during scheduling.
>
> 2. All updates from nodes to schedulers are labelled with an
> incrementing sequence number (seed), so any message reordering, loss
> or duplication due to network issues can be detected by the
> MessageWindow immediately. The inconsistent cache can then be
> detected and refreshed correctly (see the first sketch after this
> list).
>
> 3. Overwhelming message volume is compressed by the MessagePipe in
> its async mode. There is no need to send all the messages one by one
> through the MQ; they can be merged before being sent to the
> schedulers (see the second sketch after this list).
>
> 4. When a new service comes up or recovers, it sends notifications
> to all known remotes for quick cache synchronization, even before
> the service record is available in the db. And if a remote service
> is unexpectedly down according to the service group records, no more
> messages will be sent to it. The ComputeFilter is also removed
> thanks to this feature, because the scheduler can detect remote
> compute nodes by itself.
>
> 5. In fact, claims are tracked not only from the schedulers to the
> compute nodes, but also from the compute-node host state to the
> resource tracker. One reason is that there is still a gap between
> the moment a claim is acknowledged by the compute-node host state
> and the moment the claim succeeds in the resource tracker. It is
> necessary to track those unhandled claims to keep the host state
> accurate. The second reason is to separate the schedulers from the
> compute node and resource trackers. The scheduler only exposes the
> limited interfaces `update_from_compute` and `handle_rt_claim_failure`
> to the compute service and the RT, so testing and reuse are easier
> with clear boundaries.
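>
> A rough sketch of the sequence-number check from point 2 above.
> MessageWindow is the prototype's name, but this minimal logic is only
> a guess at the idea, not the actual implementation:
>
>     class CacheOutOfSync(Exception):
>         """Raised when the incremental update stream cannot be trusted."""
>
>     class MessageWindow(object):
>         def __init__(self):
>             self.expected_seq = None
>
>         def check(self, seq):
>             """Return True if an update with this seed can be applied.
>
>             Any gap, duplicate or reordering raises CacheOutOfSync so
>             the caller can ask the compute node for a full refresh
>             instead of trusting the incremental stream.
>             """
>             if self.expected_seq is None:      # first update after (re)sync
>                 self.expected_seq = seq + 1
>                 return True
>             if seq == self.expected_seq:
>                 self.expected_seq += 1
>                 return True
>             raise CacheOutOfSync('expected seed %s, got %s'
>                                  % (self.expected_seq, seq))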
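>
> And a rough sketch of the merge-before-send idea from point 3 above,
> again with hypothetical names and the periodic async flush left out:
>
>     class MessagePipe(object):
>         def __init__(self, send_fn):
>             self.send_fn = send_fn   # e.g. a cast to the schedulers
>             self.pending = []
>
>         def push(self, update):
>             """Queue an incremental update instead of casting it at once."""
>             self.pending.append(update)
>
>         def flush(self):
>             """Merge all queued deltas into a single message and send it."""
>             if not self.pending:
>                 return
>             merged = {}
>             for update in self.pending:
>                 for key, delta in update.items():
>                     merged[key] = merged.get(key, 0) + delta
>             self.pending = []
>             self.send_fn(merged)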
>
> TODOs:
>
> There are still many features to be implemented, the most
> important being unit tests and incremental updates for PCI and NUMA
> resources; all of them are marked inline.
>
> References:
>
> [1]
> https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L104
>
>
> [2] https://bugs.launchpad.net/nova/+bug/1341420/comments/24
>
> [3] http://paste.openstack.org/show/486929/
>
> ----------------------------<<
>
> The original commit history of this prototype is located in
> https://github.com/cyx1231st/nova/commits/shared-scheduler
>
> For instructions to install and test this prototype, please refer
> to the commit message of https://review.openstack.org/#/c/280047/
>
> Regards,
>
> -Yingxin
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev