[openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

Boris Pavlovic boris at pavlovic.me
Mon Feb 15 07:40:37 UTC 2016


Yingxin,


Basically, what we implemented was the following:

- Scheduler consumes RPC updates from Computes
- Scheduler keeps world state in memory (and each message from a compute is
  treated as an incremental update)
- Incremental updates are shared across multiple scheduler instances
  (so one message from a compute is only consumed once)
- Schema-less host state (to be able to use a single scheduler service for
  all resources)

^ All this was done in a backward-compatible way and it was really easy to
migrate.
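
As a minimal sketch of the model in the list above (this is not the actual
no-db-scheduler code; `HostStateStore`, `apply_update` and the message layout
are hypothetical names chosen for illustration), a schema-less in-memory host
state fed by incremental updates might look like this:

```python
from collections import defaultdict


class HostStateStore:
    """In-memory world state kept by a scheduler (illustrative only)."""

    def __init__(self):
        # host name -> {resource name -> amount}. There is no fixed schema,
        # so any kind of resource can be tracked by the same service.
        self._hosts = defaultdict(dict)

    def apply_update(self, host, delta):
        """Apply one incremental update (a dict of resource deltas)."""
        state = self._hosts[host]
        for resource, change in delta.items():
            state[resource] = state.get(resource, 0) + change

    def get(self, host):
        """Return a copy of the current view of one host."""
        return dict(self._hosts[host])


store = HostStateStore()
# Hypothetical messages: an initial report, then one consumption delta.
store.apply_update("node1", {"free_ram_mb": 8192, "free_disk_gb": 100})
store.apply_update("node1", {"free_ram_mb": -2048})
```

Because the state is a plain dict of named amounts, the scheduler never needs
to know in advance which resource types a compute reports.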


If this had been accepted, we were planning to work on making the scheduler
independent of Nova (which is actually quite a simple task after those
changes) and moving that code outside of Nova.

So the solutions are quite similar overall.
I hope you have better luck getting yours merged upstream.


Best regards,
Boris Pavlovic

On Sun, Feb 14, 2016 at 11:08 PM, Cheng, Yingxin <yingxin.cheng at intel.com>
wrote:

> Thanks Boris, the idea is quite similar in “Do not have db accesses during
> scheduler decision making”, because db accesses introduce blocking during
> decision making, which is very bad for the lock-free design of the nova
> scheduler.
>
>
>
> Another important idea is that “Only the compute node knows its own final
> compute-node resource view” or “The accurate resource view only exists at
> the place where it is actually consumed.” I.e., the incremental updates can
> only come from the actual “consumption” action, no matter where it is (e.g.
> compute node, storage service, network service, etc.). Borrowing the terms
> from resource-provider, a compute node can maintain its accurate version of
> the “compute-node-inventory” cache and can send incremental updates because
> it actually consumes compute resources. Furthermore, a storage service can
> also maintain an accurate version of the “storage-inventory” cache and send
> incremental updates if it also consumes storage resources. If there are
> central services in charge of consuming all the resources, the accurate
> cache and updates must come from them.
>
>
>
> The third idea is “compatibility”. This prototype focuses on a very small
> scope by only introducing a new host_manager driver “shared_host_manager”
> with minor other changes. The driver can be changed back to “host_manager”
> very easily. It can also run with filter schedulers and caching schedulers.
> Most importantly, the filtering and weighing algorithms are kept unchanged.
> So more changes can be introduced later for the complete version of the
> “shared state scheduler”, because it can evolve gradually.
>
>
>
>
>
> Regards,
>
> -Yingxin
>
>
>
> *From:* Boris Pavlovic [mailto:boris at pavlovic.me]
> *Sent:* Monday, February 15, 2016 1:59 PM
> *To:* OpenStack Development Mailing List (not for usage questions) <
> openstack-dev at lists.openstack.org>
> *Subject:* Re: [openstack-dev] [nova] A prototype implementation towards
> the "shared state scheduler"
>
>
>
> Yingxin,
>
>
>
> This looks quite similar to the work of this bp:
>
> https://blueprints.launchpad.net/nova/+spec/no-db-scheduler
>
>
>
> It's really nice that somebody is still trying to push scheduler
> refactoring in this way.
>
> Thanks.
>
>
>
> Best regards,
>
> Boris Pavlovic
>
>
>
> On Sun, Feb 14, 2016 at 9:21 PM, Cheng, Yingxin <yingxin.cheng at intel.com>
> wrote:
>
> Hi,
>
>
>
> I’ve uploaded a prototype https://review.openstack.org/#/c/280047/ to
> verify its design goals of improved accuracy, performance, reliability and
> compatibility. It will also be an Austin Summit session if
> elected:
> https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316
>
>
>
> I want to gather opinions about this idea:
>
> 1. Is it possible for this feature to be accepted in the Newton release?
>
> 2. Suggestions to improve its design and compatibility.
>
> 3. Possibilities to integrate with resource-provider bp series: I know
> resource-provider is the major direction of Nova scheduler, and there will
> be fundamental changes in the future, especially according to the bp
> https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
> However, this prototype proposes a much faster and compatible way to make
> scheduling decisions based on scheduler caches. The in-memory decisions are
> made at the same speed as with the caching scheduler, but the caches are
> kept consistent with compute nodes as quickly as possible without db
> refreshing.
>
>
>
> Here is the detailed design of the mentioned prototype:
>
>
>
> >>----------------------------
>
> Background:
>
> The host state cache maintained by the host manager is the scheduler's
> resource view during scheduling decision making. It is updated whenever a
> request is received[1], and all the compute node records are retrieved from
> the db every time. There are several problems with this update model, shown
> in experiments[3]:
>
> 1. Performance: The scheduler performance is largely affected by db access
> in retrieving compute node records. The db block time of a single request
> is 355ms on average in a deployment of 3 compute nodes, compared with
> only 3ms for in-memory decision making. Imagine there could be as many as
> 1k nodes, or even 10k nodes in the future.
>
> 2. Race conditions: This is not only a parallel-scheduler problem, but
> also a problem when using only one scheduler. The detailed analysis of the
> one-scheduler problem is located in the bug analysis[2]. In short, there is
> a gap between the time the scheduler makes a decision in the host state
> cache and the time the compute node updates its in-db resource record
> according to that decision in the resource tracker. Because of this gap, a
> recent scheduler resource consumption in the cache can be lost and
> overwritten by compute node data, resulting in cache inconsistency and
> unexpected retries. In a one-scheduler experiment using a 3-node
> deployment, 7 retries were recorded out of 31 concurrent schedule requests,
> resulting in 22.6% extra performance overhead.
>
> 3. Parallel scheduler support: The design of the filter scheduler leads to
> an "even worse" performance result when using parallel schedulers. In the
> same experiment with 4 schedulers on separate machines, the average db
> block time increased to 697ms per request and there were 16 retries out of
> 31 schedule requests, namely 51.6% extra overhead.
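
The overhead percentages in items 2 and 3 can be reproduced with a few lines
of arithmetic. This is only a sketch under the assumption that each retry
costs one extra scheduling pass, so the overhead is simply retries divided by
requests; the helper name `retry_overhead` is made up for illustration.

```python
def retry_overhead(retries, requests):
    """Extra scheduling passes as a percentage of the original requests."""
    return round(100.0 * retries / requests, 1)


# Figures quoted in items 2 and 3 of the list above.
one_scheduler = retry_overhead(7, 31)
four_schedulers = retry_overhead(16, 31)

# Ratio of db block time to in-memory decision time (355ms vs 3ms).
db_to_memory_ratio = round(355 / 3)
```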
>
>
>
> Improvements:
>
> This prototype solves the issues mentioned above by implementing a new
> update model for the scheduler host state cache. Instead of refreshing
> caches from the db, every compute node maintains its accurate version of
> the host state cache, updated by the resource tracker, and sends
> incremental updates directly to schedulers. So the scheduler caches are
> synchronized to the correct state as soon as possible with the lowest
> overhead. Also, the scheduler sends a resource claim with its decision to
> the target compute node. The compute node can decide immediately whether
> the resource claim is successful using its local host state cache and send
> a response back ASAP. With all the claims tracked from schedulers to
> compute nodes, no false overwrites will happen, and thus the gaps between
> the scheduler cache and real compute node states are minimized. The
> benefits are obvious in the recorded experiments[3] when compared with the
> caching scheduler and the filter scheduler:
>
> 1. There is no db block time during scheduler decision making. The average
> decision time per request is about 3ms in both single and multiple
> scheduler scenarios, which is equal to the in-memory decision time of the
> filter scheduler and the caching scheduler.
>
> 2. Since the scheduler claims are tracked and the "false overwrite" is
> eliminated, there should be 0 retries in a one-scheduler deployment, as
> confirmed in the experiment. Thanks to the quick claim response
> implementation, there are only 2 retries out of 31 requests in the
> 4-scheduler experiment.
>
> 3. All the filtering and weighing algorithms are compatible because the
> data structure of HostState is unchanged. In fact, this prototype even
> supports the filter scheduler running at the same time (already tested). As
> for other operations with resource changes such as migration, resizing or
> shelving, they make claims in the resource tracker directly and update the
> compute node host state immediately without major changes.
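
The claim step described above, where the compute node decides against its
own local host state and answers immediately, could be sketched roughly as
follows. This is not the prototype's code: `ComputeHostState`, `claim` and the
resource keys are hypothetical, and real claims in Nova cover far more than
RAM and disk.

```python
class ComputeHostState:
    """Local, authoritative resource view on a compute node (illustrative)."""

    def __init__(self, free_ram_mb, free_disk_gb):
        self.free = {"ram_mb": free_ram_mb, "disk_gb": free_disk_gb}

    def claim(self, request):
        """Try to consume resources; return (ok, delta_for_schedulers)."""
        if any(self.free[r] < amount for r, amount in request.items()):
            # Reject immediately. Nothing was consumed, so there is no delta.
            return False, {}
        for r, amount in request.items():
            self.free[r] -= amount
        # This delta is what would be cast to the schedulers as an
        # incremental update.
        return True, {r: -amount for r, amount in request.items()}


node = ComputeHostState(free_ram_mb=4096, free_disk_gb=40)
first = node.claim({"ram_mb": 3072, "disk_gb": 10})
second = node.claim({"ram_mb": 2048, "disk_gb": 10})  # not enough RAM left
```

The point of the sketch is that the success or failure of a claim is decided
in one place, so a scheduler that guessed wrong from a stale cache learns
about it from the node's reply instead of from a later db refresh.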
>
>
>
> Extra features:
>
> More effort has been made to adapt the implementation to real-world
> scenarios, such as network issues, services going down unexpectedly and
> overwhelming numbers of messages:
>
> 1. The communication between schedulers and compute nodes uses only casts;
> there are no RPC calls and thus no blocking during scheduling.
>
> 2. All updates from nodes to schedulers are labelled with an incremental
> seed, so any message reordering, loss or duplication due to network issues
> can be detected by MessageWindow immediately. The inconsistent cache can be
> detected and refreshed correctly.
>
> 3. Overwhelming numbers of messages are compressed by MessagePipe in its
> async mode. There is no need to send all the messages one by one in the MQ;
> they can be merged before being sent to schedulers.
>
> 4. When a new service is up or recovered, it sends notifications to all
> known remotes for quick cache synchronization, even before the service
> record is available in the db. And if a remote service is unexpectedly down
> according to service group records, no more messages will be sent to it.
> The ComputeFilter is also removed because of this feature: the scheduler
> can detect remote compute nodes by itself.
>
> 5. In fact the claim tracking is not only from schedulers to compute
> nodes, but also from the compute-node host state to the resource tracker.
> One reason is that there is still a gap between the time a claim is
> acknowledged by the compute-node host state and the time the claim succeeds
> in the resource tracker. It is necessary to track those unhandled claims to
> keep the host state accurate. The second reason is to separate schedulers
> from compute nodes and resource trackers. The scheduler only exports the
> limited interfaces `update_from_compute` and `handle_rt_claim_failure` to
> the compute service and the RT, so testing and reuse are easier with clear
> boundaries.
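
Items 2 and 3 in the list above can be illustrated with a small sketch. It is
not the prototype's `MessageWindow` or `MessagePipe`; `SeedChecker` and
`merge_updates` are hypothetical stand-ins, assuming seeds start at 0 and
increase by one per update from a given compute node.

```python
class SeedChecker:
    """Tracks the next expected seed from one compute node (illustrative)."""

    def __init__(self):
        self.expected = 0

    def check(self, seed):
        """Classify an incoming update by its seed."""
        if seed == self.expected:
            self.expected += 1
            return "ok"
        if seed < self.expected:
            # Duplicate or stale message: already covered, safe to drop.
            return "duplicate"
        # A seed was skipped: something was lost or reordered, so the
        # cached host state can no longer be trusted and needs a refresh.
        return "gap"

    def reset(self, seed):
        """Resynchronize after a full refresh that already includes `seed`."""
        self.expected = seed + 1


def merge_updates(deltas):
    """Merge queued resource deltas into a single delta before sending."""
    merged = {}
    for delta in deltas:
        for resource, change in delta.items():
            merged[resource] = merged.get(resource, 0) + change
    return merged


checker = SeedChecker()
results = [checker.check(s) for s in (0, 1, 1, 3)]
merged = merge_updates([{"ram_mb": -512}, {"ram_mb": -1024, "disk_gb": -5}])
```

Merging works here only because the updates are additive deltas; a sender
that merges would also have to decide how seeds are assigned to the merged
message, which this sketch leaves out.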
>
>
>
> TODOs:
>
> There are still many features to be implemented. The most important are
> unit tests and incremental updates to PCI and NUMA resources; all of them
> are marked inline.
>
>
>
> References:
>
> [1]
> https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L104
>
> [2] https://bugs.launchpad.net/nova/+bug/1341420/comments/24
>
> [3] http://paste.openstack.org/show/486929/
>
> ----------------------------<<
>
>
>
> The original commit history of this prototype is located in
> https://github.com/cyx1231st/nova/commits/shared-scheduler
>
> For instructions to install and test this prototype, please refer to the
> commit message of https://review.openstack.org/#/c/280047/
>
>
>
>
>
> Regards,
>
> -Yingxin
>
>
>
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>