[openstack-dev] [nova][scheduler] More test results for the "eventually consistent host state" prototype
yingxin.cheng at intel.com
Mon Jun 27 07:45:31 UTC 2016
Based on the feedback from the Austin design summit, I prepared my environment with pre-loaded computes and finished a new round of performance profiling using the tool. I also updated the prototype to simplify the implementation on the compute-node side, which brings it closer to the design described in the spec.
This set of results is more comprehensive: it covers the “eventually consistent host states” prototype, the default filter scheduler, and the caching scheduler. They are tested in various scenarios in a 1000-compute-node environment, with real controller services, a real RabbitMQ and a real MySQL database. The new set of experiments contains 55 repeatable results. Don’t be put off by the verbose data; I’ve already dug out the conclusions.
To better understand what’s happening during scheduling in the different scenarios, all of them are visualized in the doc. They complement what I presented at the Austin design summit, on the 7th page of the ppt.
Note that the “pre-load scenario” allows only 49 new instances to be launched in the 1000-node environment. It means that when 50 requests are sent, there should be one and only one failed request if the scheduler decisions are accurate.
Detailed analysis with illustrations: https://docs.google.com/document/d/1qFNROdJxj4m1lXe250DW3XAAY02QHmlTm1N2nEHVVPg/edit?usp=sharing
In all test cases, nova dispatches 50 instant requests to 1000 compute nodes. The aim is to compare the behavior of 3 types of schedulers, in preloaded or empty-loaded scenarios, and with 1 or 2 scheduler services. That’s 3*2*2=12 sets of experiments, and each set is tested multiple times.
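For clarity, the experiment matrix is just the cross product of those three dimensions. A trivial Python sketch (the names are mine, not the test harness’s):

    from itertools import product

    schedulers = ["filter", "caching", "prototype"]  # 3 scheduler types
    loads = ["empty", "preload"]                     # 2 load scenarios
    services = [1, 2]                                # 1 or 2 scheduler services

    experiments = list(product(schedulers, loads, services))
    assert len(experiments) == 12  # 3*2*2 sets, each tested multiple times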
In scenario S1 (i.e. 1 scheduler with empty-loaded compute nodes), we can see from A2 very clearly that the entire boot process under the filter scheduler is throttled by the nova-scheduler service. The filter scheduler consumes those 50 requests very slowly, causing all the requests to be blocked in front of the scheduler service in the yellow area. The root cause is the “cache-refresh” step before filtering (i.e. `nova.scheduler.filter_scheduler.FilterScheduler._get_all_host_states`). I discussed this bottleneck in detail in the Austin summit session “Dive into nova scheduler performance: where is the bottleneck”. The caching scheduler proves this as well, because it excludes the “cache-refresh” bottleneck and only uses in-memory filtering. By simply excluding “cache-refresh”, the performance benefits are huge: the query time is reduced by 87%, and the overall throughput (i.e. the delivered requests per second in this cloud) is multiplied by 8.24; see A3 for illustration. The “eventually consistent host states” prototype also excludes this bottleneck, and takes a more careful approach to synchronizing the scheduler caches. It is slightly slower than the caching scheduler, because there is an overhead in applying incremental updates from compute nodes. The query time is reduced by 79% and the overall throughput is multiplied by 5.63 on average in S1.
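To make the difference concrete, here is a heavily simplified sketch of the two caching strategies. This is not the actual nova code; HostState, db_get_all_compute_nodes and the update shape are illustrative stand-ins:

    class HostState(object):
        def __init__(self, host, free_ram_mb):
            self.host = host
            self.free_ram_mb = free_ram_mb

        def apply(self, delta_ram_mb):
            # One incremental update pushed by a compute node.
            self.free_ram_mb += delta_ram_mb

    def db_get_all_compute_nodes():
        # Stand-in for the per-request DB query that dominates
        # FilterScheduler._get_all_host_states().
        return [("node-%d" % i, 4096) for i in range(1000)]

    class RefreshingHostManager(object):
        # Filter-scheduler style: rebuild every host state from the
        # database on each scheduling request (the bottleneck above).
        def get_all_host_states(self):
            return [HostState(h, ram) for h, ram in db_get_all_compute_nodes()]

    class IncrementalHostManager(object):
        # Prototype style: keep host states in memory and apply small
        # incremental updates from compute nodes; no full refresh.
        def __init__(self):
            self.host_states = dict(
                (h, HostState(h, ram)) for h, ram in db_get_all_compute_nodes())

        def apply_update(self, host, delta_ram_mb):
            self.host_states[host].apply(delta_ram_mb)

        def get_all_host_states(self):
            return self.host_states.values()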
In the preload scenario S2, we can see all 3 types of scheduler are faster than in their empty-loaded scenario. That’s because the filters can now prune the hosts from 1000 down to only 49, so the last few filters don’t need to process 1000 host states and can be much faster. But the filter scheduler (B2) cannot benefit much from faster filtering, because its bottleneck is still the “cache refresh”. It means more for the caching scheduler and the prototype, because their performance depends heavily on in-memory filtering. For the caching scheduler (B3), the query time is reduced by 81% and the overall throughput is multiplied by 7.52 compared with the filter scheduler. And for the prototype (B1), the query time is reduced by 83% and the throughput is multiplied by 7.92 on average. Also, all those scheduler decisions are accurate: their first decisions are all correct without any retries in the preload scenario, and only 1 of the 50 requests fails, with the expected “no valid host” error.
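The effect of preloading on in-memory filtering can be shown with a toy filter chain (again illustrative, not nova’s real filter API): once the first filter prunes the 1000 hosts down to 49, every later filter only touches 49 host states.

    # 1000 host states; in the preload case only 49 still have capacity
    hosts = [{"free_ram_mb": 1024 if i < 49 else 0, "free_vcpus": 4}
             for i in range(1000)]

    def ram_filter(hosts):
        return [h for h in hosts if h["free_ram_mb"] >= 512]

    def core_filter(hosts):
        # Only sees the 49 hosts that passed ram_filter, so it is cheap.
        return [h for h in hosts if h["free_vcpus"] >= 1]

    for f in (ram_filter, core_filter):
        hosts = f(hosts)
    print(len(hosts))  # 49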
In scenario S3, with 2 scheduler services and empty-loaded compute nodes, the internal scheduling bandwidth of every scheduler type is doubled. The filter scheduler (C2) shows a major improvement, because its scheduling bandwidth is multiplied. But the other two types show no similar improvement, because their bottleneck is now the nova-api service instead. It is a wrong decision to add more schedulers when the actual bottleneck is elsewhere; worse, multiple schedulers introduce more race conditions as well as other overhead. However, the performance of the caching scheduler (C3) and the prototype (C1) is still much better: the query time is reduced by 65% and the overall throughput is multiplied by 3.67 on average.
In the preload scenario S4 with 2 schedulers, race conditions surface because there are only 49 free slots among the 1000 hosts, and the races all result in retries. The results (D1, D2, D3) show that the retry rates are similar under 50 instant requests, though I already have a further idea to improve the prototype, and the results should differ under continuous boot and delete requests. Some tests also show that the caches in the caching scheduler are still outdated even after 1 minute. For example, in the test “results-1s-1000n-50r-0p222-preload-caching3”, 19 requests failed because of outdated caches, and in the test “results-2s-1000n-50r-0p222-preload-caching4”, 21 requests failed for the same reason.
Quick conclusion here
In short, this prototype has the following improvements and guarantees:
1. When empty loaded, its performance is much better than the filter scheduler’s (5.63x better in 1000 nodes), and close to the caching scheduler’s.
2. With pre-load, its advantage over the filter scheduler is even bigger (7.92x), and it is even closer to the caching scheduler.
3. Its placement accuracy is 100% in the 1-scheduler scenario.
4. There is no major change to the scheduling process; it is highly compatible with the existing scheduler architecture (before resource providers).
5. The biggest bottleneck, “cache-refresh”, is resolved by this prototype; instead, nova-api becomes the new bottleneck limiting the throughput.
6. Racing among schedulers is allowed by the lock-free design, and the race rate in the 2-scheduler scenario is acceptable (see the sketch below).
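The sketch mentioned in point 6, assuming (as in nova’s existing claim/retry mechanism) that the compute node stays the final arbiter of its own resources; all names here are illustrative:

    class ComputeClaimError(Exception):
        pass

    class FakeComputeNode(object):
        def __init__(self, free_ram_mb):
            self.free_ram_mb = free_ram_mb

        def claim(self, ram_mb):
            # Verify the scheduler's lock-free decision against the real
            # resources. If two racing schedulers pick this host, only one
            # claim succeeds; the loser triggers a retry elsewhere.
            if ram_mb > self.free_ram_mb:
                raise ComputeClaimError("host is actually full, retry")
            self.free_ram_mb -= ram_mb
            return self.free_ram_mb  # new state, pushed back to schedulers

    node = FakeComputeNode(free_ram_mb=512)
    node.claim(512)        # the first scheduler wins
    try:
        node.claim(512)    # the racing scheduler loses and retries
    except ComputeClaimError as e:
        print(e)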
About the profiling tool 
It is worth noting that this tool is much more precise and verbose than Rally. The whole analysis is offline, whereas Rally adds extra pressure on the API by polling for status. And the analysis is fine-grained, based on injected logs, whereas Rally relies only on API-level responses.
This tool can simulate compute nodes by attaching the fake virt driver and launching the nodes as processes. It can also be deployed in a real OpenStack cluster by monkey-patching the related nova services; it has already been deployed successfully in China Mobile’s 1000-node environment, which uses multiple controllers. The tool can profile all the existing schedulers from the Kilo to the Mitaka release, under various configurations, using the same analysis framework. This also shows that the prototype doesn’t introduce major changes to the existing nova architecture.
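The monkey-patching essentially wraps interesting nova methods with timestamped log points that are collected for the offline analysis. A minimal illustration of the idea (my own simplification, not the tool’s actual code):

    import functools
    import time

    def inject_log(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                # Emit a timestamped record for offline, fine-grained analysis.
                print("%s took %.3fs" % (func.__name__, time.time() - start))
        return wrapper

    # e.g. patch a hot spot such as
    # FilterScheduler._get_all_host_states = \
    #     inject_log(FilterScheduler._get_all_host_states)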
This prototype isn’t a design for a “shared-state scheduler”; it is an improvement to the existing host manager. But it is a very important step towards the “shared-state scheduler” described on the 11th page of the ppt. Basically, that design leverages inter-process communication with workers to improve scheduler throughput. Inter-process communication is much faster than the network, which is why my previous analysis based on the placement bench showed extremely low race rates (only 3% of 12000 requests using 8 workers) and much better performance (38 times better according to the data). The end goal is to use this model inside the scheduler service.
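A toy version of that worker model (purely illustrative; it only assumes requests are fanned out to scheduler workers over cheap IPC queues instead of the message queue):

    import multiprocessing as mp

    def scheduler_worker(requests, results):
        # Each worker schedules against its own in-memory host-state view;
        # state updates would travel over the same fast IPC channel.
        for req in iter(requests.get, None):
            results.put("request %s -> some host" % req)

    if __name__ == "__main__":
        requests, results = mp.Queue(), mp.Queue()
        workers = [mp.Process(target=scheduler_worker,
                              args=(requests, results)) for _ in range(8)]
        for w in workers:
            w.start()
        for req in range(50):
            requests.put(req)
        for _ in range(50):
            print(results.get())
        for _ in workers:
            requests.put(None)  # stop each worker
        for w in workers:
            w.join()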
The current plan still relies on the progress of the resource-provider series. Substantial design changes might be needed to adapt to the new architecture. But that doesn’t matter from my side; it is more important and beneficial to implement a generic scheduler service as early as possible.