[openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

Cheng, Yingxin yingxin.cheng at intel.com
Tue Mar 1 08:34:50 UTC 2016


I have simulated distributed resource management with the incremental update model based on Jay's benchmarking framework: https://github.com/cyx1231st/placement-bench/tree/shared-state-demonstration. The complete result is at http://paste.openstack.org/show/488677/. It was run on a VM with 4 cores and 4GB RAM; the mysql service uses the default settings except "innodb_buffer_pool_size", which is set to "2G". The number of simulated compute nodes is set to 300.

First, the conclusions from the result of the eventually consistent scheduler state simulation (i.e. rows where "do claim in compute?" = Yes):
1. The final decision accuracy is 100%: no resource usage exceeds the real capacity, verified by examining the rationality of db records at the end of each run.
2. The schedule decision accuracy is 100% if there's only one scheduler: all successful scheduler decisions also succeed on the compute nodes, so no retries are recorded, i.e. "Count of requests processed" = "Placement query count". See http://paste.openstack.org/show/488696/
3. The schedule decision accuracy is 100% if "Partition strategy" is set to "modulo", regardless of the number of scheduler processes. See http://paste.openstack.org/show/488697/
4. No races occur if there's only one scheduler process or if the "Partition strategy" is set to "modulo", as explained by 2 and 3.
5. The multiple-scheduler racing rate is extremely low with the "spread" or "random" placement strategies used by the legacy filter scheduler: 3.0% with the "spread" strategy and 0.15% with "random", with 8 workers processing about 12000 requests within 20 seconds. This is even better than the resource-provider scheduler (rows where "do claim in compute?" = No), which shows 82.9% with "spread" and 2.52% with "random" for 12000 requests within 70-190 seconds. See http://paste.openstack.org/show/488699/. Note, the retry rate is calculated as (Placement query count - Count of requests processed) / Count of requests processed * 100%
#overwhelming messages
6. The total count of messages is affected only by the number of schedulers and the number of schedule queries, NOT by the number of compute nodes. See http://paste.openstack.org/show/488701/
7. The messages per successful query are (number_of_schedulers + 2); the growth is linear and affected only by the number of scheduler processes. There is only 1 message if the query failed. This is not a huge number, and no additional messages are needed to access the db during scheduling.
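To make conclusions 5-7 concrete, here is a small sketch of how the two metrics above are computed from the benchmark counters. The function names and the sample counter values are illustrative (chosen to reproduce the 3.0% "spread" rate), not taken from the tool's actual code:

```python
def retry_rate(placement_query_count, requests_processed):
    """Retry rate = (queries - processed) / processed * 100%."""
    return (placement_query_count - requests_processed) / requests_processed * 100.0

def total_messages(num_schedulers, successful_queries, failed_queries):
    """Per conclusions 6 and 7: each successful query costs
    (num_schedulers + 2) messages, each failed query costs 1."""
    return successful_queries * (num_schedulers + 2) + failed_queries

# e.g. 8 schedulers, 12000 requests processed, 12360 placement queries:
print(retry_rate(12360, 12000))        # 3.0 (percent)
print(total_messages(8, 12000, 0))     # 120000
```

Note the message count is independent of the number of compute nodes, which is the point of conclusion 6.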

Second, here's what I've found in the centralized db claim design (i.e. rows where "do claim in compute?" = No):
1. The speed of legacy python filtering is not slow (see rows where "Filter strategy" = python): "Placement total query time" records the total query cost, including fetching states from the db and filtering in python. The actual cost of python filtering is (Placement_total_query_time - Placement_total_db_query_time), which is only about 1/4 of the total cost or even less. This also means python in-memory filtering is much faster than db filtering in this experiment. See http://paste.openstack.org/show/488710/
2. The `db filter strategy` and the legacy `python filter strategy` are in the same order of magnitude in speed, so not a huge improvement. See the comparison of the column "Placement total query time". Note that the extra cost of the `python filter strategy` mainly comes from "Placement total db query time" (i.e. fetching states from the db). See http://paste.openstack.org/show/488709/
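The derivation in point 1 can be sketched as plain arithmetic. The variable names and sample timings below are descriptive placeholders, not the benchmark's actual column names or measured values:

```python
def python_filter_cost(total_query_time, db_query_time):
    # Total time = db fetch + in-memory python filtering,
    # so the python-only part is the difference.
    return total_query_time - db_query_time

# Hypothetical timings (seconds) where db fetching dominates:
total, db = 40.0, 30.0
cost = python_filter_cost(total, db)
print(cost, cost / total)   # 10.0 0.25 -> filtering is ~1/4 of the total
```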

Third, my major concern with the "centralized db claim" design is that it puts too much scheduling work into the centralized db, and it cannot scale simply by adding conductors and schedulers.
1. The filtering work is mostly done inside the db by executing complex SQL. If the filtering logic becomes much more complex (currently only CPU and RAM are accounted for in the experiment), the db overhead will be considerable.
2. Races between centralized claims are resolved by rolling back transactions and by checking generations (see the design of the "compare and update" strategy in https://review.openstack.org/#/c/283253/), which also adds overhead to the db.
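For readers unfamiliar with the "compare and update" strategy referenced above, here is a minimal sketch of the generation-check idea, using sqlite3 as a stand-in for the central db. The schema and function names are illustrative, not taken from the review:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory "
             "(id INTEGER PRIMARY KEY, free_ram INTEGER, generation INTEGER)")
conn.execute("INSERT INTO inventory VALUES (1, 4096, 0)")
conn.commit()

def claim(conn, node_id, ram, expected_generation):
    """Claim RAM only if nobody changed the row since we read it.
    On a lost race the WHERE clause matches no rows and the caller
    must re-read the row and retry."""
    cur = conn.execute(
        "UPDATE inventory "
        "SET free_ram = free_ram - ?, generation = generation + 1 "
        "WHERE id = ? AND generation = ? AND free_ram >= ?",
        (ram, node_id, expected_generation, ram))
    conn.commit()
    return cur.rowcount == 1

print(claim(conn, 1, 1024, expected_generation=0))   # True: first claim wins
print(claim(conn, 1, 1024, expected_generation=0))   # False: stale generation
```

Every lost race means a re-read and a retried UPDATE against the central db, which is the overhead point 2 refers to.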
3. The db overhead of the filtering operation can be relaxed by moving it to the schedulers; that is 38 times faster and can be executed in parallel by schedulers, according to the column "Placement avg query time". See http://paste.openstack.org/show/488715/
4. The "compare and update" overhead can be partially relaxed by using distributed resource claims in the resource trackers. There is no need to roll back transactions when updating the inventories of compute-local resources in order to stay accurate. This is confirmed by checking the db records at the end of each run of the eventually consistent scheduler state design.
5. If a) all the filtering operations are done inside schedulers,
        b) schedulers do not need to refresh caches from the db because of incremental updates,
        c) there is no need to do "compare and update" for compute-local resources (i.e. non-shared resources),
     then here is the performance comparison using 1 scheduler instance: http://paste.openstack.org/show/488717/
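Condition b) above is the heart of the incremental update model: the scheduler keeps an in-memory view of each host and merges small deltas pushed by compute nodes, instead of re-fetching full state from the db per request. A minimal sketch, with illustrative class and field names (not the prototype's actual code):

```python
class HostStateCache:
    """In-memory scheduler view, maintained purely from update messages."""

    def __init__(self):
        self.hosts = {}   # host name -> {"free_ram": ..., "free_vcpus": ...}

    def apply_update(self, host, delta):
        """Merge one incremental update message into the cached view."""
        state = self.hosts.setdefault(host, {"free_ram": 0, "free_vcpus": 0})
        for key, change in delta.items():
            state[key] += change

cache = HostStateCache()
cache.apply_update("node-1", {"free_ram": 4096, "free_vcpus": 4})    # initial report
cache.apply_update("node-1", {"free_ram": -1024, "free_vcpus": -1})  # a claim
print(cache.hosts["node-1"])   # {'free_ram': 3072, 'free_vcpus': 3}
```

Since claims for compute-local resources are finalized distributedly at the compute nodes, the cache only needs to be eventually consistent, which is what the simulation above measures.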

Finally, it is not fair to directly compare the actual ability of the resource-provider scheduler and the shared-state scheduler using this benchmarking tool, because 300 more processes need to be created in order to simulate the distributed resource management of 300 compute nodes, and there are no conductors and no MQ in the simulation. But I think it is still useful to provide at least some statistics.


On Wednesday, February 24, 2016 5:04 PM, Cheng, Yingxin wrote:
> Very sorry for the delay; it is hard for me to reply to all the concerns raised,
> since most of you have years more experience. I've tried hard to show that there is
> a direction to solve the issues of the existing filter scheduler in multiple areas
> including performance, decision accuracy, multiple-scheduler support and race
> conditions. I'll also support any other solution that can solve the same issues
> elegantly.
> @Jay Pipes:
> I feel that the scheduler team will agree with a design that can fulfill thousands of
> placement queries with thousands of nodes. But for a scheduler that will be
> split out to be used in wider areas, it's not simple to predict that
> requirement, so I don't agree with the statement "It doesn't need to be high-
> performance at all". No system exists without a performance
> bottleneck, including the resource-provider scheduler and the shared-state scheduler. I
> was trying to point out where the potential bottleneck is in each design and how
> to improve it if the worst happens, quote:
> "The performance bottleneck of resource-provider and legacy scheduler is from
> the centralized db (REMOVED: and scheduler cache refreshing). It can be
> alleviated by changing to a stand-alone high performance database. And the
> cache refreshing is designed to be replaced by direct SQL queries according to
> resource-provider scheduler spec. The performance bottleneck of shared-state
> scheduler may come from the overwhelming update messages, it can also be
> alleviated by changing to stand-alone distributed message queue and by using
> the "MessagePipe" to merge messages."
> I'm not saying that there is a bottleneck in the resource-provider scheduler in
> fulfilling the current design goal. The ability of the resource-provider scheduler is
> already proven by a nice modeling tool implemented by Jay, and I trust it. But I care
> more about the actual limit of each design and how easily it can be extended
> to raise that limit. That's why I turned to making efforts in the scheduler functional
> test framework (https://review.openstack.org/#/c/281825/). I finally want to
> test scheduler functionality using greenthreads in the gate, and to test the
> performance and placement of each design using real processes. And I hope the
> scheduler can stay open to both centralized and distributed designs.
> I've updated my understanding of three designs:
> https://docs.google.com/document/d/1iNUkmnfEH3bJHDSenmwE4A1Sgm3vnq
> a6oZJWtTumAkw/edit?usp=sharing
> The "cache updates" arrows are changed to "resource updates" in the resource-
> provider scheduler, because I think resource updates from the virt driver still
> need to be pushed to the central database. Hope this time it's right.
> @Sylvain Bauza:
> As the first step towards the shared-state scheduler, the required changes are kept
> to a minimum. No db modifications are needed, so there are no rolling-upgrade issues
> with data migration. The new scheduler can decide not to make decisions for old
> compute nodes, or it can refresh host states from the db and use the legacy way to
> make decisions until all the compute nodes are upgraded.
> I have to admit that my prototype still lacks effort in dealing with overwhelming
> messages. This design works best with a distributed message queue. Also, if
> we initiate multiple scheduler processes/workers on a single host, there is
> a lot more to be done to reduce update messages between compute nodes and
> scheduler workers. But I see the potential of distributed resource
> management/scheduling and would like to make efforts in this direction.
> If we agree that final decision accuracy is guaranteed in both
> designs, we should care more about the final decision throughput of each.
> Theoretically the shared-state design is better because the final consumptions are
> made distributedly, but there are difficulties in reaching that limit. However, the
> centralized design can approach its theoretical performance more easily because of
> the lightweight implementation inside the scheduler and the powerful underlying
> database.
> Regards,
> -Yingxin
