[openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

John Garbutt john at johngarbutt.com
Tue Mar 1 11:28:54 UTC 2016

On 1 March 2016 at 08:34, Cheng, Yingxin <yingxin.cheng at intel.com> wrote:
> Hi,
> I have simulated the distributed resource management with the incremental update model based on Jay's benchmarking framework: https://github.com/cyx1231st/placement-bench/tree/shared-state-demonstration. The complete result lies at http://paste.openstack.org/show/488677/. It's ran by a VM with 4 cores and 4GB RAM, and the mysql service is using the default settings with the "innodb_buffer_pool_size" setting to "2G". The number of simulated compute nodes are set to "300".
> [...]
> Second, here's what I've found in the centralized db claim design(i.e. rows that "do claim in compute?" = No):
> 1. The speed of legacy python filtering is not slow(see rows that "Filter strategy" = python): "Placement total query time" records the cost of all query time including fetching states from db and filtering using python. The actual cost of python filtering is (Placement_total_query_time - Placement_total_db_query_time), and that's only about 1/4 of total cost or even less. It also means python in-memory filtering is much faster than db filtering in this experiment. See http://paste.openstack.org/show/488710/
> 2. The speed of `db filter strategy` and the legacy `python filter strategy` are in the same order of magnitude, not a very huge improvement. See the comparison of column "Placement total query time". Note that the extra cost of `python filter strategy` mainly comes from "Placement total db query time"(i.e. fetching states from db). See http://paste.openstack.org/show/488709/

I think it might be time to run this in a nova-scheduler like
environment: eventlet threads responding to rabbit, using pymysql
backend, etc. Note we should get quite a bit of concurrency within a
single nova-scheduler process with the db approach.

I suspect clouds that are largely full of pets, pack/fill first, with
a smaller percentage of cattle on top, will benefit the most, as that
initial DB filter will bring back a small list of hosts.

> Third, my major concern of "centralized db claim" design is: Putting too much scheduling works into the centralized db, and it is not scalable by simply adding conductors and schedulers.
> 1. The filtering works are majorly done inside db by executing complex sqls. If the filtering logic is much more complex(currently only CPU and RAM are accounted in the experiment), the db overhead will be considerable.

So, to clarify, only resources we have claims for in the DB will be
filtered in the DB. All other filters will still occur in python.

The good news, is that if that turns out to be the wrong trade off,
its easy to revert back to doing all the filtering in python, with
zero impact on the DB schema.

> 2. The racing of centralized claims are resolved by rolling back transactions and by checking the generations(see the design of "compare and update" strategy in https://review.openstack.org/#/c/283253/), it also causes additional overhead to db.

Its worth noting this pattern is designed to work well with a Galera
DB cluster, including one that has writes going to all the nodes.

> 3. The db overhead of filtering operation can be relaxed by moving them to schedulers, that will be 38 times faster and can be executed in parallel by schedulers according to the column "Placement avg query time". See http://paste.openstack.org/show/488715/
> 4. The "compare and update" overhead can be partially relaxed by using distributed resource claims in resource trackers. There is no need to roll back transactions in updating inventories of compute local resources in order to be accurate. It is confirmed by checking the db records at the end of each run of eventually consistent scheduler state design.
> 5. If a) all the filtering operations are done inside schedulers,
>         b) schedulers do not need to refresh caches from db because of incremental updates,
>         c) it is no need to do "compare and update" to compute-local resources(i.e. none-shared resources),
>      then here is the performance comparison using 1 scheduler instances: http://paste.openstack.org/show/488717/

The other side of the coin here is sharding.

For example, we could have a dedicated DB cluster for just the
scheduler data (need to add code to support that, but should be
possible now, I believe).

Consider if you have three types of hosts, that map directly to
specific flavors. You can shard your scheduler and DB clusters into
those groups (i.e. compute node inventory lives only in one of the
shards). When the request comes in you just route appropriate build
requests to each of the scheduler clusters.

If you have a large enough deployment, you can shard your hosts across
several DB clusters, and use a modulo or random sharding stragegy to
pick which cluster the request lands on. There are issues around
ensuring you do capacity planning that takes into account those
shards, but aligning the shards with Availability Zones, or similar,
might stop that being an additional burden.

> Finally, it is not fair to directly compare the actual ability of resource-provider scheduler and shared-state scheduler using this benchmarking tool, because there are 300 more processes needed to be created in order to simulate the distributed resource management of 300 compute nodes, and there are no conductors and MQ in the simulation. But I think it is still useful to provide at least some statistics.

Really looking forward to a simulator that can test them all in
slightly more real way (maybe using fake virt, and full everything


More information about the OpenStack-dev mailing list