[openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"
Cheng, Yingxin
yingxin.cheng at intel.com
Wed Mar 2 08:15:58 UTC 2016
On Tuesday, March 1, 2016 7:29 PM, John Garbutt <john at johngarbutt.com> wrote:
> On 1 March 2016 at 08:34, Cheng, Yingxin <yingxin.cheng at intel.com> wrote:
> > Hi,
> >
> > I have simulated the distributed resource management with the incremental
> update model based on Jay's benchmarking framework:
> https://github.com/cyx1231st/placement-bench/tree/shared-state-
> demonstration. The complete result lies at
> http://paste.openstack.org/show/488677/. It's run on a VM with 4 cores and
> 4GB RAM, and the mysql service is using the default settings except that
> "innodb_buffer_pool_size" is set to "2G". The number of simulated compute
> nodes is set to "300".
> >
> > [...]
> >
> > Second, here's what I've found in the centralized db claim design (i.e. rows
> where "do claim in compute?" = No):
> > 1. The speed of legacy python filtering is not slow (see rows where
> > "Filter strategy" = python): "Placement total query time" records the
> > cost of all query time including fetching states from db and filtering
> > using python. The actual cost of python filtering is
> > (Placement_total_query_time - Placement_total_db_query_time), and
> > that's only about 1/4 of the total cost, or even less. It also means python
> > in-memory filtering is much faster than db filtering in this
> > experiment. See http://paste.openstack.org/show/488710/
> > 2. The speed of the `db filter strategy` and the legacy `python filter
> > strategy` are of the same order of magnitude, so not a very huge
> > improvement. See the comparison of the column "Placement total query
> > time". Note that the extra cost of the `python filter strategy` mainly
> > comes from "Placement total db query time" (i.e. fetching states from
> > db). See http://paste.openstack.org/show/488709/
>
> I think it might be time to run this in a nova-scheduler-like
> environment: eventlet threads responding to rabbit, using the pymysql backend, etc.
> Note we should get quite a bit of concurrency within a single nova-scheduler
> process with the db approach.
>
> I suspect clouds that are largely full of pets, pack/fill first, with a smaller
> percentage of cattle on top, will benefit the most, as that initial DB filter will
> bring back a small list of hosts.
>
> > Third, my major concern about the "centralized db claim" design is: it puts too much
> scheduling work into the centralized db, and that cannot be scaled by simply adding
> conductors and schedulers.
> > 1. The filtering work is mostly done inside the db by executing complex sqls. If
> the filtering logic gets much more complex (currently only CPU and RAM are
> accounted for in the experiment), the db overhead will be considerable.
>
> So, to clarify, only resources we have claims for in the DB will be filtered in the
> DB. All other filters will still occur in python.
>
> The good news is that if that turns out to be the wrong trade off, it's easy to
> revert to doing all the filtering in python, with zero impact on the DB
> schema.
Another point is that the db filtering will recalculate the free value of every resource from inventories and allocations each time a schedule request comes in. This overhead is unnecessary if the scheduler can accept incremental updates to adjust its cache of free resources.
It also means there must be a mechanism based on strict version control of the scheduler caches to make sure those updates are applied correctly.
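To make that concrete, here is a rough sketch (simplified, with made-up names, not the prototype code) of the version-controlled cache I have in mind: the scheduler keeps per-node free resources and applies numbered deltas sent by the resource trackers, falling back to a full refresh whenever a gap in the sequence numbers shows an update was lost:

    class HostStateCache(object):
        """Scheduler-side cache of free resources, kept fresh by deltas."""

        def __init__(self):
            self.free = {}   # node -> {'vcpus': n, 'ram_mb': n}
            self.seq = {}    # node -> last applied update sequence number

        def apply_update(self, node, seq, deltas, refresh_cb):
            # refresh_cb(node) is assumed to return the node's full
            # free-resource dict (e.g. from the compute node or the db).
            expected = self.seq.get(node, 0) + 1
            if seq < expected:
                return                      # stale or duplicate update, drop it
            if seq > expected:
                # A gap means an update was lost; the cache can no longer
                # be trusted, so fetch the node's full state again.
                self.free[node] = refresh_cb(node)
            else:
                state = self.free.setdefault(node, {})
                for resource, delta in deltas.items():
                    state[resource] = state.get(resource, 0) + delta
            self.seq[node] = seq

With something like this, a normal schedule request only reads the in-memory cache; the full-refresh path is only taken when an update is lost.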
> > 2. Races between centralized claims are resolved by rolling back transactions
> and by checking generations (see the design of the "compare and update"
> strategy in https://review.openstack.org/#/c/283253/), which also causes additional
> overhead to the db.
>
> It's worth noting this pattern is designed to work well with a Galera DB cluster,
> including one that has writes going to all the nodes.
I know; my point is that "distributed resource management" with resource trackers doesn't need db locks or transaction rollbacks for those compute-local resources, nor the additional overhead they bring, regardless of the type of database.
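For reference, the "compare and update" pattern I'm referring to looks roughly like the sketch below (simplified, using sqlite3 and made-up table/column names, not the actual resource-providers code). The retry loop on a generation conflict is exactly the extra db overhead that compute-local resources can avoid when the claim is done locally by the resource tracker:

    import sqlite3

    def claim_with_generation(conn, rp_id, vcpus, expected_gen, retries=3):
        """Claim vcpus on one resource provider; races are detected via the
        generation column and resolved by re-reading and retrying."""
        for _ in range(retries):
            cur = conn.execute(
                "UPDATE inventories "
                "SET used = used + ?, generation = generation + 1 "
                "WHERE resource_provider_id = ? "
                "  AND generation = ? "
                "  AND used + ? <= total",
                (vcpus, rp_id, expected_gen, vcpus))
            conn.commit()
            if cur.rowcount == 1:
                return True                 # the claim landed atomically
            # Someone else updated the row first: refresh the generation
            # and try again, paying another round trip to the db.
            row = conn.execute(
                "SELECT generation FROM inventories "
                "WHERE resource_provider_id = ?", (rp_id,)).fetchone()
            if row is None:
                return False
            expected_gen = row[0]
        return False                        # still racing after retries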
> > 3. The db overhead of the filtering operations can be reduced by moving
> > them to the schedulers; that will be 38 times faster and can be executed
> > in parallel by schedulers, according to the column "Placement avg query
> > time". See http://paste.openstack.org/show/488715/
> > 4. The "compare and update" overhead can be partially reduced by using
> distributed resource claims in resource trackers. There is no need to roll back
> transactions when updating inventories of compute-local resources in order to be
> accurate. This is confirmed by checking the db records at the end of each run of
> the eventually consistent scheduler state design.
> > 5. If a) all the filtering operations are done inside schedulers,
> > b) schedulers do not need to refresh caches from db because of
> incremental updates,
> > c) there is no need to do "compare and update" for compute-local
> resources (i.e. non-shared resources),
> > then here is the performance comparison using 1 scheduler
> > instance: http://paste.openstack.org/show/488717/
>
> The other side of the coin here is sharding.
>
> For example, we could have a dedicated DB cluster for just the scheduler data
> (need to add code to support that, but should be possible now, I believe).
>
> Consider if you have three types of hosts, that map directly to specific flavors.
> You can shard your scheduler and DB clusters into those groups (i.e. compute
> node inventory lives only in one of the shards). When the request comes in you
> just route appropriate build requests to each of the scheduler clusters.
>
> If you have a large enough deployment, you can shard your hosts across several
> DB clusters, and use a modulo or random sharding strategy to pick which cluster
> the request lands on. There are issues around ensuring you do capacity planning
> that takes into account those shards, but aligning the shards with Availability
> Zones, or similar, might stop that being an additional burden.
I think both designs can support sharding. If I understand it correctly, it means the schedulers will use the "modulo" partition strategy, as in the experiment.
As you can see, there is no performance or accuracy regression in the emulation of my design.
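For what it's worth, the "modulo" partition used in the experiment boils down to something like this (hypothetical names, just to illustrate the routing): each compute node's inventory lives in exactly one shard, an incoming build request is routed to one of the shards, and the schedulers behind that shard only consider the hosts it owns.

    import zlib

    DB_CLUSTERS = ['placement-db-0', 'placement-db-1', 'placement-db-2']

    def shard_for_host(compute_node):
        """Each compute node's inventory lives in exactly one db cluster."""
        return DB_CLUSTERS[zlib.crc32(compute_node.encode()) % len(DB_CLUSTERS)]

    def shard_for_request(request_id):
        """Route a build request to one shard; the schedulers behind that
        shard only consider the hosts it owns."""
        return DB_CLUSTERS[zlib.crc32(request_id.encode()) % len(DB_CLUSTERS)]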
> > Finally, it is not fair to directly compare the actual abilities of the resource-provider
> scheduler and the shared-state scheduler using this benchmarking tool, because
> 300 extra processes need to be created in order to simulate the
> distributed resource management of 300 compute nodes, and there are no
> conductors or MQ in the simulation. But I think it is still useful to provide at
> least some statistics.
>
> Really looking forward to a simulator that can test them all in a slightly more real
> way (maybe using fake virt, and full everything else?).
I've made good progress on that, almost exactly what you are expecting, so wait and see :-)
>
> Thanks,
> John
Regards,
-Yingxin