[openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

Jay Pipes jaypipes at gmail.com
Fri Mar 4 00:42:04 UTC 2016

Hi again, Yingxin, sorry for the delayed response... been traveling. 
Comments inline :)

On 03/01/2016 12:34 AM, Cheng, Yingxin wrote:
> Hi,
> I have simulated the distributed resource management with the incremental update model based on Jay's benchmarking framework: https://github.com/cyx1231st/placement-bench/tree/shared-state-demonstration. The complete result lies at http://paste.openstack.org/show/488677/. It was run on a VM with 4 cores and 4GB RAM, and the mysql service is using the default settings with "innodb_buffer_pool_size" set to "2G". The number of simulated compute nodes is set to 300.

A few things.

First, in order to make any predictions or statements about a potential 
implementation's scaling characteristics, you need to run the benchmarks 
with increasing levels of compute nodes. The results you show give us 
only a single dimension of scaling (300 compute nodes). What you want to 
do is run the benchmarks at 100, 200, 400, 800 and 1600 compute node 
scales. You don't need to run *all* of the different permutations of 
placement/partition/workers scenarios, of course. I'd suggest just 
running the none partition strategy and the pack placement strategy at 8 
worker processes. Those results will give you (and us!) the data points 
that will indicate the scaling behaviour of the shared-state-scheduler 
implementation proposal as the number of compute nodes in the deployment 
increases. The "none" partitioning strategy represents the reality of 
the existing scheduler implementation, which does not shard the 
deployment into partitions but retrieves all compute nodes for the 
entire deployment on every request to the scheduler's 
select_destinations() method.

Secondly, and I'm not sure if you intended this, the code in your 
compute_node.py file in the placement-bench project is not thread-safe. 
In other words, your code assumes that only a single process on each 
compute node could ever actually run the database transaction that 
inserts allocation records at any time. If you want more than a single 
process on the compute node to be able to handle claims of resources, 
you will need to modify that code to use a compare-and-update strategy, 
checking a "generation" attribute on the inventory record to ensure that 
another process on the compute node hasn't simultaneously updated the 
allocations information for that compute node.
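The compare-and-update strategy Jay describes could be sketched roughly as follows. This is a minimal illustration using sqlite; the `inventory` table, column names, and retry policy are all hypothetical, not placement-bench's actual schema:

```python
import sqlite3

def claim(conn, node_id, vcpus, ram_mb, retries=3):
    """Attempt a resource claim using compare-and-update on a generation column.

    Hypothetical schema: inventory(node_id, free_vcpus, free_ram_mb, generation).
    """
    for _ in range(retries):
        row = conn.execute(
            "SELECT free_vcpus, free_ram_mb, generation FROM inventory"
            " WHERE node_id = ?", (node_id,)).fetchone()
        if row is None or row[0] < vcpus or row[1] < ram_mb:
            return False  # not enough capacity
        # The UPDATE only succeeds if no other process bumped the
        # generation between our read and our write.
        cur = conn.execute(
            "UPDATE inventory SET free_vcpus = free_vcpus - ?,"
            " free_ram_mb = free_ram_mb - ?, generation = generation + 1"
            " WHERE node_id = ? AND generation = ?",
            (vcpus, ram_mb, node_id, row[2]))
        conn.commit()
        if cur.rowcount == 1:
            return True  # claim landed
        # Lost the race: another writer got there first; re-read and retry.
    return False
```

The generation check is what makes multiple claiming processes per compute node safe: a concurrent writer causes the UPDATE to match zero rows rather than silently double-allocating.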

Third, you have your scheduler workers consuming messages off the 
request queue using get_nowait(), while you left the original placement 
scheduler using the blocking get() call. :) Probably best to compare 
apples to apples and have them both using the blocking get() call.
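The difference matters for the benchmark because get_nowait() forces the caller into a busy-wait loop. A minimal sketch of the two consumption styles (illustrative only, not the placement-bench worker code):

```python
import queue
import threading

q = queue.Queue()

def consume_blocking(results):
    # Blocking get(): the worker thread sleeps until a request arrives;
    # no CPU is burned while the queue is empty.
    results.append(q.get())

def consume_polling(results):
    # get_nowait(): raises queue.Empty when idle, forcing a spin loop
    # that burns CPU and skews a comparison against the blocking variant.
    while True:
        try:
            results.append(q.get_nowait())
            return
        except queue.Empty:
            continue

blocking_results = []
t = threading.Thread(target=consume_blocking, args=(blocking_results,))
t.start()
q.put("request-1")  # the blocked worker wakes up and consumes this
t.join()
```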

> First, the conclusions from the result of the eventually consistent scheduler state simulation (i.e. rows where "do claim in compute?" = Yes):
> #accuracy
> 1. The final decision accuracy is 100%: no resource usage exceeds the real capacity, as verified by examining the sanity of the db records at the end of each run.

Again, with your simulation, this assumes only a single thread will ever 
attempt a claim on each compute node at any given time.

> 2. The schedule decision accuracy is 100% if there's only one scheduler: all successful scheduler decisions also succeed on the compute nodes, thus no retries are recorded, i.e. "Count of requests processed" = "Placement query count". See http://paste.openstack.org/show/488696/

Yep, no disagreement here :)

> 3. The schedule decision accuracy is 100% if "Partition strategy" is set to "modulo", no matter how many scheduler processes there are. See http://paste.openstack.org/show/488697/
> #racing

Yep, modulo partitioning eliminates the race conditions when the number 
of partitions == the number of worker processes. However, this isn't 
representative of the existing scheduler system which processes every 
compute node in the deployment on every call to select_destinations().
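For reference, the modulo partitioning being discussed amounts to something like the following (a minimal illustration, not placement-bench's actual code):

```python
def partition_nodes(node_ids, num_workers):
    """Modulo partitioning: worker i only ever schedules to nodes where
    node_id % num_workers == i, so no two workers race on the same node."""
    buckets = {i: [] for i in range(num_workers)}
    for node_id in node_ids:
        buckets[node_id % num_workers].append(node_id)
    return buckets
```

The race-freedom holds only while the number of partitions equals the number of workers; adding a second worker to the same partition reintroduces contention, which is exactly the scale-out question raised below.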

What happens in the shared-state-scheduler approach when you want to 
scale the scheduler process out with more scheduler processes handling 
more load? What about having two scheduler processes handling the 
scheduling to the same partition (i.e. highly-available scheduling)? 
Both of these situations will introduce contention into the scheduling 
process and introduce races that will manifest themselves on the compute 
nodes instead of in the scheduler processes themselves where the total 
deadlock and retry time can be limited.

> 4. No racing happens if there's only one scheduler process or the "Partition strategy" is set to "modulo", as explained by 2 and 3.

Yes, no disagreement.

> 5. The multiple-schedulers racing rate is extremely low when using the "spread" or "random" placement strategy used by the legacy filter scheduler: this rate is 3.0% using the "spread" strategy and 0.15% using the "random" strategy; note that there are 8 workers processing about 12000 requests within 20 seconds. The result is even better than the resource-provider scheduler (rows where "do claim in compute?" = No), which is 82.9% using the "spread" strategy and 2.52% using the "random" strategy for 12000 requests within 70-190 seconds. See http://paste.openstack.org/show/488699/. Note: the retry rate is calculated as (Placement query count - Count of requests processed) / Count of requests processed * 100%

I think you will find different numbers when we introduce the cost of 
the messages into the system. In addition, I'd like to point out that 
the retries when done in the claim-in-scheduler solution are on the 
order of microseconds whereas retries in the legacy solution are on the 
order of seconds. The retry/race-for-last-spot-on-node issue can be 
addressed using partitioning, as mentioned above.
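For reference, the retry-rate formula quoted above is just the fraction of extra placement queries per processed request:

```python
def retry_rate(placement_query_count, requests_processed):
    """Retry rate as defined in Yingxin's results: extra placement queries
    per processed request, expressed as a percentage."""
    extra = placement_query_count - requests_processed
    return extra / requests_processed * 100.0
```

So 12360 placement queries for 12000 processed requests would be the 3.0% figure cited for the "spread" strategy (the raw counts here are made up for illustration).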

> #overwhelming messages
> 6. The total count of messages is only affected by the number of schedulers and the number of schedule queries, NOT by the number of compute nodes. See http://paste.openstack.org/show/488701/

True, however you are not actually showing the true cost of messages in 
the system in the benchmark utility. An IPC queue push/pop != an AMQP 
broker push/pop :)

> 7. The messages per successful query is (number_of_schedulers + 2)

That is still a significant increase in the number of messages over the 
resource-providers-claim-in-scheduler approach though, so it's important 
that we accurately account for the message costs. Not saying that you 
are wrong, just want to be fair and account for the message costs 
accurately if we can.

> Its growth pattern is linear and only affected by the number of scheduler
> processes. And there is only 1 message if the query failed. This is not a
> huge number, plus there are no additional messages needed to access the db
> during scheduling.

While it's a linear growth pattern, it's still a significant coefficient: 
the number of update messages sent over the bus is multiplied by the number 
of scheduler processes. 1000 requests to launch an instance with 1 scheduler 
process will produce 1000 such messages over the bus; 8 scheduler processes 
turns that into 8000 (on top of the constant 2 messages per request in 
either case).
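Plugging Yingxin's quoted (number_of_schedulers + 2) formula into a quick calculation (assuming, optimistically, that every query succeeds):

```python
def total_messages(num_requests, num_schedulers):
    """Total bus messages per the quoted (number_of_schedulers + 2) formula,
    assuming every request succeeds on the first placement attempt."""
    return num_requests * (num_schedulers + 2)
```

So 1000 requests cost 3000 messages with one scheduler and 10000 with eight, which is why accounting for real AMQP message costs (rather than IPC queue operations) matters for the comparison.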

And while there are no additional messages to update the database, the 
database updates themselves don't go away. Instead of X number of 
scheduler processes updating the database, you have Y number of compute 
nodes sending those updates. And if you use multiple threads on each 
compute node to handle launch requests, you will have Y * <threads> 
connections to the database. If you will use the conductor to throttle 
those now-distributed DB writes, you will need to send a message to the 
conductor to update the DB, and you're back to having more messages in 
the system :)

> Second, here's what I've found in the centralized db claim design (i.e. rows where "do claim in compute?" = No):
> 1. The speed of legacy python filtering is not slow (see rows where "Filter strategy" = python): "Placement total query time" records the cost of all query time, including fetching states from the db and filtering using python. The actual cost of python filtering is (Placement_total_query_time - Placement_total_db_query_time), and that's only about 1/4 of the total cost or even less. It also means python in-memory filtering is much faster than db filtering in this experiment. See http://paste.openstack.org/show/488710/

Heh, you only tested the above on 300 compute nodes :) The more compute 
nodes you have in the deployment, the more work the Python legacy 
filtering needs to do. This is why it's important to re-run the 
benchmarks at varying scales of compute nodes. I found that at small 
numbers of compute nodes, the difference you see isn't great -- just 
like you show in the above results. However, the savings you see from 
doing the filtering on the DB side increase the bigger the number of 
compute nodes in the system.
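The scaling effect is easy to demonstrate: legacy-style in-memory filtering touches every node on every request, so its cost grows with deployment size. A toy micro-benchmark along those lines (the node dicts and filter criteria are invented for illustration):

```python
import random
import timeit

def python_filter(nodes, req_vcpus=2, req_ram_mb=2048):
    # Legacy-style in-memory filtering: every request scans every node,
    # so per-request cost is proportional to the number of compute nodes.
    return [n for n in nodes
            if n["vcpus"] >= req_vcpus and n["ram_mb"] >= req_ram_mb]

def time_at_scale(count, repeats=100):
    random.seed(count)  # deterministic node population per scale
    nodes = [{"vcpus": random.randint(1, 8),
              "ram_mb": random.choice([1024, 2048, 4096])}
             for _ in range(count)]
    return timeit.timeit(lambda: python_filter(nodes), number=repeats)

for count in (100, 200, 400, 800, 1600):
    print(count, round(time_at_scale(count), 4))
```

At 300 nodes the scan is cheap, which is why the single-scale results above understate the difference.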

> 2. The speed of the `db filter strategy` and the legacy `python filter strategy` are of the same order of magnitude; not a huge improvement. See the comparison in the column "Placement total query time". Note that the extra cost of the `python filter strategy` mainly comes from "Placement total db query time" (i.e. fetching states from the db). See http://paste.openstack.org/show/488709/

You need to test at more than a single dimension of the number of 
compute nodes :)

> Third, my major concern with the "centralized db claim" design is: it puts too much scheduling work into the centralized db, and it is not scalable by simply adding conductors and schedulers.

This is certainly a valid concern, also raised by Sylvain. I'd like to 
point out that there is an additional benefit to the centralized DB 
claim strategy around having a single pane of glass view of all 
resources in the system, including shared resource pool inventory. This 
is one thing that is problematic if you claim and account for inventory 
on the distributed compute nodes. Those nodes that share resources on, 
say, a shared storage system, don't have the ability to properly claim 
resources or accurately report capacity and usage.

> 1. The filtering work is mostly done inside the db by executing complex SQL. If the filtering logic becomes much more complex (currently only CPU and RAM are accounted for in the experiment), the db overhead will be considerable.

I continue to ask that we back up statements like this with data. Let's 
not assume that a SQL statement that may seem complex at first glance 
will actually be inefficient when executed by a query planner :)
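As a concrete (toy) illustration of letting the planner do the work: with an index covering the filtered columns, a filter-and-sort query can be served largely from the index. This sqlite sketch uses an invented schema, not the actual resource-providers tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inventory (node_id INTEGER PRIMARY KEY,
                        free_vcpus INTEGER, free_ram_mb INTEGER);
CREATE INDEX ix_inventory_free ON inventory (free_vcpus, free_ram_mb);
""")
# 300 synthetic compute nodes with varying free capacity.
conn.executemany("INSERT INTO inventory VALUES (?, ?, ?)",
                 [(i, i % 8, (i % 4) * 1024) for i in range(300)])

# A filter that looks "complex" but is cheap for the planner to execute.
query = ("SELECT node_id FROM inventory"
         " WHERE free_vcpus >= ? AND free_ram_mb >= ?"
         " ORDER BY free_vcpus DESC LIMIT 5")
# EXPLAIN QUERY PLAN shows how sqlite will actually execute it.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (2, 1024)).fetchall()
rows = conn.execute(query, (2, 1024)).fetchall()
```

The point is simply that "complex at first glance" and "expensive to execute" are different properties; the planner's chosen strategy, not the SQL's length, is what should be measured.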

> 2. The racing of centralized claims is resolved by rolling back transactions and by checking the generations (see the design of the "compare and update" strategy in https://review.openstack.org/#/c/283253/), which also causes additional db overhead.

Yes, this is absolutely correct. However, please see my note above about 
the compute_node.py code in your benchmark program being single-threaded; 
if you want it to be multi-threaded you would need to similarly handle 
concurrent writes to the same hot data location.

> 3. The db overhead of the filtering operation can be reduced by moving it to the schedulers; that is 38 times faster and can be executed in parallel by schedulers according to the column "Placement avg query time". See http://paste.openstack.org/show/488715/
> 4. The "compare and update" overhead can be partially reduced by using distributed resource claims in resource trackers. There is no need to roll back transactions when updating inventories of compute-local resources in order to be accurate. This is confirmed by checking the db records at the end of each run of the eventually consistent scheduler state design.

Again, here you are assuming a single-threaded compute node.

> 5. If a) all the filtering operations are done inside schedulers,
>          b) schedulers do not need to refresh caches from db because of incremental updates,
>          c) it is no need to do "compare and update" to compute-local resources(i.e. none-shared resources),
>       then here is the performance comparison using 1 scheduler instances: http://paste.openstack.org/show/488717/
> Finally, it is not fair to directly compare the actual ability of the resource-provider scheduler and the shared-state scheduler using this benchmarking tool, because 300 more processes would need to be created in order to simulate the distributed resource management of 300 compute nodes, and there are no conductors or MQ in the simulation. But I think it is still useful to provide at least some statistics.

Agreed. I've been trying to think of a way that would be fair to emulate 
the shared-state-scheduler approach and have yet to come up with a good 
solution :( I'll try on the plane ride back home tomorrow to give this 
some thought. :)


> Regards,
> -Yingxin
> On Wednesday, February 24, 2016 5:04 PM, Cheng, Yingxin wrote:
>> Very sorry for the delay; it is hard for me to reply to all the concerns raised,
>> as most of you have years more experience. I've tried hard to present that there is
>> a direction to solve the issues of the existing filter scheduler in multiple areas,
>> including performance, decision accuracy, multiple scheduler support and race
>> conditions. I'll also support any other solution if it can solve the same issues
>> elegantly.
>> @Jay Pipes:
>> I feel that the scheduler team will agree with a design that can fulfill thousands of
>> placement queries with thousands of nodes. But for a scheduler that will be
>> split out to be used in wider areas, it's not simple to predict that
>> requirement, so I don't agree with the statement "It doesn't need to be high-
>> performance at all". No system exists without a performance bottleneck,
>> including the resource-provider scheduler and the shared-state scheduler. I
>> was trying to point out where the potential bottleneck is in each design and how
>> to improve it if the worst happens, quote:
>> "The performance bottleneck of resource-provider and legacy scheduler is from
>> the centralized db (REMOVED: and scheduler cache refreshing). It can be
>> alleviated by changing to a stand-alone high performance database. And the
>> cache refreshing is designed to be replaced by direct SQL queries according to
>> resource-provider scheduler spec. The performance bottleneck of shared-state
>> scheduler may come from the overwhelming update messages, it can also be
>> alleviated by changing to stand-alone distributed message queue and by using
>> the "MessagePipe" to merge messages."
>> I'm not saying that there is a bottleneck in the resource-provider scheduler in
>> fulfilling the current design goal. The ability of the resource-provider scheduler is
>> already proven by a nice modeling tool implemented by Jay; I trust it. But I care
>> more about the actual limit of each design and how easily they can be extended
>> to increase that limit. That's why I turned to making efforts in the scheduler
>> functional test framework (https://review.openstack.org/#/c/281825/). I finally want to
>> test scheduler functionality using greenthreads in the gate, and test the
>> performance and placement of each design using real processes. And I hope the
>> scheduler can be open to both centralized and distributed designs.
>> I've updated my understanding of three designs:
>> https://docs.google.com/document/d/1iNUkmnfEH3bJHDSenmwE4A1Sgm3vnqa6oZJWtTumAkw/edit?usp=sharing
>> The "cache updates" arrows are changed to "resource updates" in resource-
>> provider scheduler, because I think resource updates from virt driver are still
>> needed to be updated to the central database. Hope this time it's right.
>> @Sylvain Bauza:
>> As the first step towards the shared-state scheduler, the required changes are kept
>> to a minimum. There are no db modifications needed, so no rolling upgrade issues
>> with data migration. The new scheduler can decide not to make decisions for old
>> compute nodes, or try to refresh host states from the db and use the legacy way to
>> make decisions until all the compute nodes are upgraded.
>> I have to admit that my prototype still lacks the work needed to deal with
>> overwhelming messages. This design works best using a distributed message queue. Also,
>> if we try to initiate multiple scheduler processes/workers on a single host, there is
>> a lot more to be done to reduce update messages between compute nodes and
>> scheduler workers. But I see the potential of distributed resource
>> management/scheduling and would like to make efforts in this direction.
>> If we agree that final decision accuracy is guaranteed in both
>> directions, we should care more about the final decision throughput of both
>> designs. Theoretically the shared-state design is better because the final
>> consumptions are made in a distributed way, but there are difficulties in
>> reaching that limit. However, the centralized design is easier to push toward
>> its theoretical performance because of the lightweight implementation inside
>> the scheduler and the powerful underlying database.
>> Regards,
>> -Yingxin
