[openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

Sylvain Bauza sbauza at redhat.com
Fri Mar 4 08:59:53 UTC 2016

Both of you, thanks for your insights. Greatly appreciated.

Le 04/03/2016 09:25, Cheng, Yingxin a écrit :
> Hi,
> First of all, many belated thanks for Jay Pipes's benchmarking framework, learnt a lot from it : )
> Other comments inline.
> On Friday, March 4, 2016 8:42 AM Jay Pipes wrote:
>> Hi again, Yingxin, sorry for the delayed response... been traveling.
>> Comments inline :)
>> On 03/01/2016 12:34 AM, Cheng, Yingxin wrote:
>>> Hi,
>>> I have simulated the distributed resource management with the incremental
>> update model based on Jay's benchmarking framework:
>> https://github.com/cyx1231st/placement-bench/tree/shared-state-
>> demonstration. The complete result lies at
>> http://paste.openstack.org/show/488677/. It was run on a VM with 4 cores and
>> 4GB RAM, and the mysql service uses the default settings except for
>> "innodb_buffer_pool_size", which is set to "2G". The number of simulated compute
>> nodes is set to "300".
>> A few things.
>> First, in order to make any predictions or statements about a potential
>> implementation's scaling characteristics, you need to run the benchmarks with
>> increasing levels of compute nodes. The results you show give us only a single
>> dimension of scaling (300 compute nodes). What you want to do is run the
>> benchmarks at 100, 200, 400, 800 and 1600 compute node scales. You don't
>> need to run *all* of the different permutations of placement/partition/workers
>> scenarios, of course. I'd suggest just running the none partition strategy and the
>> pack placement strategy at 8 worker processes. Those results will give you (and
>> us!) the data points that will indicate the scaling behaviour of the shared-state-
>> scheduler implementation proposal as the number of compute nodes in the
>> deployment increases. The "none" partitioning strategy represents the reality of
>> the existing scheduler implementation, which does not shard the deployment
>> into partitions but retrieves all compute nodes for the entire deployment on
>> every request to the scheduler's
>> select_destinations() method.
> Hmm... good suggestion. I'd rather not run all the benchmarks either; it makes me wait a whole day and produces a lot of data to evaluate.
> 300 is the maximum number I can test in my environment; beyond that, the db refuses to work because of connection limits, since all those nodes are asking for connections. Should I emulate "conductors" to limit the db connections, build up a thread pool to connect, or edit the db configuration? I'm wondering if I should write a new tool to run tests in a more realistic environment.

Yeah, I think you need to simulate the same conditions as a real 
production environment, where the compute nodes are not actually writing 
to the DB directly, but going through conductors instead.
Having the same model (ie. AMQP casts to the conductor and the conductor 
worker writing to the DB) would give us more accurate figures 
for that, and would potentially help you scale your lab.

Again, we have a fake oslo.messaging driver for this kind of purpose; I 
guess you should take a look at the Nova in-tree functional tests to see 
how we set that up.
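
To sketch what I mean (all names here are hypothetical, with threads standing in for greenthreads and a lock standing in for a real DB connection): the compute nodes only cast messages, and a small pool of conductor workers is the only thing that ever touches the DB, which caps the connection count at the number of conductors rather than the number of computes:

```python
import queue
import threading

NUM_COMPUTES = 300
NUM_CONDUCTORS = 4   # hypothetical: caps the concurrent "db connections"

updates = queue.Queue()
db_rows = {}
db_lock = threading.Lock()  # stands in for a real DB connection/transaction

def conductor_worker():
    # Conductors drain the queue; they are the only actors that
    # write to the "database".
    while True:
        msg = updates.get()
        if msg is None:
            break
        node, usage = msg
        with db_lock:
            db_rows[node] = usage

def compute_node(node_id):
    # A compute node never opens a DB connection; it only casts a
    # message to the conductors (emulating an AMQP cast).
    updates.put((node_id, {"vcpus_used": 1}))

conductors = [threading.Thread(target=conductor_worker)
              for _ in range(NUM_CONDUCTORS)]
for t in conductors:
    t.start()
for n in range(NUM_COMPUTES):
    compute_node(n)
for _ in conductors:
    updates.put(None)  # one sentinel per conductor
for t in conductors:
    t.join()
```

That topology is what lets a 300-node lab avoid the "too many connections" wall: the DB only ever sees NUM_CONDUCTORS writers.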

>> Secondly, and I'm not sure if you intended this, the code in your
>> compute_node.py file in the placement-bench project is not thread-safe.
>> In other words, your code assumes that only a single process on each compute
>> node could ever actually run the database transaction that inserts allocation
>> records at any time.
> [a]
> So, a single thread in each node is already good enough to support the "shared-state" scheduler in making 1000+ decisions per second. And because those claims are made in a distributed way on the nodes, they are actually written to the db by 300 parallel nodes in nature. AFAIK, the compute node is single threaded; it actually uses greenthreads instead of real threads.

True, but keep in mind that we have a very weird design where the 
compute manager actually initializes 1 or more ResourceTrackers 
(depending on the number of "nodes" attached to the compute-manager 
service), which means that you potentially have synchronized sections 
running concurrently when trying to update the stats.

I would appreciate it if you could amend your branch to add a synchronized 
section there: 
that would really simulate how the RT works. Adding the possibility 
of a 1:N relationship between node(s) and service would also 
make the simulator closer to the real situation.
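
To illustrate the synchronized section I'm asking for, here's a rough sketch (class and attribute names are mine, not the simulator's): N trackers belonging to one compute service all share a single service-level lock, the way nova's RT serializes stat updates behind COMPUTE_RESOURCE_SEMAPHORE:

```python
import threading

class ResourceTracker:
    """One tracker per "node"; all trackers of a service share one lock."""
    def __init__(self, node, lock):
        self.node = node
        self._lock = lock
        self.vcpus_used = 0

    def claim(self, vcpus):
        # The synchronized section: only one tracker of the service
        # may update its stats at any given time, even though the
        # trackers manage distinct nodes.
        with self._lock:
            self.vcpus_used += vcpus

class ComputeService:
    def __init__(self, nodes):
        lock = threading.Lock()
        # The 1:N relationship between the service and its nodes.
        self.trackers = {n: ResourceTracker(n, lock) for n in nodes}

svc = ComputeService(["node-a", "node-b"])
svc.trackers["node-a"].claim(2)
```

With that in place, the simulator would exhibit the same cross-node serialization a real multi-node compute service does.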
>> If you want more than a single process on the compute
>> node to be able to handle claims of resources, you will need to modify that code
>> to use a compare-and-update strategy, checking a "generation" attribute on the
>> inventory record to ensure that another process on the compute node hasn't
>> simultaneously updated the allocations information for that compute node.
> I still don't think the compare-and-update strategy should be forced onto "compute-local" resources, even if the compute service is changed to use multiple processes. Scheduler decisions about those "compute-local" resources can be checked and confirmed against the accurate in-memory view of local resources in the resource tracker, which is far faster than db operations. And the following inventory insertion can then be concurrent, without locks.
> The db is only responsible for using the "compare-and-update strategy" to claim the shared resources, persisting the confirmed scheduler decision with its consumption into inventories, and then telling the compute service that it's OK to start the long job of spawning the VM.

No, I think Jay's right: you should use a compare-and-update strategy 
when writing the "compute claim", even if that's not strictly necessary 
yet, for the exact purpose of the 1:N relationship I mentioned above.
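
A minimal sketch of that generation-based compare-and-update (names are illustrative; the actual proposal does this in SQL, roughly as an UPDATE guarded by the generation value that was read):

```python
class ConcurrentUpdateError(Exception):
    pass

class Inventory:
    def __init__(self):
        self.generation = 0
        self.used = 0

def claim(inv, amount, expected_generation):
    # Compare-and-update: the write only succeeds if nobody else
    # bumped the generation since we read it.
    if inv.generation != expected_generation:
        raise ConcurrentUpdateError("inventory changed, re-read and retry")
    inv.used += amount
    inv.generation += 1

inv = Inventory()
gen = inv.generation    # read the current generation
claim(inv, 2, gen)      # write succeeds, generation advances
try:
    claim(inv, 2, gen)  # stale generation: the caller must retry
except ConcurrentUpdateError:
    pass
```

The point is that two RT threads racing on the same inventory row can both read generation 0, but only one write lands; the other gets a cheap, explicit retry instead of silently overcommitting.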

>> Third, you have your scheduler workers consuming messages off the request
>> queue using get_nowait(), while you left the original placement scheduler using
>> the blocking get() call. :) Probably best to compare apples to apples and have
>> them both using the blocking get() call.
> Sorry, I don't agree with this. Consuming messages and getting requests are two entirely different things. I've tried adding a timer around the "get()" method; there is actually no blocking, because the requests are already prepared and put into the queue. Note there is a "None" for every scheduler at the end of the request queue, and the emulated scheduler will stop getting more requests immediately if there are no more requests. There is no wait at all.
> Also, if you look into the code, "get_nowait()" is simply an implementation detail to process available messages ASAP during scheduling. The "shared-state" emulation will actually be BLOCKED waiting for messages from compute nodes at the end, until all the requests are confirmed "succeeded" or "failed" by the scheduler. That wait time is already taken into account in the "Total wallclock time". And that's why "Placement query count" = "Placement found provider count" + "Placement no found provider count" in the shared-state emulation.

That's a fair point; I tend to agree, those are different things.
>>> First, the conclusions from the result of the eventually consistent scheduler
>> state simulation(i.e. rows that "do claim in compute?" = Yes):
>>> #accuracy
>>> 1. The final decision accuracy is 100%: No resource usage will exceed the real
>> capacity by examining the rationality of db records at the end of each run.
>> Again, with your simulation, this assumes only a single thread will ever attempt a
>> claim on each compute node at any given time.
> Explained above in [a].
>>> 2. The schedule decision accuracy is 100% if there's only one
>>> scheduler: The successful scheduler decisions are all succeeded in
>>> compute nodes, thus no retries recorded, i.e. "Count of requests
>>> processed" = "Placement query count". See
>>> http://paste.openstack.org/show/488696/
>> Yep, no disagreement here :)
>>> 3. The schedule decision accuracy is 100% if "Partition strategy" is
>>> set to "modulo", no matter how many scheduler processes. See
>>> http://paste.openstack.org/show/488697/
>>> #racing
>> Yep, modulo partitioning eliminates the race conditions when the number of
>> partitions == the number of worker processes. However, this isn't representative
>> of the existing scheduler system which processes every compute node in the
>> deployment on every call to select_destinations().
> Agreed.
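
For clarity, the "modulo" partition strategy boils down to something like this minimal sketch (function and variable names are mine, not the benchmark's): each worker only ever considers its own residue class of nodes, so two workers can never race for the same node.

```python
def nodes_for_worker(node_ids, worker_id, num_workers):
    # "modulo" partitioning: worker i only sees nodes whose
    # id % num_workers == i, giving disjoint partitions.
    return [n for n in node_ids if n % num_workers == worker_id]

nodes = list(range(12))
partitions = [nodes_for_worker(nodes, w, 4) for w in range(4)]
# Every node lands in exactly one partition, hence no races --
# but also no single worker ever sees the whole deployment.
```

That last comment is exactly Jay's caveat: this eliminates races only because it shards the view, unlike the existing scheduler.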
>> What happens in the shared-state-scheduler approach when you want to scale
>> the scheduler process out with more scheduler processes handling more load?
> Yes, it needs to be carefully evaluated.
>> What about having two scheduler processes handling the scheduling to the same
>> partition (i.e. highly-available scheduling)?
> No. If we only want HA scheduling, why would we want schedulers to fight each other? We should instead use an active-passive model for HA schedulers: the passive scheduler only accepts resource updates and does no scheduling until the active scheduler fails.

I strongly disagree.
Our OpenStack design tenets [dt] don't imply any active-passive setup, 
but rather a shared-nothing approach. So having independent schedulers 
running in parallel is a key thing, and I wouldn't accept a limitation 
like this that would lock deployments into a certain topology.

Jay's point is valid. If you're running two scheduler processes at the 
same time, they would possibly have different views.
That said, I think it's perfectly okay to assume that there could be a 
discrepancy leading to retries. That's the trade-off of having the 
compute nodes owning the information, but I just think the retry ratio 
should be far lower than what we have at the moment, and that's what 
I'd like to see in your placement benchmark, Yingxin (ie. is the retry 
ratio very low, and does it avoid growing exponentially when adding more 
computes or requests?)

[dt] https://wiki.openstack.org/wiki/BasicDesignTenets
>> Both of these situations will introduce contention into the scheduling process
>> and introduce races that will manifest themselves on the compute nodes instead
>> of in the scheduler processes themselves where the total deadlock and retry
>> time can be limited.
>>> 4. No racing is happened if there's only one scheduler process or the "Partition
>> strategy" is set to "modulo", explained by 2. 3.
>> Yes, no disagreement.
>>> 5. The multiple-schedulers racing rate is extremely low using the
>>> "spread" or "random" placement strategy used by legacy filter
>>> scheduler: This rate is 3.0% using "spread" strategy and 0.15% using
>>> "random" strategy, note that there are 8 workers in processing about
>>> 12000 requests within 20 seconds. The result is even better than
>>> resource-provider scheduler(rows that "do claim in compute?" = No),
>>> that's 82.9% using "spread" strategy and 2.52% using "random" strategy
>>> of 12000 requests within 70-190 seconds. See
>>> http://paste.openstack.org/show/488699/. Note, retry rate is
>>> calculated by (Placement query count - Count of requests processed) /
>>> Count of requests processed * 100%
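
As a side note, the retry-rate formula quoted above maps directly to code; the sample counters below are illustrative placeholders chosen to reproduce the 3.0% figure, not actual benchmark output:

```python
def retry_rate(placement_query_count, requests_processed):
    # retry rate = (Placement query count - Count of requests processed)
    #              / Count of requests processed * 100%
    return ((placement_query_count - requests_processed)
            / requests_processed * 100.0)

# Hypothetical counters: 12000 requests needing 12360 placement
# queries in total would mean a 3.0% retry rate.
rate = retry_rate(12360, 12000)
```

Every query beyond one per processed request is, by this definition, a retry forced by a race.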
>> I think you will find different numbers when we introduce the cost of the
>> messages into the system.
> Agreed.
>> In addition, I'd like to point out that the retries when
>> done in the claim-in-scheduler solution are on the order of microseconds
>> whereas retries in the legacy solution are on the order of seconds. The
>> retry/race-for-last-spot-on-node issue can be addressed using partitioning, as
>> mentioned above.
> In the claim-in-scheduler design, the claims to compute-local resources from different schedulers will fight each other inside the centralized db, even though they won't actually overcommit the resources, and this causes unnecessary retries.
> But in the distributed claim design, all the claims to the same compute-local resources are sent to the corresponding compute node. There is no such contention, because the compute node first checks the consumption of those "compute-local" resources using its accurate in-memory view, which is really fast.
>>> #overwhelming messages
>>> 6. The total count of messages are only affected by the number of
>>> schedulers and the number of schedule queries, NOT by the number of
>>> compute nodes. See http://paste.openstack.org/show/488701/
>> True, however you are not actually showing the true cost of messages in the
>> system in the benchmark utility. An IPC queue push/pop != an AMQP broker
>> push/pop :)
> It truly is.
>>> 7. The messages per successful query is (number_of_schedulers + 2)
>> That is still a significant increase in the number of messages over the resource-
>> providers-claim-in-scheduler approach though, so it's important that we
>> accurately account for the message costs. Not saying that you are wrong, just
>> want to be fair and account for the message costs accurately if we can.
> True, I still don't know the actual cost.
>>   >, its growth pattern is linear and only affected by the number of scheduler
>> processes. And there is only 1 message if the query failed. It is not a huge
>> number, plus there are no additional messages needed to access the db during scheduling.
>> While it's a linear growth pattern, it's still a significant coefficient that will
>> multiply the number of messages sent over the bus by the number of scheduler
>> processes. 1000 requests to launch an instance with
>> 1 scheduler process will produce 1000 messages over the bus. 8 scheduler
>> processes turns into 8000 messages.
> Yes, but it depends on how much performance gain we can get from the "shared-state" design. If the performance can be multiplied by 10, then we don't need to care about 8 schedulers. It's not clear yet.

Honestly, that's a very good point, and I agree with Jay: we need to see 
how things evolve at scale.
I just think that adding more messages to the queue is a reasonable 
trade-off, because we keep the same bottleneck as we have at the moment, 
and one that operators know how to deal with.
What I still don't know is whether that message increase is reasonable or 
whether it just floods the MQ so that it would require adding a lot more 
conductors; that's my concern.
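
To put rough numbers on that concern, here's a sketch of the message count implied by the "(number_of_schedulers + 2)" figure quoted above; the per-query breakdown in the comments is my guess at where those messages go, not something measured:

```python
def total_messages(num_schedulers, num_requests, success_ratio=1.0):
    # Per the figures quoted above: each successful query costs
    # (num_schedulers + 2) messages -- plausibly one claim cast to the
    # chosen compute, one confirmation back, and one incremental update
    # to every scheduler -- while a failed query costs only 1 message.
    # (The breakdown is an assumption; only the totals come from the thread.)
    succeeded = int(num_requests * success_ratio)
    failed = num_requests - succeeded
    return succeeded * (num_schedulers + 2) + failed

# 1000 all-successful requests:
one_sched = total_messages(1, 1000)    # 3 messages per request
eight_sched = total_messages(8, 1000)  # 10 messages per request
```

So going from 1 to 8 schedulers more than triples the MQ traffic per request, which is the coefficient we'd need to weigh against the db-load savings.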

>> And while there are no additional messages to update the database, the
>> database updates themselves don't go away. Instead of X number of scheduler
>> processes updating the database, you have Y number of compute nodes sending
>> those updates. And if you use multiple threads on each compute node to handle
>> launch requests, you will have Y * <threads> connections to the database. If you
>> will use the conductor to throttle those now-distributed DB writes, you will need
>> to send a message to the conductor to update the DB, and you're back to having
>> more messages in the system :)
> My point is: 1 "shared-state" scheduler has the fewest messages, with no extra overhead on the db.
> If multiple schedulers are required to handle the load, we should then weigh the gain from less db overhead against the loss from the extra messages.
>>> Second, here's what I've found in the centralized db claim design(i.e. rows that
>> "do claim in compute?" = No):
>>> 1. The speed of legacy python filtering is not slow(see rows that
>>> "Filter strategy" = python): "Placement total query time" records the
>>> cost of all query time including fetching states from db and filtering
>>> using python. The actual cost of python filtering is
>>> (Placement_total_query_time - Placement_total_db_query_time), and
>>> that's only about 1/4 of total cost or even less. It also means python
>>> in-memory filtering is much faster than db filtering in this
>>> experiment. See http://paste.openstack.org/show/488710/
>> Heh, you only tested the above on 300 compute nodes :) The more compute
>> nodes you have in the deployment, the more work the Python legacy filtering
>> needs to do. This is why it's important to re-run the benchmarks at varying scales
>> of compute nodes. I found that at small numbers of compute nodes, the
>> difference you see isn't great -- just like you show in the above results. However,
>> the savings you see from doing the filtering on the DB side increase the bigger
>> the number of compute nodes in the system.
> That's my bad. I was too eager to show the results before solving the "too many connections" problem. :(
>>> 2. The speed of `db filter strategy` and the legacy `python filter
>>> strategy` are in the same order of magnitude, not a very huge
>>> improvement. See the comparison of column "Placement total query
>>> time". Note that the extra cost of `python filter strategy` mainly
>>> comes from "Placement total db query time"(i.e. fetching states from
>>> db). See http://paste.openstack.org/show/488709/
>> You need to test at more than a single dimension of the number of compute
>> nodes :)
>>> Third, my major concern of "centralized db claim" design is: Putting too much
>> scheduling works into the centralized db, and it is not scalable by simply adding
>> conductors and schedulers.
>> This is certainly a valid concern, also raised by Sylvain. I'd like to point out that
>> there is an additional benefit to the centralized DB claim strategy around having
>> a single pane of glass view of all resources in the system, including shared
>> resource pool inventory. This is one thing that is problematic if you claim and
>> account for inventory on the distributed compute nodes. Those nodes that share
>> resources on, say, a shared storage system, don't have the ability to properly
>> claim resources or accurately report capacity and usage.
Not sure I get your worries, Jay. We can still support shared resources 
with your inventories and allocations tables; that's the whole purpose 
of the resource-providers series.
There are 6 (or more) specs, and generic-resource-pools is focusing on 
that particular shared-resources problem.

Having distributed compute nodes doesn't mean we can't have 
a centralized DB, right?
I honestly take that point as a strawman, or I misunderstood something 
really important.

> I fully support and admire your design for the shared-resources management part and for the resource-abstraction part. I only have concerns about moving the "compute-local" resource management from the resource trackers to the centralized db.
> I'm also very curious about whether it is possible to let resource trackers do the claims to those shared resources using your design.
>>> 1. The filtering works are majorly done inside db by executing complex sqls. If
>> the filtering logic is much more complex(currently only CPU and RAM are
>> accounted in the experiment), the db overhead will be considerable.
>> I continue to ask that we back up statements like this with data. Let's not
>> assume that a SQL statement that may seem complex at first glance will actually
>> be inefficient when executed by a query planner :)
> Sorry, but I see in http://paste.openstack.org/show/488709/ that the SQL filtering isn't that much faster than the legacy state refreshing from the db (300 nodes). And the in-memory filtering can be 4 times faster, or even 100 times faster (see the 1 worker scenario in http://paste.openstack.org/show/488717/, column "Placement avg query time"), without optimization.
> And what do you mean by "query planner"? Is that a feature of mysql?
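
On the "query planner" question: yes, every serious database has one, and in MySQL you inspect it with "EXPLAIN <query>". As a quick self-contained illustration, the stdlib sqlite3 module exposes the same idea via EXPLAIN QUERY PLAN (the table and index names here are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventories (node_id INTEGER, free_ram INTEGER)")
conn.execute("CREATE INDEX ix_free_ram ON inventories (free_ram)")

# Ask the planner how it would execute a filtering query; it should
# report a SEARCH using ix_free_ram rather than a full table SCAN.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT node_id FROM inventories WHERE free_ram >= 1024"
).fetchall()
```

The point Jay is making is that a seemingly complex filtering SQL statement may still execute cheaply because the planner picks an index-driven strategy, so we should measure the plan rather than assume the query is slow.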
>>> 2. The racing of centralized claims are resolved by rolling back transactions
>> and by checking the generations(see the design of "compare and update"
>> strategy in https://review.openstack.org/#/c/283253/), it also causes additional
>> overhead to db.
>> Yes, this is absolutely correct. However, please see my note about the
>> compute_node.py code in your benchmark program being single-threaded and if
>> you want it to be multi-threaded you would need to similarly handle concurrent
>> writes to the same hot data location.
> Please see [a].
>>> 3. The db overhead of filtering operation can be relaxed by moving
>>> them to schedulers, that will be 38 times faster and can be executed
>>> in parallel by schedulers according to the column "Placement avg query
>>> time". See http://paste.openstack.org/show/488715/
>>> 4. The "compare and update" overhead can be partially relaxed by using
>> distributed resource claims in resource trackers. There is no need to roll back
>> transactions in updating inventories of compute local resources in order to be
>> accurate. It is confirmed by checking the db records at the end of each run of
>> eventually consistent scheduler state design.
>> Again, here you are assuming a single-threaded compute node.
> Please refer to [a].
>>> 5. If a) all the filtering operations are done inside schedulers,
>>>           b) schedulers do not need to refresh caches from db because of
>> incremental updates,
>>>           c) it is no need to do "compare and update" to compute-local
>> resources(i.e. none-shared resources),
>>>        then here is the performance comparison using 1 scheduler
>>> instances: http://paste.openstack.org/show/488717/
>>> Finally, it is not fair to directly compare the actual ability of resource-provider
>> scheduler and shared-state scheduler using this benchmarking tool, because
>> there are 300 more processes needed to be created in order to simulate the
>> distributed resource management of 300 compute nodes, and there are no
>> conductors and MQ in the simulation. But I think it is still useful to provide at
>> least some statistics.
>> Agreed. I've been trying to think of a way that would be fair to emulate the
>> shared-state-scheduler approach and have yet to come up with a good
>> solution :( I'll try on the plane ride back home tomorrow to give this some
>> thought. :)
> Let me shed some light here :)
> What I'm trying to do is inject additional logs at all the critical places in nova-api, nova-scheduler, nova-compute and nova-conductor, then use the fake virt driver and fake hosts to start compute nodes as separate processes. Finally, I can analyze the logs with timestamps and collect statistics while processing multiple schedule requests.

Cool, ping us if you need assistance.

>> Best,
>> -jay
>>> Regards,
>>> -Yingxin
> Regards,
> -Yingxin
