[openstack-dev] Scheduler proposal
Monty Taylor
mordred at inaugust.com
Mon Oct 12 19:05:53 UTC 2015
On 10/12/2015 02:45 PM, Joshua Harlow wrote:
> Alec Hothan (ahothan) wrote:
>>
>>
>>
>>
>> On 10/10/15, 11:35 PM, "Clint Byrum"<clint at fewbar.com> wrote:
>>
>>> Excerpts from Alec Hothan (ahothan)'s message of 2015-10-09 21:19:14
>>> -0700:
>>>> On 10/9/15, 6:29 PM, "Clint Byrum"<clint at fewbar.com> wrote:
>>>>
>>>>> Excerpts from Chris Friesen's message of 2015-10-09 17:33:38 -0700:
>>>>>> On 10/09/2015 03:36 PM, Ian Wells wrote:
>>>>>>> On 9 October 2015 at 12:50, Chris
>>>>>>> Friesen<chris.friesen at windriver.com
>>>>>>> <mailto:chris.friesen at windriver.com>> wrote:
>>>>>>>
>>>>>>> Has anybody looked at why 1 instance is too slow and what it
>>>>>>> would take to
>>>>>>>
>>>>>>> make 1 scheduler instance work fast enough? This does
>>>>>>> not preclude the
>>>>>>> use of
>>>>>>> concurrency for finer grain tasks in the background.
>>>>>>>
>>>>>>>
>>>>>>> Currently we pull data on all (!) of the compute nodes out
>>>>>>> of the database
>>>>>>> via a series of RPC calls, then evaluate the various filters
>>>>>>> in python code.
>>>>>>>
>>>>>>>
>>>>>>> I'll say again: the database seems to me to be the problem here.
>>>>>>> Not to
>>>>>>> mention, you've just explained that they are in practice holding
>>>>>>> all the data in
>>>>>>> memory in order to do the work so the benefit we're getting here
>>>>>>> is really a
>>>>>>> N-to-1-to-M pattern with a DB in the middle (the store-to-DB is
>>>>>>> rather
>>>>>>> secondary, in fact), and that without incremental updates to the
>>>>>>> receivers.
>>>>>> I don't see any reason why you couldn't have an in-memory scheduler.
>>>>>>
>>>>>> Currently the database serves as the persistant storage for the
>>>>>> resource usage,
>>>>>> so if we take it out of the picture I imagine you'd want to have
>>>>>> some way of
>>>>>> querying the compute nodes for their current state when the
>>>>>> scheduler first
>>>>>> starts up.
>>>>>>
>>>>>> I think the current code uses the fact that objects are remotable
>>>>>> via the
>>>>>> conductor, so changing that to do explicit posts to a known
>>>>>> scheduler topic
>>>>>> would take some work.
>>>>>>
>>>>> Funny enough, I think thats exactly what Josh's "just use Zookeeper"
>>>>> message is about. Except in memory, it is "in an observable storage
>>>>> location".
>>>>>
>>>>> Instead of having the scheduler do all of the compute node inspection
>>>>> and querying though, you have the nodes push their stats into
>>>>> something
>>>>> like Zookeeper or consul, and then have schedulers watch those stats
>>>>> for changes to keep their in-memory version of the data up to date. So
>>>>> when you bring a new one online, you don't have to query all the
>>>>> nodes,
>>>>> you just scrape the data store, which all of these stores (etcd,
>>>>> consul,
>>>>> ZK) are built to support atomically querying and watching at the same
>>>>> time, so you can have a reasonable expectation of correctness.
>>>>>
>>>>> Even if you figured out how to make the in-memory scheduler crazy
>>>>> fast,
>>>>> There's still value in concurrency for other reasons. No matter how
>>>>> fast you make the scheduler, you'll be slave to the response time of
>>>>> a single scheduling request. If you take 1ms to schedule each node
>>>>> (including just reading the request and pushing out your scheduling
>>>>> result!) you will never achieve greater than 1000/s. 1ms is way lower
>>>>> than it's going to take just to shove a tiny message into RabbitMQ or
>>>>> even 0mq.
>>>> That is not what I have seen, measurements that I did or done by
>>>> others show between 5000 and 10000 send *per sec* (depending on
>>>> mirroring, up to 1KB msg size) using oslo messaging/kombu over
>>>> rabbitMQ.
>>> You're quoting througput of RabbitMQ, but how many threads were
>>> involved? An in-memory scheduler that was multi-threaded would need to
>>> implement synchronization at a fairly granular level to use the same
>>> in-memory store, and we're right back to the extreme need for efficient
>>> concurrency in the design, though with much better latency on the
>>> synchronization.
>>
>> These were single-threaded tests and you're correct that if you had
>> multiple threads trying to send something you'd have some inefficiency.
>> However I'd question the likelihood of that happening as it is very
>> likely that most of the cpu time will be spent outside of oslo
>> messaging code.
>>
>> Furthermore, Python does not need multiple threads to go faster. As a
>> matter of fact, for in-memory operations, it could end up being slower
>> because of the inherent design of the interpreter (and there are many
>> independent measurements that have shown it).
>>
>>
>>>> And this is unmodified/highly unoptimized oslo messaging code.
>>>> If you remove the oslo messaging layer, you get 25000 to 45000
>>>> msg/sec with kombu/rabbitMQ (which shows how inefficient is oslo
>>>> messaging layer itself)
>>>>
>>>>> So I'm pretty sure this is o-k for small clouds, but would be
>>>>> a disaster for a large, busy cloud.
>>>> It all depends on how many sched/sec for the "large busy cloud"...
>>>>
>>> I think there are two interesting things to discern. Of course, the
>>> exact rate would be great to have as a target, but operational security
>>> and just plain secrecy of business models will probably prevent us from
>>> getting at many of these requirements.
>>
>> I don't think that is the case. We have no visibility because nobody
>> has really thought about these numbers. Ops should be ok to provide
>> some rough requirement numbers if asked (everybody is in the same boat).
>>
>>
>>> The second is the complexity model of scaling. We can just think about
>>> the actual cost benefit of running 1, 3, and more schedulers and come up
>>> with some rough numbers for a lower bounds for scheduler performance
>>> that would make sense.
>>>
>>>>> If, however, you can have 20 schedulers that all take 10ms on average,
>>>>> and have the occasional lock contention for a resource counter
>>>>> resulting
>>>>> in 100ms, now you're at 2000/s minus the lock contention rate. This
>>>>> strategy would scale better with the number of compute nodes, since
>>>>> more nodes means more distinct locks, so you can scale out the number
>>>>> of running servers separate from the number of scheduling requests.
>>>> How many compute nodes are we talking about max? How many scheduling
>>>> per second is the requirement? And where are we today with the
>>>> latest nova scheduler?
>>>> My point is that without these numbers we could end up
>>>> under-shooting, over-shooting or over-engineering along with the
>>>> cost of maintaining that extra complexity over the lifetime of
>>>> openstack.
>>>>
>>>> I'll just make up some numbers for the sake of this discussion:
>>>>
>>>> nova scheduler latest can do only 100 sched/sec for 1 instance (I
>>>> guess the 10ms average you bring out may not be that unrealistic)
>>>> the requirement is a sustained 500 sched/sec worst case with 10K
>>>> nodes (that is 5% of 10K and today we can barely launch 100VM/sec
>>>> sustained)
>>>>
>>>> Are we going to achieve 5x with just 3 instances which is what most
>>>> people deploy? Not likely.
>>>> Will using more elaborate distributed infra/DLM like consul/zk/etcd
>>>> going to get us to that 500 mark with 3 instances? Maybe but it will
>>>> be at the expense of added complexity of the overall solution.
>>>> Can we instead optimize nova scheduler with single instance to do
>>>> 500/sec? Maybe but if we succeed we'll get a lot more simple solution.
>>>>
>>> Of course we can. And simple solutions are great. So if you can get one
>>> node to do all the work and the cloud doesn't fall over dead when it
>>> dies
>>> (because you have more waiting on standby), that would be fantastic.
>>>
>>> I'm dubious that this will be a solution that works for large deployers.
>>> But I may be wrong!
>>
>> Control plane apps that had to do more complex work than scheduling
>> instances have been working for a long time at scale using simple
>> active/standby designs.
>> In this case it is easier because we can even afford some scheduling
>> disruption when your active goes down (as long as the standby can pick
>> up most of the pending requests). Heck when an entire openstack
>> controller goes down, I bet failing a few request will be the least of
>> your concerns (because a lot more other things will go wrong).
>>
>>
>>>> Not saying that high concurrency and distributed schedulers is not
>>>> the solution and maybe we really need a distributed solution, but
>>>> it'd be good to have some numbers to frame the discussion.
>>>>
>>> Indeed, however, I don't think we can expect those numbers to
>>> materialize
>>> in public, ever. We have to make some informed guesses and see how the
>>> product managers respond when we tell them what we think should
>>> happen. ;)
>>
>>
>> Isn't that a concern? Shouldn't the TC provide at least some numbers
>> for the scale range so we do not under/over engineer? Any number would
>> be better than no number at all.
>>
>> It is an open secret that Nova/Neutron does not scale well below the
>> 1000-node mark (400/500 seems to be a threshold to do anything serious
>> in production).
>> If we look at all the openstack/Neutron deployments today, nobody has
>> an exact count, but very likely the vast majority of deployments are
>> smaller than 100 nodes. Few are over 100 nodes, even less over 500
>> (well we know some companies/orgs have deployed thousands but none
>> with Neutron).
>> For those 99% of deployments, we only need to service less than 100
>> nodes. In that space the inability to schedule over 100 instances per
>> second is probably not that important. If the problem we are trying to
>> solve here is "scheduling is too slow", then the first question to ask
>> is how much faster do we need to go, then brainstorm on what would be
>> the best way to achieve.
>> I have no problem believing we can schedule a lot more instances/sec
>> by using scale out design with DLMs but there is no free lunch. Is it
>> fair to impose on those 99% deployments the complexity of a machinery
>> that is designed to handle 1000 schedules/sec? Ops are very concerned
>> about adding even more complexity to an already complex platform and I
>> think we should be considerate to that, meaning worry about the cost
>> impact in HW, config, deployment, troubleshooting.
>>
>
> Ah there-in is the question, where do we want to go... This is a tough
> one to answer, and I can offer my inputs (mine would be ~5000
> hypervisors to start and bigger as we get there, with a instance
> bring-up/tear-down rate of let's say 50 instances per second to start
> and a failure rate being well as low as we can get it), but my inputs
> aren't likely others inputs. Sadly if this isn't 1000+ nodes then I
> would start to believe that bigger players will start looking elsewhere
> for solutions. If openstack just wants to focus on the < 500 that's
> cool, but the choice will IMHO decide the future of where openstack
> goes. So it's a tricky question to answer, but a good one to ask and
> think about :)
I want to do 100k hypervisors. No, that's not hyperbole.
Also, I do not think that ZK/consul/etcd are very costly for small
deployments. Given the number of simple dev-oriented projects that start
with "so install ZK/consul/etcd" I think they've all proven their
ability to scale _down_ - and I'm also pretty sure all of them have
installations that clear 100k nodes.
This:
to produce the ubiquitous Open Source Cloud Computing platform that will
meet the needs of public and private clouds regardless of size, by being
simple to implement and massively scalable.
is what we're doing.
Our mission is NOT "produce a mid-range cloud that is too complex for
small deployments and tops out before you get to big ones"
I don't think "handle massive clouds" has ever NOT been on the list of
stated goals. (that mission statement has not changed since we started
the project - although I agree with Joe, it's in need of an update-
there is no mention of users)
BTW - Infra runs against currently runs against clouds rate-limited at
roughly 10 api calls / second. That's just one tenant - but it's a
perfectly managable rate. Now, if the cloud could continue to add nodes
and users without that rate degrading I think we'd be in really good shape.
>> Of course we can still do POC of anything with any DLM without knowing
>> the kind of target numbers we need to meet ;-)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> __________________________________________________________________________
>>
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe:
>> OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
More information about the OpenStack-dev
mailing list