[openstack-dev] [nova] Proposal for an Experiment

Jesse Cook jesse.cook at RACKSPACE.COM
Mon Aug 3 19:24:40 UTC 2015



Jesse J. Cook, Compute Team Lead
jesse.cook at rackspace.com
irc: #compute-eng (gimchi)
mobile: 618-530-0659




On 7/20/15, 12:40 PM, "Clint Byrum" <clint at fewbar.com> wrote:

>Excerpts from Jesse Cook's message of 2015-07-20 07:48:46 -0700:
>> 
>> On 7/15/15, 9:18 AM, "Ed Leafe" <ed at leafe.com> wrote:
>> 
>> >-----BEGIN PGP SIGNED MESSAGE-----
>> >Hash: SHA512
>> >
>> >Changing the architecture of a complex system such as Nova is never
>> >easy, even when we know that the design isn't working as well as we
>> >need it to. And it's even more frustrating because when the change is
>> >complete, it's hard to know if the improvement, if any, was worth it.
>> >
>> >So I had an idea: what if we ran a test of that architecture change
>> >out-of-tree? In other words, create a separate deployment, and rip out
>> >the parts that don't work well, replacing them with an alternative
>> >design. There would be no Gerrit reviews or anything that would slow
>> >down the work or add load to the already overloaded reviewers. Then we
>> >could see if this modified system is a significant-enough improvement
>> >to justify investing the time in implementing it in-tree. And, of
>> >course, if the test doesn't show what was hoped for, it is scrapped
>> >and we start thinking anew.
>> 
>> +1
>> >
>> >The important part in this process is defining up front what level of
>> >improvement would be needed to make actually making such a change worth
>> >considering, and what sort of tests would demonstrate whether or not
>> >this level was met. I'd like to discuss such an experiment next week at
>> >the Nova mid-cycle.
>> >
>> >What I'd like to investigate is replacing the current design of having
>> >the compute nodes communicating with the scheduler via message queues.
>> >This design is overly complex and has several known scalability
>> >issues. My thought is to replace this with a Cassandra [1] backend.
>> >Compute nodes would update their state to Cassandra whenever they
>> >change, and that data would be read by the scheduler to make its host
>> >selection. When the scheduler chooses a host, it would post the claim
>> >to Cassandra wrapped in a lightweight transaction, which would ensure
>> >that no other scheduler has tried to claim those resources. When the
>> >host has built the requested VM, it will delete the claim and update
>> >Cassandra with its current state.
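
To make the claim step concrete for myself: here's a rough sketch of what
that conditional write might look like with the DataStax Python driver. The
'nova' keyspace, 'host_state' table, and column names are purely
hypothetical.

    # Rough sketch only: keyspace, table, and column names are made up.
    from cassandra.cluster import Cluster

    session = Cluster(['cass-1', 'cass-2']).connect('nova')

    def try_claim(host, vcpus_used_before, vcpus_needed):
        # Lightweight transaction: the update only applies if the usage we
        # read when making the placement decision is still current, so two
        # schedulers cannot both claim the same spare capacity on a host.
        result = session.execute(
            "UPDATE host_state SET vcpus_used = %s "
            "WHERE host = %s IF vcpus_used = %s",
            (vcpus_used_before + vcpus_needed, host, vcpus_used_before))
        # was_applied surfaces the [applied] column Cassandra returns for
        # conditional writes; False means another scheduler won the race
        # and we should pick a different host.
        return result.was_applied

The claim record the compute node deletes after the build could presumably
ride along in the same conditional batch.
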
>> >
>> >One main motivation for using Cassandra over the current design is
>> >that it will enable us to run multiple schedulers without increasing
>> >the raciness of the system. Another is that it will greatly simplify a
>> >lot of the internal plumbing we've set up to implement in Nova what we
>> >would get out of the box with Cassandra. A third is that if this
>> >proves to be a success, it would also be able to be used further down
>> >the road to simplify inter-cell communication (but this is getting
>> >ahead of ourselves...). I've worked with Cassandra before and it has
>> >been rock-solid to run and simple to set up. I've also had preliminary
>> >technical reviews with the engineers at DataStax [2], the company
>> >behind Cassandra, and they agreed that this was a good fit.
>> >
>> >At this point I'm sure that most of you are filled with thoughts on
>> >how this won't work, or how much trouble it will be to switch, or how
>> >much more of a pain it will be, or how you hate non-relational DBs, or
>> >any of a zillion other negative thoughts. FWIW, I have them too. But
>> >instead of ranting, I would ask that we acknowledge for now that:
>> 
>> Call me an optimist, I think this can work :)
>> 
>> I would prefer a solution that avoids state management altogether and
>> instead depends on each individual making rule-based decisions using
>>their
>> limited observations of their perceived environment. Of course, this has
>> certain emergent behaviors you have to learn from, but on the upside, no
>> more braiding state throughout the system. I don't like the assumption
>> that it has to be a global state management problem when it doesn't have
>> to be. That being said, I'm not opposed to trying a solution like you
>> described using Cassandra or something similar. I generally support
>> improvements :)
>> 
>
>
>> >
>> >a) it will be disruptive and painful to switch something like this at
>> >this point in Nova's development
>> >b) it would have to provide *significant* improvement to make such a
>> >change worthwhile
>> >
>> >So what I'm asking from all of you is to help define the second part:
>> >what we would want improved, and how to measure those benefits. In
>> >other words, what results would you have to see in order to make you
>> >reconsider your initial "nah, this'll never work" reaction, and start
>> >to think that this will be a worthwhile change to make to Nova.
>> 
>> I'd like to see n build requests submitted within 1 second each get
>> successfully scheduled to a host with spare capacity, given a total
>> system capacity of only n * 1.10, where n >= 10000, each cell has ~100
>> hosts, the number of hosts is >= n * 0.10 and <= n * 0.90, and the
>> number of schedulers is >= 2.
>> 
>> For example:
>> 
>> Build requests: 10000 in 1 second
>> Slots for flavor requested: 11000
>> Hosts that can build flavor: 7500
>> Number of schedulers: 3
>> Number of cells: 75 (each with 100 hosts)
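
For what it's worth, those example numbers do satisfy the constraints; a
trivial check:

    # Quick sanity check of the example scenario above.
    n = 10000                  # build requests submitted within 1 second
    slots = 11000              # slots for the requested flavor
    hosts = 7500               # hosts that can build the flavor
    schedulers = 3
    cells = 75
    hosts_per_cell = 100

    assert slots == n * 11 // 10             # total capacity == n * 1.10
    assert n * 0.10 <= hosts <= n * 0.90     # hosts within the stated band
    assert schedulers >= 2
    assert cells * hosts_per_cell == hosts   # 75 cells of ~100 hosts each
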
>> 
>
>This is right on, though one thing missing is where the current code
>fails this test. It would be great to have the numbers above available
>as a baseline so we can denote progress in any experiment.

The current cell-level scheduling code over-schedules to cells and cannot retry.

>
>Also, I'm a little confused why you'd want cells still, but perhaps the
>idea is to get the scale of one cell so high, you don't actually ever
>want cells, since at that point you should really be building new regions?

Cells are just another horizontal scaling construct, and there are still
good use cases for them: for example, operators standing up more servers in
a region that will come online at some point in the near future.

>
>To your earlier point about state being abused in the system, I
>totally 100% agree. In the past I've wondered a lot if there can be a
>worker model, where compute hosts all try to grab work off queues if
>they have available resources. So API requests for boot/delete don't
>change any state, they just enqueue a message. Queues would be matched
>up to resources and the more filter choices, the more queues. Each
>time a compute node completed a task (create vm, destroy vm) it would
>re-evaluate all of the queues and subscribe to the ones it could satisfy
>right now. Quotas would simply be the first stop for the enqueued create
>messages, and a final stop for the enqueued delete messages (once it's
>done, release the quota). If you haven't noticed, this would agree with
>Robert Collins's suggestion that something like Kafka is a technology more
>suited to this (or my favorite old, often-forgotten solution to this,
>Gearman ;).
>
>This would have no global dynamic state, and very little local dynamic
>state. API, conductor, and compute nodes simply need to know all of the
>choices users are offered, and there is no scheduler at runtime, just
>a predictive queue-list-manager that only gets updated when choices are
>added or removed. This would relieve a ton of the burden currently put
>on the database by scheduling since the only accesses would be simple
>read/writes (that includes 'server-list' type operations since that
>would read a single index key).

I think we are thinking much the same way here. I like the general
approach.
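
To check that I'm reading you right, here's a back-of-the-envelope sketch
of the compute side of that model (flavor names and resource shapes are
made up, and the actual broker is hand-waved away):

    # Each queue corresponds to one offered choice (flavor/filter combo).
    FLAVORS = {
        'm1.small': {'vcpus': 1, 'ram_mb': 2048},
        'm1.large': {'vcpus': 4, 'ram_mb': 8192},
    }

    class ComputeWorker(object):
        def __init__(self, vcpus, ram_mb):
            self.free = {'vcpus': vcpus, 'ram_mb': ram_mb}
            self.subscriptions = set()
            self.resubscribe()

        def can_satisfy(self, req):
            return all(self.free[k] >= v for k, v in req.items())

        def resubscribe(self):
            # Re-evaluated after every create/destroy: listen only on the
            # queues whose requests fit in the currently free resources.
            self.subscriptions = set(name for name, req in FLAVORS.items()
                                     if self.can_satisfy(req))

        def on_create(self, flavor_name):
            for k, v in FLAVORS[flavor_name].items():
                self.free[k] -= v
            self.resubscribe()

        def on_destroy(self, flavor_name):
            for k, v in FLAVORS[flavor_name].items():
                self.free[k] += v
            self.resubscribe()

The API node never reads or writes host state at all; after the quota check
it just drops the boot request on the 'm1.small' queue, and whichever
subscribed worker pulls it first does the build.
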
>
>Anyway, that's way off track, but I think this kind of thinking needs
>to happen and be taken seriously without turning into a bikeshed or fist
>fight. I don't think that will happen naturally until we start measuring
>where we are, and listening to operators as to where they'd like to be
>in relation to that.
>
>So in the interest of ending a long message with actions rather than
>words, let's get some measurement going. Rally? Something else? What can
>we do to measure this?

Performance tests against the 1000-node clusters being set up by OSIC?
Sounds like you have a playground for your tests.
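
Rally seems like the natural starting point for the boot-storm baseline. As
a strawman, the task might look something along these lines (scenario
choice, flavor, image, and runner numbers are just placeholders to argue
over):

    {
        "NovaServers.boot_and_delete_server": [
            {
                "args": {
                    "flavor": {"name": "m1.small"},
                    "image": {"name": "cirros-0.3.4-x86_64"}
                },
                "runner": {
                    "type": "constant",
                    "times": 10000,
                    "concurrency": 1000
                },
                "context": {
                    "users": {"tenants": 10, "users_per_tenant": 10}
                }
            }
        ]
    }

That only gives throughput and failure counts out of the box; per-stage
scheduling latency would need extra instrumentation.
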


