[openstack-dev] [tripleo] Scaling of TripleO

Robert Collins robertc at robertcollins.net
Fri Sep 6 21:31:14 UTC 2013


Hey James, thanks for starting this thread: it's clear we haven't
articulated what we've discussed well enough [it's been a slog
building up from the bottom...]

I think we need to set specific goals - latency, HA, diagnostics -
before it makes sense to design scaling approaches: we can't achieve
things we haven't set out to achieve.

For instance, if the entire set of goals was 'support 10K-node
overclouds', I think we can do that today with a 2-machine undercloud
control plane in full-HA mode.

So we need to be really clear: are we talking about scaling, or
latency @ scale, or ops @ scale - and identify the failure modes we
should cater to, vs ones we shouldn't (or that are outside of our
domain, e.g. 'you need a fully end-to-end multipath network if you
want network resiliency').

My vision for TripleO/undercloud and scale in the long term is:
- A fully redundant self-healing undercloud
  - (implies self hosting)
- And appropriate anti-affinity aggregates so that common failure
domains can be avoided (see the sketch below)
- With a scale-up Heat template that identifies the way to grow capacity
- Able to deploy a 1K-node overcloud in < an hour(*)
- And 10K [if we can get a suitable test environment] in < 2 hours

So that's sublinear performance degradation as scale increases.
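
For concreteness, the anti-affinity aggregates piece might look
something like this from the API side - a minimal sketch, not TripleO
code: the aggregate and host names are made up, and the
python-novaclient usage is purely illustrative:

    # Tag each failure domain (here: a rack) with a host aggregate so
    # the scheduler can spread redundant undercloud services across
    # distinct failure domains.
    import os
    from novaclient import client

    nova = client.Client('2',
                         os.environ['OS_USERNAME'],
                         os.environ['OS_PASSWORD'],
                         os.environ['OS_TENANT_NAME'],
                         os.environ['OS_AUTH_URL'])

    agg = nova.aggregates.create('rack-12', None)   # name is made up
    nova.aggregates.set_metadata(agg, {'failure_domain': 'rack-12'})
    nova.aggregates.add_host(agg, 'undercloud-node-7')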

For TripleO/overcloud and scale, that's something where we need to
synthesise best practices from existing deployers - e.g. cells and so
on - to deliver K+ scale configurations, but it's fundamentally
decoupled from the undercloud: Heat is growing cross-cloud deployment
facilities, so if we need multiple underclouds as a failure mitigation
strategy, we can deploy one overcloud across multiple underclouds that
way. I'm not convinced we need that complexity though: large network
fabrics are completely capable of shipping overcloud images to
machines in a couple of seconds per machine...
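
To put 'a couple of seconds per machine' in perspective, here's the
back-of-envelope arithmetic (the image size and per-node bandwidth
below are assumptions, not measurements):

    # Rough arithmetic only: time to ship one overcloud image to one
    # machine over a non-blocking fabric.
    image_bytes = 2 * 1024 ** 3        # assume a ~2GB compressed image
    link_bytes_per_sec = 10e9 / 8      # assume 10GbE to each machine
    print(image_bytes / link_bytes_per_sec)   # ~1.7s per machine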

(*): Number pulled out of a hat. We'll need to drive it lower over
time, but given we need time to check new builds are stable, and live
migrate thousands of VMs concurrently across hundreds of hypervisors,
I think 1 hour for a 1K-node cloud deployment is sufficiently
aggressive for now.

Now, how to achieve this?

The current all-in-one control plane is like that for three key reasons:
 - small clouds need low-overhead control planes: running 12 or 15
machines to deploy a 3-node overcloud doesn't make sense.
 - bootstrapping an environment has to start on one machine by definition
 - we haven't finished enough of the overall plumbing story to be
working on the scale-out story in much detail
(I'm very interested in where you got the idea that
all-nodes-identical was the scaling plan for TripleO - it isn't :))

Our d-i-b elements are already suitable for scaling different
components independently - that's why nova and nova-kvm are separate:
nova installs the nova software, nova-kvm installs the additional bits
for a kvm hypervisor and configures the service to talk to the bus.
This is how the overcloud scales.
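
For example, a kvm compute image is just a base element plus those
two - a sketch only; the output name here is made up:

    # Illustrative: compose a compute image by layering the nova
    # element (software) and the nova-kvm element (hypervisor bits)
    # with diskimage-builder's disk-image-create.
    import subprocess

    subprocess.check_call(['disk-image-create',
                           '-o', 'overcloud-compute',   # made-up name
                           'ubuntu', 'nova', 'nova-kvm'])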

Now that we have reliable all-the-way-to-overcloud deployments working
in devtest, we've started working on image-based updates
(https://etherpad.openstack.org/tripleo-image-updates), which is a
necessary precondition to scaling the undercloud control plane -
because if you can't update a machine's role, it's really much harder
to evolve a cluster.

The exact design of a scaled cluster isn't pinned down yet: I think
we need much more data before we can sensibly do it, both on
requirements - what's valuable for deployers - and on the scaling
characteristics of nova baremetal/Ironic/keystone etc.

All that said, some specific thoughts on the broad approaches you sketched:
Running all services on all undercloud nodes would drive a lot of
complexity in scale-out: there's a lot of state to migrate to new
Galera nodes, for instance. I would hesitate to structure the
undercloud like that.

I don't really follow some of the discussion in Idea 1, but scaling
out the things that need scaling out seems pretty sensible. We have no
data yet suggesting how many thousand machines we'll get per nova
baremetal machine, so it's very hard to say which services will need
scaling at what points in time: but clearly we need to support it at
some scale. OTOH once we scale to 'an entire datacentre' the
undercloud doesn't need to scale further: I think having each
datacentre be a separate deployment cloud makes a lot of sense.

Perhaps we should just turn the discussion around and ask: what do we
get if we add node type X to an undercloud; what do we get when we add
a new undercloud - and what are the implications of each?

Firstly, let's talk big picture: N-datacentre clouds. I think the
'build a fabric that clearly exposes performance and failure domains'
approach has been very successful at containing complexity in the
fabric and enabling [app] deployers to reason about performance and
failure, so we shouldn't try to hide that. If you have two
datacentres, that should be two regions, with no shared
infrastructure.

That immediately implies (at least) one undercloud per datacentre, and
separate overclouds too. Do we want IPMI running cross-datacentre? I
don't think so - bootstrap each datacentre independently, and once
it's running, it's running.

So within a datacentre - let's take HP's new Aurora facility
http://www.theregister.co.uk/2012/06/15/hp_aurora_data_centre/ - which
is perhaps best thought of as effectively subdivided into 5 cells,
each with about 9 tennis courts' worth of servers :) There are
apparently 10kW-rated racks, so if we filled it with Moonshots we'd
get, oh, 250 servers per rack without running into trouble, and what -
20 racks in a tennis court? So that's 20*9*250 or 45K servers per
cell, 225K in the whole datacentre. Since each cell is self-contained,
it would be counterproductive to extend a single overcloud across
cells: we need to work with the actual fabric of the DC; instead I
think we'd want to treat each cell as a separate DC. That then gives
us a goal: support 45K servers in a single 'DC'.
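
Spelling that arithmetic out (every input below is a guess from the
paragraph above, not a measured number):

    # Back-of-envelope cell sizing from the guesses above.
    racks_per_court = 20
    courts_per_cell = 9
    servers_per_rack = 250    # 10kW racks filled with Moonshots
    cells = 5

    servers_per_cell = racks_per_court * courts_per_cell * servers_per_rack
    print(servers_per_cell)           # 45000 per cell
    print(servers_per_cell * cells)   # 225000 in the whole datacentre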

Now, IPMI security and so forth: I don't see any security implications
in shuttling IPMI cross-rack. IPMI is a secure protocol, and if it
isn't, the issue we have is not sending it cross-rack, it's machines
in the same rack attacking each other. Additionally, to be able to
deploy undercloud machines themselves you need a full-HA
nova-baremetal with IPMI access, and you make that massively more
complex if you partition just some parts of the network but not all of
it: you'd need to model that in nova affinity to ensure you schedule
deployment nodes into the right area.

This leads me to suggest a very simple design:
 - one undercloud per fully-reachable IPMI-control fabric. Done :)
 - we gather data on performance scaling as node count scales
 - use that to parameterise how to grow the undercloud control plane
for a cloud (sketch below)
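
To make 'parameterise' concrete, the end state might be nothing
fancier than a sizing function like this - with the caveat that every
threshold below is a placeholder for the measured data we don't have
yet:

    # Hypothetical sizing function: undercloud control plane nodes for
    # a given managed node count.  All thresholds are placeholders to
    # be replaced once we have real scaling data.
    def control_plane_size(node_count):
        if node_count <= 100:
            return 2                        # minimal HA pair
        if node_count <= 5000:
            return 3                        # e.g. Galera quorum
        return 3 + node_count // 15000      # made-up growth rate

    for n in (50, 1000, 45000):
        print(n, control_plane_size(n))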

HTH!

-Rob

-- 
Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Converged Cloud


