[openstack-dev] [tripleo] Scaling of TripleO

James Slagle james.slagle at gmail.com
Mon Sep 9 14:03:16 UTC 2013


On Sat, Sep 07, 2013 at 09:31:14AM +1200, Robert Collins wrote:
> Hey James, thanks for starting this thread : it's clear we haven't
> articulated what we've discussed well enough [it's been a slog
> building up from the bottom...]
> 
> I think we need to set specific goals - latency, HA, diagnostics -
> before designing scaling approaches makes any sense : we can't achieve
> things we haven't set out to achieve.
> 
> For instance, if the entire set of goals was 'support 10K-node
> overclouds', I think we can do that today with a 2-machine undercloud
> control plane in full-HA mode.
> 
> So we need to be really clear : are we talking about scaling, or
> latency @ scale, or ops @ scale - and identify failure modes we should
> cater to, vs ones we shouldn't cater to (or that are outside of our
> domain, e.g. 'you need a fully end-to-end multipath network if you want
> network resiliency').

That certainly sounds like a reasonable approach.

> 
> My vision for TripleO/undercloud and scale in the long term is:
> - A fully redundant self-healing undercloud
>   - (implies self hosting)
> - And appropriate anti-affinity aggregates so that common failure
> domains can be avoided
> - With a scale-up Heat template that identifies the way to grow capacity
> - Able to deploy a 1K-node overcloud in < an hour(*)
> - And 10K [if we can get a suitable test environment] in < 2 hours
> 
> So that's sublinear performance degradation as scale increases.
> 
> For TripleO/overcloud and scale, that's something where we need to
> synthesize best practices from existing deployers - e.g. cells and so
> on - to deliver K+ scale configurations, but it's fundamentally
> decoupled from the undercloud: Heat is growing cross-cloud deployment
> facilities, so if we need multiple underclouds as a failure mitigation
> strategy, we
> can deploy one overcloud across multiple underclouds that way. I'm not
> convinced we need that complexity though: large network fabrics are
> completely capable of shipping overcloud images to machines in a
> couple of seconds per machine...
> 
> (*): Number pulled out of a hat. We'll need to drive it lower over time,
> but given we need time to check new builds are stable, and live
> migrate thousands of VMs concurrently across hundreds of hypervisors,
> I think 1 hour for a 1K node cloud deployment is sufficiently
> aggressive for now.
> 
> Now, how to achieve this?
> 
> The current all-in-one control plane is like that for three key reasons:
>  - small clouds need low-overhead control planes, running 12 or 15
> machines to deploy a 3-node overcloud doesn't make sense.
>  - bootstrapping an environment has to start on one machine by definition
>  - we haven't finished enough of the overall plumbing story to be
> working on the scale-out story in much detail
> (I'm very interested in where you got the idea that
> all-nodes-identical was the scaling plan for TripleO - it isn't :))

It's just a misconception on my part.  I was trying to get an understanding of
what a "2 machine/node undercloud in Full HA Mode" was.  I've seen that
mentioned in some of the TripleO presentations I've watched on YouTube and
such.

What's the 2nd node in the undercloud?  Is it more similar to the Leaf Node
proposal in Idea 1 I laid out...basically just enough services for Compute,
Networking, etc?

What do you mean by Full HA Mode?  The 2nd node serves as HA for the first, or
2 additional HA nodes, making 4 nodes total?  Or something else maybe :) ?

> Our d-i-b elements are already suitable for scaling different
> components independently - that's why nova and nova-kvm are separate:
> nova installs the nova software, nova-kvm installs the additional bits
> for a kvm hypervisor and configures the service to talk to the bus :
> this is how the overcloud scales.
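
To make that split concrete, here's a rough sketch (my illustration, not
anything from devtest) of how the same base element can be composed into
different images with diskimage-builder.  The role names and the extra
element lists here are assumptions on my part; only nova and nova-kvm come
from your description:

    # build_roles.py - illustrative only; assumes diskimage-builder is
    # installed and the named elements exist on the element path
    import subprocess

    ROLES = {
        # every image gets the nova software via the "nova" element;
        # role-specific elements (e.g. nova-kvm for a kvm hypervisor)
        # are layered on top of it
        "overcloud-compute": ["ubuntu", "nova", "nova-kvm"],
        "undercloud-control": ["ubuntu", "nova", "keystone", "heat-api"],
    }

    for image_name, elements in ROLES.items():
        # disk-image-create composes the listed elements into one image
        subprocess.check_call(["disk-image-create", "-o", image_name]
                              + elements)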
> 
> Now that we have reliable all-the-way-to-overcloud deployments working
> in devtest we've started working on the image-based updates
> (https://etherpad.openstack.org/tripleo-image-updates) which is a
> necessary precondition to scaling the undercloud control plane -
> because if you can't update a machine's role, it's really much harder
> to evolve a cluster.
> 
> The exact design of a scaled cluster isn't pinned down yet : I think
> we need much more data before we can sensibly do it: both on
> requirements - what's valuable for deployers - and on the scaling
> characteristics of nova baremetal/Ironic/keystone etc.

That maybe answers my previous question then.  The other node is not yet
defined.  I think that makes sense given some of the higher level things you'd
like to see discussed first, goals, requirements, etc.

> 
> All that said, some specific thoughts on the broad approaches you sketched:
> Running all services on all undercloud nodes would drive a lot of
> complexity in scale-out : there's a lot of state to migrate to new
> Galera nodes, for instance. I would hesitate to structure the
> undercloud like that.

There is definitely added complexity in that model.  I just wanted to make sure
it was brought up.  It sounds like we don't want that approach :).

> I don't really follow some of the discussion in Idea 1 : but scaling
> out things that need scaling out seems pretty sensible. We have no
> data suggesting how many thousands machines we'll get per nova
> baremetal machine at the moment, so it's very hard to say what
> services will need scaling at what points in time yet : but clearly we
> need to support it at some scale. OTOH once we scale to 'an entire
> datacentre' the undercloud doesn't need to scale further : I think
> having each datacentre be a separate deployment cloud makes a lot of
> sense.

The point of Idea 1 was somewhat twofold:

First, there is another image type, which we called the Leaf Node.  It's a
smaller set of services, not the whole Undercloud: whatever is necessary to
scale to larger workloads.  E.g., if the baremetal Compute driver does
eventually prove to be a bottleneck, it would obviously include that.

Second, as hardware is grouped into Logical Racks (could be multiple physical
racks or a subset of hardware across physical racks), you deploy a Leaf Node in
the Logical Rack as well to act as the Undercloud's management interface (so to
speak) to that logical rack.  This way, if you *wanted* to have some additional
network isolation in the logical rack, only the Leaf Node needs connectivity
back to the main Undercloud node (with all services).

Not saying that deploying a Leaf Node would be a hard requirement for each
logical rack, but more of a best practice or reference implementation type
approach.
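
For illustration, here's roughly how I picture the placement; the exact
service lists below are just my guesses, not part of the proposal itself:

    # Hypothetical service placement for Idea 1 (illustrative names only)
    UNDERCLOUD_SERVICES = [
        # the full control plane lives on the main Undercloud node
        "keystone", "glance", "heat", "nova-api", "nova-scheduler", "mysql",
    ]
    LEAF_NODE_SERVICES = [
        # just enough for Compute, Networking, etc. to serve one Logical Rack
        "nova-compute-baremetal", "neutron-dhcp-agent",
    ]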

> Perhaps if we just turn the discussion around and ask - what do we get
> if we add node type X to an undercloud; what do we get when we add a
> new undercloud? and the implications thereof.
> 
> Firstly, let's talk big picture: N-datacentre clouds. I think the
> 'build a fabric that clearly exposes performance and failure domains'
> approach has been very successful for containing complexity in the
> fabric and enabling [app] deployers to reason about performance and
> failure, so
> we shouldn't try to hide that. If you have two datacentres, that
> should be two regions, with no shared infrastructure.
> 
> That immediately implies (at least) one undercloud per datacentre, and
> separate overclouds too. Do we want IPMI running cross-datacentre? I
> don't think so - bootstrap each datacentre independently, and once
> it's running, it's running.
> 
> So within a datacentre - let's take HP's new Aurora facility
> http://www.theregister.co.uk/2012/06/15/hp_aurora_data_centre/ - which
> is perhaps best thought of as effectively subdivided into 5 cells,
> each with about 9 tennis courts' worth of servers :) There are
> apparently 10kW-rated racks, so if we filled it with moonshots we'd
> get, oh, 250 servers per rack without running into trouble, and what -
> 20 racks in a tennis court? So that's 20*9*250 or 45K servers per cell,
> 225K in the whole datacentre. Since each cell is self-contained, it
> would be counterproductive to extend a single overcloud across cells :
> we need to work with the actual fabric of the DC; instead I think we'd
> want to treat each cell as a separate DC. That then gives us a goal:
> support 45K servers in a single 'DC'.
> 
> Now, IPMI security and so forth : I don't see any security
> implications in shuttling IPMI cross-rack : IPMI is a secure protocol,
> and if it's not, the issues we have are not about sending it
> cross-rack, they're about machines in the same rack attacking each
> other. Additionally, to be able to deploy undercloud machines
> themselves you need a full-HA nova-baremetal with IPMI access, and you
> make that massively more complex if you partition just some parts of
> the network but not all of it : you'd need to model that in nova
> affinity to ensure you schedule deployment nodes into the right area.

Indeed, that is one of the questions about something like Idea 1: how to make
sure that the correct Compute Node handles the deploy request for its Logical
Rack.  It's something that I was starting to look at (although just
superficially, really).  I'm not sure what all exists in the nova scheduler to
help with this, or if the problem is already solved even.  I see various ways
to accomplish this in the Nova documentation.  Affinity, as you mentioned,
looked like one thing.  Host aggregates was another.
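
For example, here's a rough sketch (untested, just how I imagine it working
with python-novaclient; the rack, host, image, and flavor names are all made
up) of using host aggregates to pin deploys to a Logical Rack's Compute Node:

    # pin_rack.py - illustrative only
    from novaclient.v1_1 import client

    nova = client.Client("admin", "password", "admin-tenant",
                         "http://undercloud:5000/v2.0")

    # group the rack's baremetal Compute Node into an aggregate that is
    # exposed as its own availability zone
    agg = nova.aggregates.create("logical-rack-1", "rack-1-az")
    nova.aggregates.add_host(agg, "leaf-node-1")

    # deploy requests targeted at that zone land on the rack's Compute Node
    nova.servers.create("overcloud-node", image="overcloud-image-id",
                        flavor="baremetal-flavor-id",
                        availability_zone="rack-1-az")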

> 
> This leads me to suggest a very simple design:
>  - one undercloud per fully-reachable-fabric-of-IPMI control. Done :)
>  - we gather data on performance scaling as node counts scales

What type of hardware access does the team have to do any sort of performance
scaling testing?

I can ask around and see what I can find.

Alternatively, we could probably work on some sort of performance test suite
that tests without a bunch of physical hardware.  E.g., you don't necessarily
need a bunch of distinct nodes to test something like how many iSCSI targets
Nova Compute can reasonably populate at once, etc.
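
As a strawman, something along these lines could be a starting point (purely
illustrative: it just times raw tgtadm target creation on a single box, needs
root, and the target count and IQN prefix are arbitrary):

    # iscsi_target_bench.py - illustrative micro-benchmark only
    import subprocess
    import time

    N = 100  # arbitrary number of targets to create
    start = time.time()
    for tid in range(1, N + 1):
        # create one iSCSI target per loop iteration via tgtadm
        subprocess.check_call([
            "tgtadm", "--lld", "iscsi", "--op", "new", "--mode", "target",
            "--tid", str(tid),
            "--targetname", "iqn.2013-09.org.example:bench-%d" % tid,
        ])
    print("created %d targets in %.2f seconds" % (N, time.time() - start))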

>  - use that to parameterise how to grow the undercloud control plane for a cloud
> 
> HTH!

It does, excellent feedback!

--
James Slagle


