[openstack-dev] [tripleo] Scaling of TripleO

Robert Collins robertc at robertcollins.net
Mon Sep 9 20:55:56 UTC 2013


On 10 September 2013 02:03, James Slagle <james.slagle at gmail.com> wrote:
>> working on the scale-out story in much detail
>> (I'm very interested in where you got the idea that
>> all-nodes-identical was the scaling plan for TripleO - it isn't :))
>
> It's just a misconception on my part.  I was trying to get an understanding of
> what a "2 machine/node undercloud in Full HA Mode" was.  I've seen that
> mentioned in some of the tripleo presentations I've watched on YouTube and
> such.

Ok, so we'll need to be clearer in future discussions - thanks!

> What's the 2nd node in the undercloud?  Is it more similar to the Leaf Node
> proposal in Idea 1 I laid out...basically just enough services for Compute,
> Networking, etc?

The 2nd node would be identical, so that we get full HA.

> What do you mean by Full HA Mode?  The 2nd node serves as HA for the first, or
> 2 additional HA nodes, making 4 nodes total?  Or something else maybe :) ?

By full HA I mean 'all services in HA mode' vs 'most services in HA'.
We need HA for the bus, data store, APIs, endpoints - the works.
This requires a minimum of 2 nodes, with 3 recommended so that we
don't have split-brain situations; but beyond that we should be able
to add capacity to specific services rather than duplicating
everything: the big question for me is at what rate we need to add
capacity, and to which services.
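
To make the split-brain point concrete, here's the plain
majority-quorum arithmetic (nothing tool-specific, just the math):

  # A cluster of n nodes keeps serving only while a strict majority of
  # nodes can see each other; the minority side must stop to avoid
  # split brain.
  def quorum(n):
      return n // 2 + 1

  for n in (2, 3, 5):
      print("%d nodes: quorum %d, tolerates %d failure(s)"
            % (n, quorum(n), n - quorum(n)))
  # 2 nodes: quorum 2, tolerates 0 failure(s)  <- why 2 isn't enough
  # 3 nodes: quorum 2, tolerates 1 failure(s)
  # 5 nodes: quorum 3, tolerates 2 failure(s)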

>> The exact design of a scaled cluster isn't pinned down yet: I think
>> we need much more data before we can sensibly do it: both on
>> requirements - what's valuable for deployers - and on the scaling
>> characteristics of nova baremetal/Ironic/keystone etc.
>
> That maybe answers my previous question then.  The other node is not yet
> defined.  I think that makes sense given some of the higher level things you'd
> like to see discussed first, goals, requirements, etc.

Yup.

>> I don't really follow some of the discussion in Idea 1: but scaling
>> out things that need scaling out seems pretty sensible. We have no
>> data suggesting how many thousands of machines we'll get per nova
>> baremetal machine at the moment, so it's very hard to say what
>> services will need scaling at what points in time yet: but clearly we
>> need to support it at some scale. OTOH once we scale to 'an entire
>> datacentre' the undercloud doesn't need to scale further: I think
>> having each datacentre be a separate deployment cloud makes a lot of
>> sense.
>
> The point of Idea 1 was somewhat twofold:
>
> First, there is another image type, which we called the Leaf Node.  It's a
> smaller set of services, not the whole Undercloud.  Whatever is necessary to
> scale to larger workloads.  E.g., if the baremetal Compute driver does
> eventually prove to be a bottleneck, it would obviously include that.
>
> Second, as hardware is grouped into Logical Racks (could be multiple physical
> racks or a subset of hardware across physical racks), you deploy a Leaf Node in
> the Logical Rack as well to act as the Undercloud's management interface (so to
> speak) to that logical rack.  This way, if you *wanted* to have some additional
> network isolation in the logical rack, only the Leaf Nodes need connectivity
> back to the main Undercloud node (with all services).
>
> Not saying that deploying a Leaf Node would be a hard requirement for each
> logical rack, but more of a best practice or reference implementation type
> approach.

I see. Ok - so the issue would be: if there is limited
connectivity for deployment orchestration, how does the first thing in
that rack get set up? I think we can simply define it as:
 * If you can PXE deploy into that rack from elsewhere, then that rack
is part of the $elsewhere undercloud.
 * If you cannot, then it is a new undercloud.

This will be a lot easier to reason about, for all that it may be
harder to deliver a single overcloud across both racks: it puts the
scheduling, configuration, and how-to-preserve-HA-within-that-rack
concerns all clearly where they should be, and raises interesting
questions about Heat for cross-cloud :).
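
Something like the following sketch encodes that rule (the names are
purely illustrative - there's no such API today):

  # Hypothetical sketch, not a real TripleO/nova API: a rack belongs to
  # whichever undercloud can PXE-deploy into it; an unreachable rack
  # seeds a new undercloud of its own.
  def owning_undercloud(rack, underclouds, can_pxe_deploy):
      for uc in underclouds:
          if can_pxe_deploy(uc, rack):  # L2 adjacency for DHCP/TFTP
              return uc                 # rack is part of $elsewhere
      return None                       # stand up a new undercloud here

  # Toy usage with a static reachability map instead of a real probe:
  reachable = {("uc-east", "rack-7")}
  print(owning_undercloud("rack-7", ["uc-east", "uc-west"],
                          lambda uc, r: (uc, r) in reachable))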


>> This leads me to suggest a very simple design:
>>  - one undercloud per fully-reachable-fabric-of-IPMI control. Done :)
>>  - we gather data on performance scaling as node counts scales
>
> What type of hardware access does the team have to do any sort of performance
> scaling testing?

We've got 40ish production-cloud-scale machines w/10Gbps Ethernet on a
gosh-I-don't-know-how-fast backplane today, though they are currently
running a long-lived proof of concept. We'll have full access again at
the end of the month, and I'm going to see if I can reclaim some for
the sprint.

There are also some rather large testing labs w/in HP that we can use
with prior arrangement; once we can disable disk injection (our
current known scale-defeater) in nova, I intend to arrange a scale
test.

> I can ask around and see what I can find.
>
> Alternatively, we could probably work on some sort of performance test suite
> that tested without a bunch of physical hardware.  E.g., you don't necessarily
> need a bunch of distinct nodes to test something like how many iSCSI targets
> can Nova Compute reasonably populate at once, etc.

I think a perf test suite is a great idea. Probably wants to be part
of Tempest. That said, there are /significant/ performance differences
between virt and physical (and depending on exact config virt can be
either overly fast or overly slow), so at most I'd want to use virt
results as a flag for actual physical testing.
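
For your iSCSI example, the sort of micro-benchmark I'd imagine looks
roughly like this (illustrative only - not an existing Tempest test;
it needs root and a running tgtd):

  # Time how long tgtadm takes to populate N iSCSI targets, the way
  # nova compute's backend would during a mass deployment.
  import subprocess, time

  def make_target(tid):
      subprocess.check_call(
          ["tgtadm", "--lld", "iscsi", "--op", "new", "--mode", "target",
           "--tid", str(tid),
           "--targetname", "iqn.2013-09.org.example:perf-%d" % tid])

  start = time.time()
  for tid in range(1, 101):
      make_target(tid)
  print("created 100 targets in %.2fs" % (time.time() - start))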

For instance, deploying to baremetal from a kvm-hosted seed node was
an order of magnitude slower than deploying baremetal->baremetal. I
didn't track down the cause at the time, but it looked like bad jumbo
frame support in the virt network datapath causing low utilisation and
overly high physical packet counts.
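
(A trivial first check on any suspect path is just to compare MTUs end
to end - e.g. on Linux:

  # Print the MTU of every interface; a jumbo-frame path needs ~9000
  # everywhere, and a single 1500 hop explains high packet counts.
  import os
  for iface in sorted(os.listdir("/sys/class/net")):
      with open("/sys/class/net/%s/mtu" % iface) as f:
          print("%s %s" % (iface, f.read().strip()))

though in the kvm case above the culprit looked deeper in the virt
datapath than a simple MTU mismatch.)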

-Rob

-- 
Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Converged Cloud


