[openstack-dev] [Neutron][LBaaS] HA functionality discussion

Stephen Balukoff sbalukoff at bluebox.net
Thu Apr 17 22:49:07 UTC 2014


Heyas, y'all!

So, given both the prioritization and usage info on HA functionality for
Neutron LBaaS here:
https://docs.google.com/spreadsheet/ccc?key=0Ar1FuMFYRhgadDVXZ25NM2NfbGtLTkR0TDFNUWJQUWc&usp=sharing

It's clear that:

A. HA seems to be a top priority for most operators
B. Almost all deployed load balancer functionality runs in an
Active/Standby HA configuration

I know there's been some roundabout discussion about this on the list in
the past (which usually got bogged down in "implementation details"
disagreements), but it seems to me that with so many players putting a high
priority on HA functionality, this is something we need to discuss and
address.

This is also apropos, as we're talking about doing a major revision of the
API, and it probably makes sense to seriously consider if or how HA-related
stuff should make it into the API. I'm of the opinion that almost all the
HA stuff should be hidden from the user/tenant, but that the admin/operator
at the very least is going to need to have some visibility into HA-related
functionality. The hope here is to discover what things make sense to have
as a "least common denominator" and what will have to be hidden behind a
driver-specific implementation.

I certainly have a pretty good idea how HA stuff works at our organization,
but I have almost no visibility into how this is done elsewhere; at least,
not in enough detail to know what makes sense to write API controls for.

So! Since gathering data about actual usage seems to have worked pretty
well before, I'd like to try that again. Yes, I'm going to be asking about
implementation details, but this is with the hope of discovering any "least
common denominator" factors which make sense to build API around.

For the purposes of this document, when I say "load balancer devices" I
mean either physical or virtual appliances, or software executing on a host
somewhere that actually does the load balancing. It need not directly
correspond with anything physical... but probably does. :P

And... all of these questions are meant to be interpreted from the
perspective of the cloud operator.

Here's what I'm looking to learn from those of you who are allowed to share
this data:

1. Are your load balancer devices shared between customers / tenants, not
shared, or some of both?

1a. If shared, what is your strategy to avoid or deal with collisions of
customer rfc1918 address space on back-end networks? (For example, I know
of no load balancer device that can balance traffic for both customer A and
customer B if both are using the 10.0.0.0/24 subnet for their back-end
networks containing the nodes to be balanced, unless an extra layer of
NATing is happening somewhere.)
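
The collision scenario above can be illustrated with a minimal Python sketch using the standard-library ipaddress module (the function name and example CIDRs are mine, purely for illustration):

```python
import ipaddress

def subnets_collide(cidr_a: str, cidr_b: str) -> bool:
    """Return True if two back-end subnets overlap, meaning a shared
    load balancer device can't reach both without an extra NAT layer."""
    net_a = ipaddress.ip_network(cidr_a)
    net_b = ipaddress.ip_network(cidr_b)
    return net_a.overlaps(net_b)

# Customer A and B both picked 10.0.0.0/24: collision.
print(subnets_collide("10.0.0.0/24", "10.0.0.0/24"))     # True
# Disjoint RFC 1918 ranges can coexist behind one shared device.
print(subnets_collide("10.0.0.0/24", "192.168.1.0/24"))  # False
```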

2. What kinds of metrics do you use in determining load balancing capacity?

3. Do you operate with a pool of unused load balancer device capacity
(which a cloud OS would need to keep track of), or do you spin up new
capacity (in the form of virtual servers, presumably) on the fly?

3a. If you're operating with an availability pool, can you describe how new
load balancer devices are added to your availability pool?  Specifically,
are there any steps in the process that must be manually performed (ie. so
no API could help with this)?

4. How are new devices 'registered' with the cloud OS? How are they removed
or replaced?

5. What kind of visibility do you (or would you) allow your user base to
see into the HA-related aspects of your load balancing services?

6. What kind of functionality and visibility do you need into the
operations of your load balancer devices in order to maintain your
services, troubleshoot, etc.? Specifically, are you managing the
infrastructure outside the purview of the cloud OS? Are there certain
aspects which would be easier to manage if done within the purview of the
cloud OS?

7. What kind of network topology is used when deploying load balancing
functionality? (ie. do your load balancer devices live inside or outside
customer firewalls, directly on tenant networks? Are you using layer-3
routing? etc.)

8. Is there any other data you can share which would be useful in
considering features of the API that only cloud operators would be able to
perform?


And since we're one of these operators, here are my responses:

1. We have both shared load balancer devices and private load balancer
devices.

1a. Our shared load balancers live outside customer firewalls, and we use
IPv6 to reach individual servers behind the firewalls "directly." We have
followed a careful deployment strategy across all our networks so that IPv6
addresses between tenants do not overlap.
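
The non-overlap property is easy to guarantee if per-tenant prefixes are all carved from a single parent prefix. A small sketch (the parent prefix here is the IPv6 documentation range, not one of ours):

```python
import ipaddress

def tenant_prefixes(parent: str, prefixlen: int):
    """Yield non-overlapping per-tenant subnets carved out of one parent
    prefix, so tenant address space can never collide."""
    return ipaddress.ip_network(parent).subnets(new_prefix=prefixlen)

gen = tenant_prefixes("2001:db8::/32", 48)
print(next(gen))  # 2001:db8::/48
print(next(gen))  # 2001:db8:1::/48
```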

2. The most useful ones for us are "number of appliances deployed" and
"number and type of load balancing services deployed" though we also pay
attention to:
* Load average per "active" appliance
* Per appliance number and type of load balancing services deployed
* Per appliance bandwidth consumption
* Per appliance connections / sec
* Per appliance SSL connections / sec

Since our devices are software appliances running on Linux, we also track
OS-level metrics, though these aren't used directly in the load balancing
features of our cloud OS.

3. We operate with an availability pool that our current cloud OS pays
attention to.

3a. Since the devices we use correspond to physical hardware this must of
course be rack-and-stacked by a datacenter technician, who also does
initial configuration of these devices.

4. All of our load balancers are deployed in an active / standby
configuration. Two machines which make up an active / standby pair are
registered with the cloud OS as a single unit that we call a "load balancer
cluster." Our availability pool consists of a whole bunch of these load
balancer clusters. (The devices themselves are registered individually at
the time the cluster object is created in our database.) There are a couple
of manual steps in this process (currently handled by the datacenter techs
who do the racking and stacking), but these could be automated via API. In
fact, as we move to virtual appliances with these, we expect the entire
process to become automated via API (first cluster primitive is created,
and then "load balancer device objects" get attached to it, then the
cluster gets added to our availability pool.)

Removal of a "cluster" object is handled by first evacuating any customer
services off the cluster, then destroying the load balancer device objects,
then the cluster object. Replacement of a single load balancer device
entails removing the dead device, adding the new one, synchronizing
configuration data to it, and starting services.
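
The cluster lifecycle described above could be modeled roughly like this; the class and method names are hypothetical, not our actual cloud OS API:

```python
class LoadBalancerCluster:
    """An active/standby pair registered with the cloud OS as one unit."""

    def __init__(self, name: str):
        self.name = name
        self.devices = []     # member device identifiers
        self.in_pool = False  # part of the availability pool?

    def attach_device(self, device_id: str):
        # Devices are registered individually as they join the cluster.
        self.devices.append(device_id)

    def add_to_pool(self):
        # Only a complete active/standby pair goes into the pool.
        if len(self.devices) >= 2:
            self.in_pool = True

    def replace_device(self, dead: str, new: str):
        # Remove the dead device and add the new one; config sync and
        # service start would happen out-of-band in a real system.
        self.devices.remove(dead)
        self.devices.append(new)

# Creation order: cluster primitive first, then devices, then the pool.
cluster = LoadBalancerCluster("lbc-001")
cluster.attach_device("lb-dev-a")
cluster.attach_device("lb-dev-b")
cluster.add_to_pool()
print(cluster.in_pool)  # True
cluster.replace_device("lb-dev-b", "lb-dev-c")
print(cluster.devices)  # ['lb-dev-a', 'lb-dev-c']
```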

5. At the present time, all our load balancing services are deployed in an
active / standby HA configuration, so the user has no choice or visibility
into any HA details. As we move to Neutron LBaaS, we would like to give
users the option of deploying non-HA load balancing capacity. Therefore,
the only visibility we want the user to get is:

* Choose whether a given load balancing service should be deployed in an HA
configuration ("flavor" functionality could handle this)
* See whether a running load balancing service is deployed in an HA
configuration (and see the "hint" for which physical or virtual device(s)
it's deployed on)
* Give a "hint" as to which device(s) a new load balancing service should
be deployed on (ie. for customers looking to deploy a bunch of test / QA /
etc. environments on the same device(s) to reduce costs).

Note that the "hint" above corresponds to the "load balancing cluster"
alluded to above, not necessarily any specific physical or virtual device.
This means we retain the ability to switch out the underlying hardware
powering a given service at any time.

Users may also see usage data, of course, but that's more of a generic
stats / billing function (which doesn't have to do with HA at all, really).

6. We need to see the status of all our load balancing devices, including
availability, current role (active or standby), and all the metrics listed
under 2 above. Some of this data is used for creating trend graphs and
business metrics, so being able to query the current metrics at any time
via API is important. It would also be very handy to query specific device
info (like revision of software on it, etc.) Our current cloud OS does all
this for us, and having Neutron LBaaS provide visibility into all of this
as well would be ideal. We do almost no management of our load balancing
services outside the purview of our current cloud OS.

7. Shared load balancers must live outside customer firewalls, private load
balancers typically live within customer firewalls (sometimes in a DMZ). In
any case, we use layer-3 routing (distributed using routing protocols on
our core networking gear and static routes on customer firewalls) to route
requests for "service IPs" to the "highly available routing IPs" which live
on the load balancers themselves. (When a fail-over happens, at a low
level, what's really going on is the "highly available routing IPs" shift
from the active to standby load balancer.)
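
The failover mechanics in that last parenthetical can be sketched as a toy state machine. This mimics VRRP-style address takeover; the names and addresses are illustrative, not our actual implementation:

```python
class HAPair:
    """Toy model of an active/standby pair holding shared HA routing IPs.

    On failover the routing IPs move from the active device to the
    standby, so the service IPs routed to them keep working."""

    def __init__(self, active: str, standby: str, routing_ips):
        self.active = active
        self.standby = standby
        self.routing_ips = list(routing_ips)

    def holder_of(self, ip: str) -> str:
        # All HA routing IPs live on whichever device is currently active.
        return self.active if ip in self.routing_ips else "unrouted"

    def fail_over(self):
        # The standby takes over the routing IPs and becomes active.
        self.active, self.standby = self.standby, self.active

pair = HAPair("lb-a", "lb-b", ["203.0.113.10"])
print(pair.holder_of("203.0.113.10"))  # lb-a
pair.fail_over()
print(pair.holder_of("203.0.113.10"))  # lb-b
```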

We have contemplated using layer-2 topology (ie. directly connected on the
same vlan / broadcast domain) and are building a version of our appliance
which can operate in this way, potentially reducing the reliance on layer-3
routes (and making things more friendly for the OpenStack environment,
which we understand probably isn't ready for layer-3 routing just yet).

8. I wrote this survey, so none come to mind for me. :)

Stephen

-- 
Stephen Balukoff
Blue Box Group, LLC
(800)613-4305 x807