[openstack-dev] [Neutron][LBaaS] HA functionality discussion
Susanne Balle
sleipnir012 at gmail.com
Fri Apr 18 00:49:47 UTC 2014
I agree that the HA should be hidden from the user/tenant. IMHO a tenant
should just use a load-balancer as a “managed” black box where the service
is resilient in itself.
Our current Libra/LBaaS implementation in the HP public cloud uses a pool
of standby LBs to replace a tenant's failing LB. Our LBaaS service
monitors itself and replaces LBs when they fail. This is done via a set of
Admin API servers.
http://libra.readthedocs.org/en/latest/admin_api/index.html
The Admin server spawns several scheduled threads to run tasks such as
building new devices for the pool, monitoring load balancer devices and
maintaining IP addresses.
http://libra.readthedocs.org/en/latest/pool_mgm/about.html
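To make the idea concrete, here is a rough sketch of the standby-pool pattern described above. It is not Libra's actual code: the names (Device, StandbyPool, replace_failed, top_up) are hypothetical, and health checking is reduced to a boolean flag that a real monitor task would set.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Device:
    """Hypothetical load balancer device record."""
    name: str
    healthy: bool = True


@dataclass
class StandbyPool:
    """Hypothetical pool of pre-built standby devices."""
    standby: List[Device] = field(default_factory=list)

    def take(self) -> Device:
        # Hand out the oldest standby device first.
        if not self.standby:
            raise RuntimeError("standby pool is empty")
        return self.standby.pop(0)


def replace_failed(assignments: Dict[str, Device], pool: StandbyPool) -> List[str]:
    """Swap each tenant's failed device for a standby one.

    Returns the tenant IDs whose device was replaced.
    """
    replaced = []
    for tenant, device in list(assignments.items()):
        if not device.healthy:
            assignments[tenant] = pool.take()
            replaced.append(tenant)
    return replaced


def top_up(pool: StandbyPool, target: int, build: Callable[[int], Device]) -> int:
    """Build new devices until the pool holds `target` standbys.

    Returns the number of devices built.
    """
    built = 0
    while len(pool.standby) < target:
        pool.standby.append(build(built))
        built += 1
    return built
```

In a real service each of these functions would run as its own scheduled task, and the tenant never sees the swap; only the operator-facing side knows which device currently backs a given load balancer.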
Susanne
On Thu, Apr 17, 2014 at 6:49 PM, Stephen Balukoff <sbalukoff at bluebox.net> wrote:
> Heyas, y'all!
>
> So, given both the prioritization and usage info on HA functionality for
> Neutron LBaaS here:
> https://docs.google.com/spreadsheet/ccc?key=0Ar1FuMFYRhgadDVXZ25NM2NfbGtLTkR0TDFNUWJQUWc&usp=sharing
>
> It's clear that:
>
> A. HA seems to be a top priority for most operators
> B. Almost all load balancer functionality deployed is done so in an
> Active/Standby HA configuration
>
> I know there's been some round-about discussion about this on the list in
> the past (which usually got stymied in "implementation details"
> disagreements), but it seems to me that with so many players putting a high
> priority on HA functionality, this is something we need to discuss and
> address.
>
> This is also apropos, as we're talking about doing a major revision of the
> API, and it probably makes sense to seriously consider if or how HA-related
> stuff should make it into the API. I'm of the opinion that almost all the
> HA stuff should be hidden from the user/tenant, but that the admin/operator
> at the very least is going to need to have some visibility into HA-related
> functionality. The hope here is to discover what things make sense to have
> as a "least common denominator" and what will have to be hidden behind a
> driver-specific implementation.
>
>
> I certainly have a pretty good idea how HA stuff works at our
> organization, but I have almost no visibility into how this is done
> elsewhere, leastwise not enough detail to know what makes sense to write
> API controls for.
>
> So! Since gathering data about actual usage seems to have worked pretty
> well before, I'd like to try that again. Yes, I'm going to be asking about
> implementation details, but this is with the hope of discovering any "least
> common denominator" factors which make sense to build API around.
>
> For the purposes of this document, when I say "load balancer devices" I
> mean either physical or virtual appliances, or software executing on a host
> somewhere that actually does the load balancing. It need not directly
> correspond with anything physical... but probably does. :P
>
> And... all of these questions are meant to be interpreted from the
> perspective of the cloud operator.
>
> Here's what I'm looking to learn from those of you who are allowed to
> share this data:
>
> 1. Are your load balancer devices shared between customers / tenants, not
> shared, or some of both?
>
> 1a. If shared, what is your strategy to avoid or deal with collisions of
> customer rfc1918 address space on back-end networks? (For example, I know
> of no load balancer device that can balance traffic for both customer A and
> customer B if both are using the 10.0.0.0/24 subnet for their back-end
> networks containing the nodes to be balanced, unless an extra layer of
> NATing is happening somewhere.)
>
> 2. What kinds of metrics do you use in determining load balancing capacity?
>
> 3. Do you operate with a pool of unused load balancer device capacity
> (which a cloud OS would need to keep track of), or do you spin up new
> capacity (in the form of virtual servers, presumably) on the fly?
>
> 3a. If you're operating with an availability pool, can you describe how new
> load balancer devices are added to your availability pool? Specifically,
> are there any steps in the process that must be manually performed (i.e. so
> no API could help with this)?
>
> 4. How are new devices 'registered' with the cloud OS? How are they
> removed or replaced?
>
> 5. What kind of visibility do you (or would you) allow your user base to
> see into the HA-related aspects of your load balancing services?
>
> 6. What kind of functionality and visibility do you need into the
> operations of your load balancer devices in order to maintain your
> services, troubleshoot, etc.? Specifically, are you managing the
> infrastructure outside the purview of the cloud OS? Are there certain
> aspects which would be easier to manage if done within the purview of the
> cloud OS?
>
> 7. What kind of network topology is used when deploying load balancing
> functionality? (ie. do your load balancer devices live inside or outside
> customer firewalls, directly on tenant networks? Are you using layer-3
> routing? etc.)
>
> 8. Is there any other data you can share which would be useful in
> considering features of the API that only cloud operators would be able to
> perform?
>
>
> And since we're one of these operators, here are my responses:
>
> 1. We have both shared load balancer devices and private load balancer
> devices.
>
> 1a. Our shared load balancers live outside customer firewalls, and we use
> IPv6 to reach individual servers behind the firewalls "directly." We have
> followed a careful deployment strategy across all our networks so that IPv6
> addresses between tenants do not overlap.
>
> 2. The most useful ones for us are "number of appliances deployed" and
> "number and type of load balancing services deployed" though we also pay
> attention to:
> * Load average per "active" appliance
> * Per appliance number and type of load balancing services deployed
> * Per appliance bandwidth consumption
> * Per appliance connections / sec
> * Per appliance SSL connections / sec
>
> Since our devices are software appliances running on Linux, we also track
> OS-level metrics, though these aren't used directly in the load
> balancing features in our cloud OS.
>
> 3. We operate with an availability pool that our current cloud OS pays
> attention to.
>
> 3a. Since the devices we use correspond to physical hardware, these must of
> course be rack-and-stacked by a datacenter technician, who also does
> initial configuration of these devices.
>
> 4. All of our load balancers are deployed in an active / standby
> configuration. Two machines which make up an active / standby pair are
> registered with the cloud OS as a single unit that we call a "load balancer
> cluster." Our availability pool consists of a whole bunch of these load
> balancer clusters. (The devices themselves are registered individually at
> the time the cluster object is created in our database.) There are a couple
> manual steps in this process (currently handled by the datacenter techs who
> do the racking and stacking), but these could be automated via API. In
> fact, as we move to virtual appliances with these, we expect the entire
> process to become automated via API (first cluster primitive is created,
> and then "load balancer device objects" get attached to it, then the
> cluster gets added to our availability pool.)
>
> Removal of a "cluster" object is handled by first evacuating any customer
> services off the cluster, then destroying the load balancer device objects,
> then the cluster object. Replacement of a single load balancer device
> entails removing the dead device, adding the new one, synchronizing
> configuration data to it, and starting services.
>
> 5. At the present time, all our load balancing services are deployed in an
> active / standby HA configuration, so the user has no choice or visibility
> into any HA details. As we move to Neutron LBaaS, we would like to give
> users the option of deploying non-HA load balancing capacity. Therefore,
> the only visibility we want the user to get is:
>
> * Choose whether a given load balancing service should be deployed in an
> HA configuration ("flavor" functionality could handle this)
> * See whether a running load balancing service is deployed in an HA
> configuration (and see the "hint" for which physical or virtual device(s)
> it's deployed on)
> * Give a "hint" as to which device(s) a new load balancing service should
> be deployed on (ie. for customers looking to deploy a bunch of test / QA /
> etc. environments on the same device(s) to reduce costs).
>
> Note that the "hint" above corresponds to the "load balancer cluster"
> described under 4 above, not necessarily any specific physical or virtual device.
> This means we retain the ability to switch out the underlying hardware
> powering a given service at any time.
>
> Users may also see usage data, of course, but that's more of a generic
> stats / billing function (which doesn't have to do with HA at all, really).
>
> 6. We need to see the status of all our load balancing devices, including
> availability, current role (active or standby), and all the metrics listed
> under 2 above. Some of this data is used for creating trend graphs and
> business metrics, so being able to query the current metrics at any time
> via API is important. It would also be very handy to query specific device
> info (like revision of software on it, etc.) Our current cloud OS does all
> this for us, and having Neutron LBaaS provide visibility into all of this
> as well would be ideal. We do almost no management of our load balancing
> services outside the purview of our current cloud OS.
>
> 7. Shared load balancers must live outside customer firewalls; private
> load balancers typically live within customer firewalls (sometimes in a
> DMZ). In any case, we use layer-3 routing (distributed using routing
> protocols on our core networking gear and static routes on customer
> firewalls) to route requests for "service IPs" to the "highly available
> routing IPs" which live on the load balancers themselves. (When a fail-over
> happens, at a low level, what's really going on is the "highly available
> routing IPs" shift from the active to standby load balancer.)
>
> We have contemplated using layer-2 topology (ie. directly connected on the
> same vlan / broadcast domain) and are building a version of our appliance
> which can operate in this way, potentially reducing the reliance on layer-3
> routes (and making things more friendly for the OpenStack environment,
> which we understand probably isn't ready for layer-3 routing just yet).
>
> 8. I wrote this survey, so none come to mind for me. :)
>
> Stephen
>
> --
> Stephen Balukoff
> Blue Box Group, LLC
> (800)613-4305 x807
>