[openstack-dev] [Neutron][LBaaS] Proposal for model change - Loadbalancer Instance feedback

Stephen Balukoff sbalukoff at bluebox.net
Wed Feb 19 01:59:40 UTC 2014


Hi Eugene,

First, thanks again for your speedy response and entertaining my ideas, eh!
And I apologize that it took me a few days to get back to you on this.

My thoughts are inline below:


On Thu, Feb 13, 2014 at 4:07 AM, Eugene Nikanorov
<enikanorov at mirantis.com> wrote:

>
>> I've been reading through the LoadBalancerInstance description as outlined
>> here and have some feedback:
>> https://wiki.openstack.org/wiki/Neutron/LBaaS/LoadbalancerInstance
>>
>> First off, I agree that we need a container object and that the pool
>> shouldn't be the object root. This container object is going to have some
>> attributes associated with it which then will apply to all related objects
>> further down on the chain.  (I'm thinking, for example, that it may make
>> sense for the loadbalancer to have 'network_id' as an attribute, and the
>> associated VIPs, pools, etc. will inherit this from the container object.)
>>
>
> In particular, network_id could be different for the VIP and the pool in case
> the balancer works in routed mode (e.g. it connects to different networks).
>

For VIP and pool subnets in routed mode, would these be associated with a
neutron network? (It occurs to me that layer-3 routing means subnets in
this category won't be associated with a "connected" layer-2 network, so
much as a layer-3 routing rule, which sends all traffic destined for said
subnets ultimately to some address on a layer-2 "connected" subnet.)  I'm
unfamiliar with how this is supposed to work with Neutron.

Also, do "pools" really need a network_id?  (Wouldn't it be valid for the
pool's members to come from several different networks, so long as the load
balancer has a way (route, NAT, etc.) to communicate with them?)
 (Practically speaking, we almost never see this--  but there's no
technical reason it couldn't be done.)


>> One thing that was not clear to me just yet: Is the 'loadbalancer' object
>> meant, at least eventually, to be associated with an actual load balancer
>> device of some kind (be that the neutron node with haproxy, a vendor
>> appliance or a software appliance)?
>>
>
> Yes, that is one of the proposed roles of the 'loadbalancer' object, but not
> the only one. An appliance is not the only representation of the balancer
> that we are working with; it could also be a process on the host that is
> controlled by the agent. So other types of associations are also necessary
> (like an association between the agent and the 'loadbalancer').
>
>
Yes, of course the local haproxy instance would fall under the same model--
but what I mean by this is that a 'loadbalancer' is meant to ultimately be
associated with a "device" of some kind, even if that device is the special
case of an haproxy instance running on the neutron network node.


>
>> If not, then I think we should use a name other than 'Loadbalancer' so we
>> don't confuse people. I realize I might just be harping on one of the two
>> truly difficult problems in software engineering (which are: Naming things,
>> cache invalidation, and off-by-one errors). But if a 'loadbalancer' object
>> isn't meant to actually be synonymous with a load balancer appliance of
>> some kind, the object needs a new name.
>>
>
> I don't mind having another name like 'instance', for example. But an
> appliance (be it a device or a process+agent) is really a synonym for
> what I am proposing.
>
>
>> If the object and the device are meant to essentially be synonymous, then
>> I think we're starting off too simplistic here, and the model proposed is
>> going to need another significant revision when we add additional features
>> later on.  I suspect we'll be painting ourselves into a corner with the
>> LoadBalancerInstance as proposed. Specifically, I'm thinking about:
>>
>>
>>    - Operational concerns around the life cycle of a physical piece of
>>    infrastructure. If we're going to replace a physical load balancer, it
>>    often makes sense to have both the old and new load balancer defined in the
>>    system at the same time during the transition. If you then swap all the
>>    VIPs from the old to the new, suddenly all the child objects have their
>>    loadbalancer_id changed, which will often wreak havoc on client application
>>    code (who really shouldn't be hard-coding things like loadbalancer_id, but
>>    will do so anyway. :P ) Such transitions are much easier accomplished if
>>    both load balancers can exist within an overarching container object (ie.
>>    "cluster" in my proposal) which will never need to be swapped out.
>>
> I'd like to understand that better. I guess no model is complex enough to
> describe each and every use case. What I'm trying to address with the lb
> instance is both the simplistic cases that are supported by the current
> code plus some more complex configurations like multiple pools (L7) and
> multiple vips. And at the same time we need to consider backward
> compatibility, and we also need to make some progress. The bigger the
> change, the harder it is to make progress. So we need to find an
> iterative way of increasing API and model complexity.
>

No model is complex enough to show every use case, but the model should be
complex enough to show the most common (anticipated) topologies or use
cases. I believe these are: Single load balancer, HA (active-standby), and
HA (n-node active-active).  I understand the concern about not wanting to
change too many things at once in an effort to maintain backward
compatibility. I would add that it does nobody any good to over-engineer
the system without real-world data and use scenarios dictating need. Keep
it simple, stupid, right?

Having said this, I know we're already talking about breaking
backward-compatible workflows in the other thread. If we're going to have
to do this in order to support L7, and if we're too late to get these
changes into Icehouse anyway, then why not consider the next obvious
features to solve in Neutron load balancing once we have L7 and SSL: High
Availability and scalability? If it's such a pain to break
backward-compatible workflows, realize we're probably signing up for the
same pain again when we tackle the HA and scalability problems. :/

Besides, I don't think it's actually too much of a change to introduce
objects into the model which correspond to topology descriptions (ie.
"cluster" and "load balancer", separate from VIP, listener, etc.). These can
have very little effect on the workflow for the tenant as well, but can make
things much easier for the cloud administrator who has to worry about
operational considerations in delivering the load balancing service.
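
To make that concrete, here is a rough sketch of how I picture those objects
relating. The names and fields below are purely illustrative (mine, not a
worked-out Neutron schema), just to show where the tenant/administrator line
would fall:

class LoadBalancer(object):
    """An actual appliance, VM or haproxy process: a cloud-admin concern."""
    def __init__(self, id, driver, host_or_device):
        self.id = id
        self.driver = driver                   # e.g. 'haproxy' or a vendor driver
        self.host_or_device = host_or_device   # where it actually runs


class Vip(object):
    def __init__(self, id, address, listeners=None):
        self.id = id
        self.address = address
        self.listeners = listeners or []       # listeners/L7 rules hang off these


class Cluster(object):
    """The tenant-facing container; never swapped out over its lifetime."""
    def __init__(self, id, cluster_type):
        self.id = id
        # e.g. 'SINGLE', 'HA_ACTIVE_STANDBY' or 'HA_ACTIVE_ACTIVE'
        self.cluster_type = cluster_type
        self.vips = []            # logical config the tenant manages
        self.load_balancers = []  # the devices realizing it (admin-only view)

The tenant only ever sees the Cluster and its logical children; the
load_balancers list is the cloud administrator's business.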

Anyway, regarding the operational considerations you wanted to hear more
about:

One often hears about the benefits of the cloud from the application developer's
perspective: You don't have to worry about "real" hardware anymore.
Software defines networking. There's a perception of infinite capacity. And
operational concerns are minimized as the system scales without having to
make (many) concessions in the code for the limitations of the underlying
hardware.

What one doesn't hear touted nearly as often is that the cloud also makes
things easier in many ways for those whose job it is to handle those
operational concerns. Suddenly maintenance schedules don't need to be
coordinated nearly as much with tenant schedules. Hardware can fail
willy-nilly and the operator just has to plug a new machine into the
network and the cloud "takes care of" assimilating it and making it
available for tenant use. Old hardware can be end-of-lifed gracefully and
new hardware installed without disrupting tenant application availability
or performance. Capacity planning can be done "in aggregate" across the
whole cloud without having to worry too much about a single tenant's quirky
cluster or needs. And especially: once the cloud OS is advanced enough to
deliver out-of-the-box high availability and scalability, those 3:00am
phone calls start to become less frequent.

At some level, *someone* needs to be aware of and concerned with physical
hardware and operational concerns. In the OpenStack paradigm, I think this
is the person or team acting as the "cloud administrator" (as opposed to
the "tenant" which is usually a team of application engineers). Tenants
generally don't want to be and shouldn't be exposed to anything that
approaches "real" hardware (unless there are business reasons for doing
so-- like the client having paid for premium load balancing services on
hardware purchased from Vendor X). But cloud administrators *must* be
exposed to actual hardware (or analogues thereto, like virtual appliances).

In any case, the cloud operating system needs to be aware of each component
(hardware, virtual appliance, process, whatever) that makes it work. It's
true that some complexity can be hidden behind a "driver" of some kind. But
I think this cripples us when it comes to developing standard features
(let alone advanced features) that can be supported in the generic use
case. For example, I believe that high availability should be a standard
feature and not delegated to the driver level. Having a model which cannot
inherently work with high availability and scalability features is going to
be limiting, and is going to prevent the cloud from delivering its benefits
to cloud administrators.

There is precedent for this in OpenStack as well: Nova is aware of the
hardware nodes that make up compute. Neutron is aware of each of its
agents, as well as what physical nodes are delivering network services. It
seems strange to me that we wouldn't consider giving Neutron LBaaS
intelligence about how the hardware (or virtual hardware)
components fulfilling the load-balancing role are actually laid out and
configured. Further, if we deliver HA / scalability features only at the
driver level, then vendors are likely to have incompatible implementations,
which means that organizations who are vendor agnostic will nevertheless
experience vendor lock-in.

So why the need for the "cluster" object in the model I proposed? In my
experience, application developers tend not to care at all about how their
load balancing works... until they do. That is to say, throughout the
application life-cycle, the requirements for what load balancing must do tend
to become very specific, as dictated by the application developers
and/or the business needs of the organization they're working for. (eg.
client doesn't care about the load balancer topology until a hardware
failure takes down the site for a half hour.) It's also quite common for
clients to be very opinionated about which vendor's load balancing product
they use, and they'll sometimes have security concerns which might not make
much sense in a cloud (eg. "Staging and production environments *must* be
on different physical hardware. But to save costs, we'd like all these 15
development environments to be served from the same single load
balancer."), but which nevertheless are real concerns that operators must
conform to.

The "cluster" object therefore serves the role of being that container
object that tenants can be aware of, that will never need to be upgraded or
replaced in and of itself, and that application developers can write
automation around. At the same time, tenants do not need to be aware of the
"physical" (or virtual) "load balancer" components, that the cloud
administrator must be aware of, and out of which the "cluster" is actually
built. In other words, the "cluster" component serves as the conceptual
bridge and demarkation point between what the application developer /
tenant might need to worry about / write automation around / include in the
design for business reasons, and what the cloud administrator must worry
about by virtue of having to reliably deliver and maintain the "physical"
layer.

Does this make sense?


>>    - Having tenants know about loadbalancer_id (if it corresponds with
>>    physical hardware) feels inherently un-cloud-like to me. Better that said
>>    tenants know about the container object (which doesn't actually correspond
>>    with any single physical piece of infrastructure) and not concern
>>    themselves with physical hardware.
>>
> Having loadbalancer_id has nothing to do with the appliance or a particular
> backend, so it might not even give a tenant any clue about the
> backend type. However, a tenant may want to know something about the backend
> and may want to use a single appliance for their needs (due to
> quotas, billing or topology limitations), and that's where loadbalancer_id
> helps to envelop resources and group them onto just one (some specific!)
> physical backend.
>

See discussion of "cluster" above, and why this should be different from
"load balancer instance."

>
>>    -
>>    - In an active-standby or active-active HA load balancer topology
>>    (ie. anything other than 'single device' topology), multiple load balancers
>>    will carry the same configuration, as far as VIPs, Pools, Members, etc. are
>>    concerned. Therefore, it doesn't make sense for the 'container' object to
>>    be synonymous with a single device. It might be possible to hide this
>>    complexity from the model by having HA features exist/exposed only within
>>    the driver, but this seems like really backward thinking to me: Why
>>    shouldn't we allow API-based configuration of load balancer cluster
>>    topology within our model, or force clients to talk to a driver directly
>>    for these features?  (This is one of the hack-ish work-arounds I alluded to
>>    in my e-mail from Monday which is both annoying and corrected with a model
>>    which can accurately reflect the topology we're working with.)
>>
> HA is a valid question, and HA is definitely not represented by two
> different loadbalancer configurations. It is a property of one instance.
> Remember that the loadbalancer and all its child objects are a logical config;
> it could be deployed in HA or in single mode, depending on the user's choice
> and driver capabilities.
> And regarding allowing users to talk directly to a driver - that's a big
> question. I think it can be desirable in some cases.
> You know, lb appliances nowadays can make cookies and fly to space and I'm
> not sure we want all that in the generic lb API, but that could be provided
> via specific extensions. Even in a public cloud case, where you want your
> tenants to be completely unaware of the backend, it's still possible that
> there are some tenants with specific demands (willing to pay) where more
> control could be useful.
>
> And going back to HA, this is a common feature that needs to be introduced
> into generic API.
>
>

So, I'm thinking that the "cluster" in my design fills that need.

For example, if a cluster has a "cluster_type" of "HA-active-standby", then
from that the cloud OS knows it needs to keep two load balancer nodes up
and running (and that these need to be configured in an
active-standby role, though such specific configuration will happen at the
driver level). Again, I think the cloud OS should be aware of how many
physical (or virtual) load balancer nodes make up a "cluster", and that this
shouldn't be visible only to the driver.
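
As a very rough illustration of what I mean (purely hypothetical names, not a
proposal for actual plugin code), the scheduling logic in the LBaaS service,
rather than the driver, could derive the required node count from the declared
topology:

# The LBaaS service (not the driver) decides how many balancer nodes a
# cluster needs, based on its declared topology.
REQUIRED_NODES = {
    'SINGLE': 1,
    'HA_ACTIVE_STANDBY': 2,
}


def nodes_needed(cluster_type, requested_nodes=None):
    """How many load balancer nodes the cloud OS must keep running."""
    if cluster_type == 'HA_ACTIVE_ACTIVE':
        # Active-active scales out; the tenant/operator picks the count.
        return max(2, requested_nodes or 2)
    return REQUIRED_NODES[cluster_type]


def nodes_to_spawn(cluster_type, healthy_nodes, requested_nodes=None):
    """How many new nodes the service should spin up; the driver only
    configures them once they exist."""
    return max(0, nodes_needed(cluster_type, requested_nodes) - healthy_nodes)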

HA and horizontal scalability do not seem like those "fly into space"
vendor-specific features. They seem like core components we should be
shooting to achieve in the general case.


>
>>    - Side note: It's possible to still have drivers / load balancer
>>    appliances which do not support certain types of HA or auto-scaling
>>    topologies. In this case, it probably makes sense to add some kind of
>>    'capabilities' list that the driver publishes to the lbaas daemon when it's
>>    loaded.
>>
>  Yes, that makes sense and was discussed previously, but not implemented
> so far.
>

Aah, cool!
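
For what it's worth, I'd picture something as simple as the driver advertising
the topologies it supports when it loads. A sketch with made-up names (nothing
like this exists in the code today):

class FancyVendorDriver(object):
    # Hypothetical: topologies this backend can actually deliver.
    supported_topologies = frozenset(['SINGLE', 'HA_ACTIVE_STANDBY'])


def check_cluster_type(driver, cluster_type):
    """The LBaaS service refuses topologies the chosen driver can't provide."""
    if cluster_type not in driver.supported_topologies:
        raise ValueError('%s does not support %s clusters'
                         % (driver.__class__.__name__, cluster_type))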

>
>>    -
>>
>> So, I won't elaborate on the model I would propose we use instead of the
>> above, since I've already done so before. I'll just say that what we're
>> trying to solve with the LoadBalancerInstance resource in this proposal can
>> also be solved with the 'cluster' and 'loadbalancer' resources in the model
>> I've proposed, and my proposal is also capable of supporting HA and
>> auto-scaling topologies without further significant model changes.
>>
>> Beyond this, other feedback I have:
>>
>>
>>    - I don't recommend having loadbalancer_id added to the pool, member,
>>    and healthmonitor objects. I see no reason a pool (and its child objects)
>>    needs to be restricted to a single load balancer (and it may be very
>>    advantageous in large clusters for it not to be).
>>
> That is something we need to evaluate. Currently we already have that,
> with the difference that the pool is our 'loadbalancer object'.
> The ability to share objects between appliances definitely makes sense, but
> I'm not sure we need to address it right now.
>

Fair enough-- but I worry that if people write automation around
determining loadbalancer_id ("cluster_id" in my model) based on the pool or
member, then we'll break that automation if we ever add the ability for a
single pool to be shared across load balancers (clusters).
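
To illustrate the kind of tenant automation I'm worried about (the client call
and field names here are made up, just to show the shape of the problem):

def cluster_for_pool(client, pool_id):
    # Hypothetical helper: assumes a pool maps to exactly one container.
    pool = client.show_pool(pool_id)['pool']
    # Works only while pool -> cluster is one-to-one. If pools can later be
    # shared, this becomes something like pool['cluster_ids'] (a list), and
    # every caller of this helper has to change.
    return pool['cluster_id']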


>
>
>>    - In general, I prefer DRY code, so if we can avoid adding the
>>    loadbalancer_id attribute to existing resources except for where it's
>>    really needed, that's what I'd recommend. (Do we expect significant savings
>>    by repeating this attribute in various locations and avoiding one SQL
>>    query? It seems to me we're inviting annoying bugs we'll have to work out
>>    by having that field essentially act as a cache for the authoritative
>>    source of information-- ie. the load balancer (cluster) object itself.)
>>
> Well, the DRY principle is good, but saving on SQL queries is also good,
> because the more queries we do, the greater the chances of various
> races, not to mention that it could just eat up a bit of performance if the
> DB is large enough.
>

Fair enough, eh.

>
>>    -
>>    - Having loadbalancer_id (cluster_id in my model) as an attribute of the
>>    VIP makes sense. I can't think of any reason a given VIP would be
>>    associated with multiple load balancers (clusters).
>>
>>
>> Thanks,
>> Stephen
>>
>>
> One of the important properties of the proposed loadbalancer object is that
> it is not just another API entity, but also a helper object for various
> coding problems that mostly don't affect API consumers.
>
>
Yep--  I think I'm addressing this with my proposed "cluster" object, also
being separate from the "loadbalancer" object, eh.


> Thanks,
> Eugene.
>
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>


-- 
Stephen Balukoff
Blue Box Group, LLC
(800)613-4305 x807

