[openstack-dev] [Heat] How the autoscale API should control scaling in Heat

Steven Hardy shardy at redhat.com
Wed Sep 11 12:59:02 UTC 2013


On Wed, Sep 11, 2013 at 03:51:02AM +0000, Adrian Otto wrote:
> I have a different point of view. First I will offer some assertions:
> 
> A-1) We need to keep it simple.
> 	A-1.1) Systems that are hard to comprehend are hard to debug, and that's bad.
> 	A-1.2) Complex systems tend to be much more brittle than simple ones.

I don't think anyone will disagree with this, but the solutions we've been
discussing are neither complex nor hard to comprehend.

The layered topology discussed is simply aimed at avoiding significant
duplication of functionality between services; the best way to do that is
to implement each piece of functionality in one service, with the scope of
each service well defined and separated.

> A-2) Scale-up operations need to be as-fast-as-possible. 
> 	A-2.1) Auto-Scaling only works right if your new capacity is added quickly when your controller detects that you need more. You can't spend a bunch of time goofing around before actually adding a new resource to a pool when it's under strain.
> 	A-2.2) The fewer network round trips between "add-more-resources-now" and "resources-added" the better. Fewer = less brittle.

Sure, latency in any control system is important, but in this case, the
additional delay caused by one additional service in the chain is very
likely to be insignificant compared to the time taken to build, launch, and
customize an instance.

> A-3) The control logic for scaling different applications vary. 
> 	A-3.1) What metrics are watched may differ between various use cases. 
> 	A-3.2) The data types that represent sensor data may vary.

So?  The metric source will be the same (i.e. Ceilometer) regardless of
where AS is implemented.

> 	A-3.3) The policy that's applied to the metrics (such as max, min, and cooldown period) vary between applications. Not only the values vary, but the logic itself.
> 	A-3.4) A scaling policy may not just be a handful of simple parameters. Ideally it allows configurable logic that the end-user can control to some extent.

Ok, so having some way to implement specialized scaling policies seems,
AFAICT, to be the main driver behind all this autoscaling-service
discussion?

I don't think anyone has ever said we shouldn't provide an interface which
allows end users to implement whatever scaling policy they want.  To some
extent Provider resources already allow this.

Something we discussed at the Havana summit (but has not yet been
implemented) was the idea of a generic webhook based policy resource, which
took data associated with the scaling event (alarm) and simply made a
request to $special_scaling_service and then acted on the result.  This
would probably be very easy to implement as a Heat resource.
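As a rough illustration of how small such a resource could be, here is a hedged sketch in Python. All the names here (`ExternalPolicy`, `parse_decision`, the `scaling_group.adjust()` interface, and the reply format `{"adjust": N}`) are illustrative assumptions, not real Heat APIs:

```python
import json
import urllib.request

def parse_decision(body):
    """Extract the adjustment count from the policy service's JSON reply
    (assumed format: {"adjust": <delta>}; missing key means no action)."""
    return json.loads(body).get("adjust", 0)

class ExternalPolicy:
    """Illustrative webhook policy resource: forwards alarm data to
    $special_scaling_service and acts on the returned decision."""

    def __init__(self, service_url, scaling_group):
        self.service_url = service_url      # endpoint of the external policy service
        self.scaling_group = scaling_group  # anything with an adjust(delta) method

    def handle_alarm(self, alarm_data):
        # POST the alarm payload; the service replies with e.g. {"adjust": 2}
        req = urllib.request.Request(
            self.service_url,
            data=json.dumps(alarm_data).encode(),
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            self.scaling_group.adjust(parse_decision(resp.read()))
```

The point being that all the custom decision logic lives behind the HTTP call; the resource itself is just a thin delegation shim.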

> A-4) Auto-scale operations are usually not orchestrations. They are usually simple linear workflows.
> 	A-4.1) The Taskflow project[1] offers a simple way to do workflows and stable state management that can be integrated directly into Autoscale.
> 	A-4.2) A task flow (workflow) can trigger a Heat orchestration if needed.

So, we should probably consider the nova group scheduling features here; it
seems like you basically just want a policy service between Ceilometer and
the nova group-scheduling API?

This is fine, provided you never care about dependencies between resources
(instances).  As soon as you start thinking about stuff like clustering, or
notifying other dependent resources, it becomes an orchestration problem
IMO.

> Now a mental tool to think about control policies:
> 
> Auto-scaling is like steering a car. The control policy says that you want to drive equally between the two lane lines, and that if you drift off center, you gradually correct back toward center again. If the road bends, you try to remain in your lane as the lane lines curve. You try not to weave around in your lane, and you try not to drift out of the lane.
> 
> If your controller notices that you are about to drift out of your lane because the road is starting to bend, and you are distracted, or your hands slip off the wheel, you might drift out of your lane into nearby traffic. That's why you don't want a Rube Goldberg Machine[2] between you and the steering wheel. See assertions A-1 and A-2.
> 
> If you are driving an 18-wheel tractor/trailer truck, steering is different than if you are driving a Fiat. You need to wait longer and steer toward the outside of curves so your trailer does not lag behind on the inside of the curve behind you as you correct for a bend in the road. When you are driving the Fiat, you may want to aim for the middle of the lane at all times, possibly even apexing bends to reduce your driving distance, which is actually the opposite of what truck drivers need to do. Control policies apply to other parts of driving too. I want a different policy for braking than I use for steering. On some vehicles I go through a gear shifting workflow, and on others I don't. See assertion A-3.

Thanks for that amusingly verbose lecture in system dynamics ;D

As mentioned previously, I think the delay in the feedback loop caused by
how long it takes to spin up instances means you will need to damp the loop
so heavily (via cooldown periods) to avoid oscillation that any delay
introduced by the controlling services is likely to be insignificant.

> So, I don't intend to argue the technical minutia of each design point, but I challenge you to make sure that we (1) arrive at a simple system that any OpenStack user can comprehend, (2) responds quickly to alarm stimulus, (3) is unlikely to fail, (4) can be easily customized with user-supplied logic that controls how the scaling happens, and under what conditions.

So (1) I think we all want something simple, but also flexible and without
undue duplication.

(2) I think this concern is overstated, as argued above

(3) Sure, there are a few ways to enable this, and a Heat scaling resource
is a valid and useful one IMO (not necessarily the only one)

> It would be better if we could explain Autoscale like this:
> 
> Heat -> Autoscale -> Nova, etc.
> -or-
> User -> Autoscale -> Nova, etc.
> 
> This approach allows use cases where (for whatever reason) the end user does not want to use Heat at all, but still wants something simple to be auto-scaled for them. Nobody would be scratching their heads wondering why things are going in circles.
> 
> From an implementation perspective, that means the auto-scale service needs at least a simple linear workflow capability in it that may trigger a Heat orchestration if there is a good reason for it. This way, the typical use cases don't have anything resembling circular dependencies. The source of truth for how many members are currently in an Autoscaling group should be the Autoscale service, not in the Heat database. If you want to expose that in list-stack-resources output, then cause Heat to call out to the Autoscale service to fetch that figure as needed. It is irrelevant to orchestration. Code does not need to be duplicated. Both Autoscale and Heat can use the same exact source code files for the code that launches/terminates instances of resources.

So I take issue with the "circular dependencies" statement, nothing
proposed so far has anything resembling a circular dependency.

I think it's better to consider traditional encapsulation, where two
projects may very well make use of the same class from a library.  Why is
it any less valid to consider code reuse via another interface (a REST
service)?

The point of the arguments to date, AIUI, is to ensure that orchestration
actions and management of dependencies don't get duplicated in any AS
service which is created.

It seems to me, as previously stated, that what you are really describing
(and seem to want) is an Autoscaling *policy* service, which can act as a
decision point between alarms (from Ceilometer) and scaling actions (in
Heat, or potentially directly to Nova).

The recent rework to enable scaling actions to be triggered directly from
Ceilometer has actually made this much easier, and less tightly coupled to
Heat.  Heat AutoScaling actions can be triggered by a pre-signed webhook
URL, which is passed to Ceilometer when we set up the alarm and called
whenever the alarm fires.
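Conceptually, the trigger path is then just an HTTP POST to that URL. A sketch using Python's stdlib; the URL shown is made up, and real pre-signed URLs carry signature query parameters generated by Heat:

```python
import urllib.request

def build_trigger_request(presigned_url):
    """Build the empty-body POST an alarm evaluator would send to a
    pre-signed webhook URL (all authorization lives in the URL itself)."""
    return urllib.request.Request(presigned_url, data=b"", method="POST")

def fire_alarm(presigned_url, timeout=10):
    """Invoke the scaling policy much as Ceilometer's alarm action would."""
    req = build_trigger_request(presigned_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Because the URL is pre-signed, the caller needs no Keystone token or knowledge of Heat internals, which is what makes the coupling so loose.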

We currently have:

Ceilometer --> Heat --> Nova

You seem to want (where orchestration is required):

Ceilometer --> AS Policy Service --> Heat --> Nova

You seem to want (where orchestration is *not* required):

Ceilometer --> AS Policy Service --> Nova

An optional alternate data-flow would be via a Heat resource representing
the policy service, where heat calls the policy service instead of using
the internal simple policy implementation.
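The two flows differ only at the decision point. A hedged sketch of the routing such a policy service would do; every name here is hypothetical, with `heat_scale` and `nova_scale` standing in for real API clients:

```python
def route_scaling(alarm, policy, heat_scale, nova_scale):
    """Apply a user-supplied policy to an alarm, then dispatch the action.

    alarm:      dict of alarm data from Ceilometer (illustrative shape)
    policy:     callable mapping alarm -> instance-count delta
    heat_scale: callable taking a delta (orchestrated path, via Heat)
    nova_scale: callable taking a delta (direct path, straight to Nova)
    """
    delta = policy(alarm)                  # custom user-supplied logic
    if delta == 0:
        return "noop"                      # policy decided not to act
    if alarm.get("needs_orchestration"):
        heat_scale(delta)                  # Ceilometer -> policy -> Heat -> Nova
        return "heat"
    nova_scale(delta)                      # Ceilometer -> policy -> Nova
    return "nova"
```

Which is exactly why the question of what the policy service adds beyond `policy()` itself matters: everything else in the sketch is plumbing that already exists.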

The only issue I have with this is that I'm still not sure what value the
policy service actually adds, functionally, other than a perceived-to-be
simpler (or more AWS-ish?) AS API, and a way to plug in custom policies
without defining a Heat resource (examples of how you envisage them being
defined might help).

Steve
