[openstack-dev] [heat] Application level HA via Heat

Zane Bitter zbitter at redhat.com
Fri Jan 2 17:47:26 UTC 2015


On 24/12/14 05:17, Steven Hardy wrote:
> On Mon, Dec 22, 2014 at 03:42:37PM -0500, Zane Bitter wrote:
>> On 22/12/14 13:21, Steven Hardy wrote:
>>> Hi all,
>>>
>>> So, lately I've been having various discussions around $subject, and I know
>>> it's something several folks in our community are interested in, so I
>>> wanted to get some ideas I've been pondering out there for discussion.
>>>
>>> I'll start with a proposal of how we might replace HARestarter with
>>> AutoScaling group, then give some initial ideas of how we might evolve that
>>> into something capable of a sort-of active/active failover.
>>>
>>> 1. HARestarter replacement.
>>>
>>> My position on HARestarter has long been that equivalent functionality
>>> should be available via AutoScalingGroups of size 1.  Turns out that
>>> shouldn't be too hard to do:
>>>
>>>   resources:
>>>    server_group:
>>>      type: OS::Heat::AutoScalingGroup
>>>      properties:
>>>        min_size: 1
>>>        max_size: 1
>>>        resource:
>>>          type: ha_server.yaml
>>>
>>>    server_replacement_policy:
>>>      type: OS::Heat::ScalingPolicy
>>>      properties:
>>>        # FIXME: this adjustment_type doesn't exist yet
>>>        adjustment_type: replace_oldest
>>>        auto_scaling_group_id: {get_resource: server_group}
>>>        scaling_adjustment: 1
>>
>> One potential issue with this is that it is a little bit _too_ equivalent to
>> HARestarter - it will replace your whole scaled unit (ha_server.yaml in this
>> case) rather than just the failed resource inside.
>
> Personally I don't see that as a problem, because the interface makes that
> explicit - if you put a resource in an AutoScalingGroup, you expect it to
> get created/deleted on group adjustment, so anything you don't want
> replaced stays outside the group.

I guess I was thinking about having the same mechanism work when the 
size of the scaling group is not fixed at 1.

> Happy to consider other alternatives which do less destructive replacement,
> but to me this seems like the simplest possible way to replace HARestarter
> with something we can actually support long term.

Yeah, I just get uneasy about features that don't compose. Here you have 
to decide between the replacement policy feature and the feature of 
being able to scale out arbitrary stacks. The two uses are so different 
that they almost don't make sense as the same resource. The result will 
be a lot of people implementing scaling groups inside scaling groups in 
order to take advantage of both sets of behaviour.
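
(To be explicit about the shape of that workaround - a sketch, not a
recommendation, with inner.yaml as a made-up name - each member of the
real group ends up being a nested size-1 group:

   resources:
     outer_group:
       type: OS::Heat::AutoScalingGroup
       properties:
         min_size: 2
         max_size: 10
         resource:
           # inner.yaml would contain nothing but a min_size/max_size 1
           # AutoScalingGroup around ha_server.yaml, plus a replacement
           # ScalingPolicy pointing at it
           type: inner.yaml

so the inner groups exist purely to get the per-member replace
behaviour.)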

> Even if "just replace failed resource" is somehow made available later,
> we'll still want to support AutoScalingGroup, and "replace_oldest" is
> likely to be useful in other situations, not just this use-case.
>
> Do you have specific ideas of how the just-replace-failed-resource feature
> might be implemented?  A way for a signal to declare a resource failed so
> convergence auto-healing does a less destructive replacement?
>
>>> So, currently our ScalingPolicy resource can only support three adjustment
>>> types, all of which change the group capacity.  AutoScalingGroup already
>>> supports batched replacements for rolling updates, so if we modify the
>>> interface to allow a signal to trigger replacement of a group member, then
>>> the snippet above should be logically equivalent to HARestarter AFAICT.
>>>
>>> The steps to do this should be:
>>>
>>>   - Standardize the ScalingPolicy-AutoScalingGroup interface, so
>>> asynchronous adjustments (e.g. signals) between the two resources don't use
>>> the "adjust" method.
>>>
>>>   - Add an option to replace a member to the signal interface of
>>> AutoScalingGroup
>>>
>>>   - Add the new "replace adjustment type to ScalingPolicy
>>
>> I think I am broadly in favour of this.
>
> Ok, great - I think we'll probably want replace_oldest, replace_newest, and
> replace_specific, such that both alarm and operator driven replacement have
> flexibility over what member is replaced.

We probably want to allow users to specify the replacement policy (e.g. 
oldest first vs. newest first) for the scaling group itself to use when 
scaling down or during rolling updates. If we had that, we'd probably 
only need a single "replace" adjustment type - if a particular member is 
specified in the message then it would replace that specific one, 
otherwise the scaling group would choose which to replace based on the 
specified policy.
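
To make that concrete, a hypothetical snippet (neither the
replacement_policy property nor the plain "replace" adjustment type
exists today):

   server_group:
     type: OS::Heat::AutoScalingGroup
     properties:
       min_size: 2
       max_size: 2
       # hypothetical: how the group picks a victim when scaling down,
       # doing rolling updates, or replacing an unspecified member
       replacement_policy: oldest_first
       resource:
         type: ha_server.yaml

   server_replacement_policy:
     type: OS::Heat::ScalingPolicy
     properties:
       # hypothetical: one generic replace type instead of three
       adjustment_type: replace
       auto_scaling_group_id: {get_resource: server_group}
       scaling_adjustment: 1

with the signal payload optionally naming the specific member to
replace.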

>>> I posted a patch which implements the first step, and the second will be
>>> required for TripleO, i.e. we should be doing it soon.
>>>
>>> https://review.openstack.org/#/c/143496/
>>> https://review.openstack.org/#/c/140781/
>>>
>>> 2. A possible next step towards active/active HA failover
>>>
>>> The next part is the ability to notify before replacement that a scaling
>>> action is about to happen (just like we do for LoadBalancer resources
>>> already) and orchestrate some or all of the following:
>>>
>>> - Attempt to quiesce the currently active node (may be impossible if it's
>>>    in a bad state)
>>>
>>> - Detach resources (e.g. volumes, primarily?) from the current active node,
>>>    and attach them to the new active node
>>>
>>> - Run some config action to activate the new node (e.g. run some config
>>>    script to fsck and mount a volume, then start some application).
>>>
>>> The first step is possible by putting a SoftwareConfig/SoftwareDeployment
>>> resource inside ha_server.yaml (using NO_SIGNAL so we don't fail if the
>>> node is too bricked to respond and specifying DELETE action so it only runs
>>> when we replace the resource).
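
As a concrete sketch of that first step, inside ha_server.yaml -
assuming a script-group config, with my_app as a placeholder service
name:

   quiesce_config:
     type: OS::Heat::SoftwareConfig
     properties:
       group: script
       config: |
         #!/bin/sh
         # best-effort quiesce; this may never run if the node is dead
         service my_app stop || true

   quiesce_deployment:
     type: OS::Heat::SoftwareDeployment
     properties:
       config: {get_resource: quiesce_config}
       server: {get_resource: server}
       actions: [DELETE]            # only run when the member is replaced
       signal_transport: NO_SIGNAL  # don't block waiting on a bricked node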
>>>
>>> The third step is possible either via a script inside the box which polls
>>> for the volume attachment, or possibly via an update-only software config.
>>>
>>> The second step is the missing piece AFAICS.
>>>
>>> I've been wondering if we can do something inside a new heat resource,
>>> which knows what the current "active" member of an ASG is, and gets
>>> triggered on a "replace" signal to orchestrate e.g deleting and creating a
>>> VolumeAttachment resource to move a volume between servers.
>>>
>>> Something like:
>>>
>>>   resources:
>>>    server_group:
>>>      type: OS::Heat::AutoScalingGroup
>>>      properties:
>>>        min_size: 2
>>>        max_size: 2
>>>        resource:
>>>          type: ha_server.yaml
>>>
>>>    server_failover_policy:
>>>      type: OS::Heat::FailoverPolicy
>>>      properties:
>>>        auto_scaling_group_id: {get_resource: server_group}
>>>        resource:
>>>          type: OS::Cinder::VolumeAttachment
>>>          properties:
>>>              # FIXME: "refs" is a ResourceGroup interface not currently
>>>              # available in AutoScalingGroup
>>>              instance_uuid: {get_attr: [server_group, refs, 1]}
>>>
>>>    server_replacement_policy:
>>>      type: OS::Heat::ScalingPolicy
>>>      properties:
>>>        # FIXME: this adjustment_type doesn't exist yet
>>>        adjustment_type: replace_oldest
>>>        auto_scaling_policy_id: {get_resource: server_failover_policy}
>>>        scaling_adjustment: 1
>>
>> This actually fails because a VolumeAttachment needs to be updated in place;
>> if you try to switch servers but keep the same Volume when replacing the
>> attachment you'll get an error.
>
> Doh, you're right, so FailoverPolicy would need to know how to delete then
> recreate the resource instead of doing an in-place update.

Other way around.

Well, actually there are two options, I guess. We could have a 
FailoverPolicy that deletes the old resource before creating the new 
one - the opposite order to existing stack updates, which implies new 
code that doesn't rely on the existing update mechanism. The other 
option is to use the usual update mechanism to do an in-place update 
if possible - but in that case you don't need the FailoverPolicy 
resource at all; a regular update on the main template would have the 
same effect (as discussed below).
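
That is, the main template would just contain something like:

   volume_attachment:
     type: OS::Cinder::VolumeAttachment
     properties:
       volume_id: {get_resource: shared_volume}  # placeholder volume
       # FIXME: "refs" is a ResourceGroup interface not currently
       # available in AutoScalingGroup
       instance_uuid: {get_attr: [server_group, refs, 1]}

and an update with the same template & params would re-resolve the
get_attr and update the attachment in place to point at the
replacement server.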

>> TBH {get_attr: [server_group, refs, 1]} is doing most of the heavy lifting
>> here, so in theory you could just have an OS::Cinder::VolumeAttachment
>> instead of the FailoverPolicy and then all you need is a way of triggering a
>> stack update with the same template & params. I know Ton added a PATCH
>> method to update in Juno so that you don't have to pass parameters any more,
>> and I believe it's planned to do the same with the template.
>
> Interesting, any thoughts on what the template-level interface to that
> PATCH update might look like?  (I'm guessing you'll probably say a mistral
> resource?)

Hmm, interesting question. It would be possible to pass a stack ID as a 
property to the scaling policy (in the given example you'd pass 
{get_param: OS::stack_id}) to have it trigger an update on some stack. 
(In fact, assuming that OS::Heat::FailoverPolicy is implemented as a 
nested stack, that's identical in implementation to what you proposed.) 
In a post-convergence world you can even imagine that it wouldn't need 
to be specified, and that an update to a child stack would always cause 
a re-evaluation of the parent.
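
Hypothetically, something like this (update_stack is a made-up
property name for the sake of argument):

   server_replacement_policy:
     type: OS::Heat::ScalingPolicy
     properties:
       adjustment_type: replace_oldest  # FIXME: doesn't exist yet
       auto_scaling_group_id: {get_resource: server_group}
       scaling_adjustment: 1
       # hypothetical: a stack to PATCH-update after the adjustment
       # completes - here, the stack containing this policy
       update_stack: {get_param: "OS::stack_id"}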

Of course if you want a pluggable framework with potentially multiple 
sources of alarms and user-defined (rather than hard-coded) actions, 
then it's hard to go past Mistral.
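
Purely as an illustration of the shape - and, per your questions
below, without claiming this is valid against any particular version
of the DSL - a user-defined workflow might look vaguely like:

   version: '2.0'

   ha_failover:
     type: direct
     input:
       - stack_id
     tasks:
       update_stack:
         # assumption: some Heat stack-update action exists; the
         # action name and inputs here are illustrative only
         action: heat.stacks_update
         input:
           stack_id: <% $.stack_id %>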

>>> By chaining policies like this we could trigger an update on the attachment
>>> resource (or a nested template via a provider resource containing many
>>> attachments or other resources) every time the ScalingPolicy is triggered.
>>>
>>> For the sake of clarity, I've not included the existing stuff like
>>> ceilometer alarm resources etc above, but hopefully it gets the idea
>>> across so we can discuss further - what are people's thoughts?  I'm quite
>>> happy to iterate on the idea if folks have suggestions for a better
>>> interface etc :)
>>>
>>> One problem I see with the above approach is you'd have to trigger a
>>> failover after stack create to get the initial volume attached, still
>>> pondering ideas on how best to solve that...
>>
>> To me this is falling into the same old trap of "hey, we want to run this
>> custom workflow, all we need to do is add a new resource type to hang some
>> code on". That's pretty much how we got HARestarter.
>>
>> Also, like HARestarter, this cannot hope to cover the range of possible
>> actions that might be needed by various applications.
>>
>> IMHO the "right" way to implement this is that the Ceilometer alarm triggers
>> a workflow in Mistral that takes the appropriate action defined by the user,
>> which may (or may not) include updating the Heat stack to a new template
>> where the shared storage gets attached to a different server.
>
> Ok, I'm quite happy to accept this may be a better long-term solution, but
> can anyone comment on the current maturity level of Mistral?  Questions
> which spring to mind are:
>
> - Is the DSL stable now?
> - What's the roadmap re incubation? (There are a lot of TBDs here:
>      https://wiki.openstack.org/wiki/Mistral/Incubation)
> - How does deferred authentication work for alarm-triggered workflows, e.g.
>    if a ceilometer alarm (which authenticates as a stack domain user) needs
>    to signal Mistral to start a workflow?
>
> I guess a first step is creating a contrib Mistral resource and
> investigating it, but it would be great if anyone has first-hand
> experiences they can share before we burn too much time digging into it.
>
> Cheers,
>
> Steve