<tt><font size=2>Qiming Teng &lt;tengqim@linux.vnet.ibm.com&gt; wrote

on 07/02/2014 03:02:14 AM:<br>

<br>

> Just some random thoughts below ...<br>

> <br>

> On Tue, Jul 01, 2014 at 03:47:03PM -0400, Mike Spreitzer wrote:</font></tt>

<br><tt><font size=2>> > ...<br>

> > I have not found design discussion of this; have I missed something?<br>

> > <br>

> > I suppose the natural answer for OpenStack would be centered

around <br>

> > webhooks...  <br>

> <br>

> Well, I would suggest we generalize this into a event messaging or<br>

> signaling solution, instead of just 'webhooks'.  The reason is

that<br>

> webhooks as it is implemented today is not carrying a payload of useful<br>

> information -- I'm referring to the alarms in Ceilometer.</font></tt>

<br>

<br><tt><font size=2>OK, this is great (and Steve Hardy provided more details

in his reply); I did not know about the existing ability to carry a payload.

 However Ceilometer alarms are still deficient in that way, right?

 A Ceilometer alarm's action list is simply a list of URLs, right?

 I would be happy to say let's generalize Ceilometer alarms to allow

a payload in an action.</font></tt>

<br><tt><font size=2><br>

> There are other cases as well.  A member failure could be caused

by a <br>

> temporary communication problem, which means it may show up quickly

when<br>

> a replacement member is already being created.  It may mean that

we have<br>

> to respond to an 'online' event in addition to an 'offline' event?</font></tt>

<br><tt><font size=2>> ...<br>

> The problem here today is about the recovery of SG member.  If

it is a<br>

> compute instance, we can 'reboot', 'rebuild', 'evacuate', 'migrate'

it,<br>

> just to name a few options.  The most brutal way to do this is

like what<br>

> HARestarter is doing today -- delete followed by a create.<br>

</font></tt>

<br><tt><font size=2>We could get into arbitrary subtlety, and maybe eventually

will do better, but I think we can start with a simple solution that is

widely applicable.  The simple solution is that once the decision

has been made to do convergence on a member (note that this is distinct

from merely detecting and noting a divergence) then it will be done regardless

of whether the doomed member later appears to have recovered, and the convergence

action for a scaling group member is to delete the old member and create

a replacement (not in that order).<br>
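</font></tt>

<br><tt><font size=2>To make that rule concrete, here is a minimal sketch in Python. It is only a toy model under my own assumptions: the ScalingGroup class and every method name in it are hypothetical, not Heat code. It shows the three points above: noting a divergence changes nothing, a convergence decision is not revoked by a later sign of recovery, and the replacement is created before the old member is deleted.</font></tt>

<pre>
```python
import uuid


class ScalingGroup:
    """Toy model of a scaling group (hypothetical, not Heat's implementation)."""

    def __init__(self, size):
        self.members = {self._new_id() for _ in range(size)}
        self.doomed = set()   # members already committed to convergence
        self.log = []         # ordered record of what happened

    @staticmethod
    def _new_id():
        return uuid.uuid4().hex

    def note_divergence(self, member_id):
        """Detecting and noting a divergence does not change membership."""
        self.log.append(("noted", member_id))

    def decide_convergence(self, member_id):
        """Once this decision is made, it is final."""
        if member_id in self.members:
            self.doomed.add(member_id)

    def report_recovered(self, member_id):
        """A later sign of life does not cancel an earlier decision."""
        ignored = member_id in self.doomed
        self.log.append(("recovered-ignored" if ignored else "recovered", member_id))

    def converge(self):
        """Create each replacement first, then delete the doomed member."""
        for old in sorted(self.doomed):
            new = self._new_id()
            self.members.add(new)
            self.log.append(("create", new))
            self.members.remove(old)
            self.log.append(("delete", old))
        self.doomed.clear()
```
</pre>

<tt><font size=2>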

<br>

> > When the member is a nested stack and Ceilometer exists, it could

be the <br>

> > member stack's responsibility to include a Ceilometer alarm that

detects <br>

> > the member stack's death and hit the member stack's deletion

webhook. <br>

> <br>

> This is difficult.  A '(nested) stack' is a Heat specific abstraction

--<br>

> recall that we have to annotate a nova server resource in its metadata<br>

> to which stack this server belongs.  Besides the 'visible' resources<br>

> specified in a template, Heat may create internal data structures

and/or<br>

> resources (e.g. users) for a stack.  I am not quite sure a stack's

death<br>

> can be easily detected from outside Heat.  It would be at least<br>

> cumbersome to have Heat notify Ceilometer that a stack is dead, and

then<br>

> have Ceilometer send back a signal.</font></tt>

<br>

<br><tt><font size=2>A (nested) stack is not only a Heat-specific abstraction;

its semantics and failure modes are also specific to the stack (or at least to

its template).  I think we have no practical choice but to let the

template author declare how failure is detected.  It could be as simple

as creating Ceilometer alarms that detect the death of one or more resources

in the nested stack; it could be more complicated Ceilometer stuff; it

could be based on something other than, or in addition to, Ceilometer.

 If today there are not enough sensors to detect failures of all kinds

of resources, I consider that a gap in telemetry (and think it is small

enough that we can proceed usefully today, and should plan on filling that

gap over time).</font></tt>

<br><tt><font size=2><br>

> > There is a small matter of how the author of the template used

to create <br>

> > the member stack writes some template snippet that creates a

Ceilometer <br>

> > alarm that is specific to a member stack that does not exist

yet.  <br>

> <br>

> How about just one signal responder per ScalingGroup?  A SG is

supposed<br>

> to be in a better position to make the judgement: do I have to recreate<br>

> a failed member? am I recreating it right now or wait a few seconds?<br>

> maybe I should recreate the member on some specific AZs?</font></tt>

<br>

<br><tt><font size=2>That is confusing two issues.  The thing that

is new here is making the scaling group recognize member failure; the primary

reaction is to update its accounting of members (which, in the current

code, must be done by making sure the failed member is deleted); recovery

of other scaling group aspects is fairly old hat; it is analogous to the

problems that the scaling group already solves when asked to increase its

size.<br>

<br>

> ...<br>

> > I suppose we could stipulate that if the member template includes

a <br>

> > parameter with name "member_name" and type "string"

then the OS OG takes <br>

> > care of supplying the correct value of that parameter; as illustrated

in <br>

> > the asg_of_stacks.yaml of </font></tt><a href=https://review.openstack.org/#/c/97366/><tt><font size=2>https://review.openstack.org/#/c/97366/</font></tt></a><tt><font size=2>

, a <br>

> > member template can use a template parameter to tag Ceilometer

data for <br>

> > querying.  The URL of the member stack's deletion webhook

could be passed <br>

> > to the member template via the same sort of convention.  <br>

> <br>

> I am not in favor of the per-member webhook design.  But I vote

for an<br>

> additional *implicit* parameter to a nested stack of any groups.  It<br>

> could be an index or a name.</font></tt>

<br>

<br><tt><font size=2>Right, I was elaborating on a particular formulation

of "implicit parameter".  In particular, I suggested an

"implicit parameter value" for an optional explicit parameter.

 We could make the parameter declaration implicit, but that (1) is

a bit irregular (reminiscent of "modes") if we only do it for

stacks that are scaling group members and (2) is equivalent to the existing

concept of pseudo-parameters if we do it for all stacks.  I would

be content with adding a pseudo-parameter for all stacks that is the UUID

of the stack.  The index of the member in the group could be problematic,

as those are re-used; the UUID is not re-used.  Names also have issues

with uniqueness.</font></tt>
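
<br><tt><font size=2>A small Python sketch of why I prefer the UUID as the tag. The Group class is hypothetical, and I am assuming (for illustration only) that a group hands the lowest free index to each new member. Under that assumption a replacement inherits the index of the member it replaces, while its UUID is fresh.</font></tt>

<pre>
```python
import uuid


class Group:
    """Toy group; assumes each new member gets the lowest free index."""

    def __init__(self):
        self.by_index = {}  # member index mapped to stack UUID

    def add_member(self):
        index = 0
        while index in self.by_index:
            index += 1
        stack_id = str(uuid.uuid4())
        self.by_index[index] = stack_id
        return index, stack_id

    def remove_member(self, index):
        return self.by_index.pop(index)


group = Group()
first_index, first_id = group.add_member()
group.add_member()
group.remove_member(first_index)        # the first member fails and is deleted
new_index, new_id = group.add_member()  # its replacement is created

# Data tagged by index would mix the old member with its replacement;
# data tagged by UUID keeps them apart.
index_reused = new_index == first_index
uuid_reused = new_id == first_id
```
</pre>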

<br><tt><font size=2><br>

> > When Ceilometer <br>

> > does not exist, it is less obvious to me what could usefully

be done.  Are <br>

> > there any useful SG member types besides Compute instances and

nested <br>

> > stacks?  Note that a nested stack could also pass its member

deletion <br>

> > webhook to a load balancer (that is willing to accept such a

thing, of <br>

> > course), so we get a lot of unity of mechanism between the case

of <br>

> > detection by infrastructure vs. application level detection.<br>

> > <br>

> <br>

> I'm a little bit concerned about passing the member deletion webhook

to<br>

> LB.  Maybe we need to rethink about this: do we really want to

bring<br>

> application level design considerations down to the infrastructure

level?</font></tt>

<br>

<br><tt><font size=2>I look at it this way: do we want two completely independent

loops of detection and response, or shall we share a common response mechanism

with two different levels of detection?  I think both want the same

response, and so recommend a shared response mechanism.</font></tt>
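
<br><tt><font size=2>Roughly what I have in mind, as a Python sketch. All names here are hypothetical; the hook object stands in for a member's deletion webhook, which in reality would be an HTTP endpoint. Two detectors at different levels report to the same hook, and the response happens once per member no matter which detector fires or how often.</font></tt>

<pre>
```python
class MemberDeletionHook:
    """Stand-in for a member deletion webhook (hypothetical)."""

    def __init__(self):
        self.deleted = []   # members for which the response has run
        self.reports = []   # every report received, with its source

    def __call__(self, member_id, source):
        self.reports.append((member_id, source))
        if member_id not in self.deleted:   # respond once per member
            self.deleted.append(member_id)


def infrastructure_detector(hook, member_id, instance_active):
    """Infrastructure-level detection, e.g. an alarm on a dead instance."""
    if not instance_active:
        hook(member_id, "infrastructure")


def application_detector(hook, member_id, health_check_ok):
    """Application-level detection, e.g. a load balancer health check."""
    if not health_check_ok:
        hook(member_id, "application")
```
</pre>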

<br><tt><font size=2><br>

> Some of the detection work might be covered by the observer engine

specs<br>

> that is under review.  My doubt about it is about how to make

it "listen<br>

> only to what need to know while ignore everything else".</font></tt>

<br>

<br><tt><font size=2>I am not sure what you mean by that.  If this

is about the case of the group members being nested stacks, I go back to

the idea that it must be up to the nested template author to define failure

(via declaring how to detect it).<br>

<br>

> > I am not entirely happy with the idea of a webhook per member.

 If I <br>

> > understand correctly, generating webhooks is a somewhat expensive

and <br>

> > problematic process.  What would be the alternative?<br>

> <br>

> My understanding is that the webhooks' problem is not about cost,

it is<br>

> more about authentication and flexibility.  Steve Hardy and Thomas

Herve<br>

> are already looking into the authentication problem.<br>

</font></tt>

<br><tt><font size=2>I was not disagreeing; I was including those in "problematic".</font></tt>

<br>

<br><tt><font size=2>Thanks,</font></tt>

<br><tt><font size=2>Mike</font></tt>

<br>

<br>