<tt><font size=2>Qiming Teng <tengqim@linux.vnet.ibm.com> wrote
on 07/02/2014 03:02:14 AM:<br>
<br>
> Just some random thoughts below ...<br>
> <br>
> On Tue, Jul 01, 2014 at 03:47:03PM -0400, Mike Spreitzer wrote:</font></tt>
<br><tt><font size=2>> > ...<br>
> > I have not found design discussion of this; have I missed something?<br>
> > <br>
> > I suppose the natural answer for OpenStack would be centered
around <br>
> > webhooks... <br>
> <br>
> Well, I would suggest we generalize this into a event messaging or<br>
> signaling solution, instead of just 'webhooks'. The reason is
that<br>
> webhooks as it is implemented today is not carrying a payload of useful<br>
> information -- I'm referring to the alarms in Ceilometer.</font></tt>
<br>
<br><tt><font size=2>OK, this is great (and Steve Hardy provided more details
in his reply); I did not know about the existing ability to carry a payload.
However, Ceilometer alarms are still deficient in that way, right?
A Ceilometer alarm's action list is simply a list of URLs, right?
I would be happy to say let's generalize Ceilometer alarms to allow
a payload in an action.</font></tt>
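<br>
<br><tt><font size=2>To make that concrete, here is a minimal sketch of
what an action with a payload might look like next to today's plain URL.
Everything in it is an assumption for illustration: AlarmAction, normalize_action,
build_request, the "state" and "member" keys, and the example.com URL are
hypothetical and are not Ceilometer's API.</font></tt>
<pre>
```python
import json
from dataclasses import dataclass, field


@dataclass
class AlarmAction:
    """Hypothetical generalized alarm action: a URL plus an optional payload."""
    url: str
    payload: dict = field(default_factory=dict)


def normalize_action(action):
    """Accept today's plain URL string or the proposed url-plus-payload mapping."""
    if isinstance(action, str):
        return AlarmAction(url=action)
    return AlarmAction(url=action["url"], payload=dict(action.get("payload", {})))


def build_request(action, alarm_state):
    """Return the (url, body) a notifier would POST; no network I/O happens here."""
    body = {"state": alarm_state}
    body.update(action.payload)
    return action.url, json.dumps(body, sort_keys=True)


# An old-style action and a new-style action side by side in one action list.
actions = [
    "http://heat.example.com/signal/abc",
    {"url": "http://heat.example.com/signal/abc", "payload": {"member": "m-1"}},
]
requests = [build_request(normalize_action(a), "alarm") for a in actions]
```
</pre>
<tt><font size=2>Accepting both forms would keep existing action lists
working while letting new ones say which member the signal is about.</font></tt>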
<br><tt><font size=2><br>
> There are other cases as well. A member failure could be caused
by a <br>
> temporary communication problem, which means it may show up quickly
when<br>
> a replacement member is already being created. It may mean that
we have<br>
> to respond to an 'online' event in addition to an 'offline' event?</font></tt>
<br><tt><font size=2>> ...<br>
> The problem here today is about the recovery of SG member. If
it is a<br>
> compute instance, we can 'reboot', 'rebuild', 'evacuate', 'migrate'
it,<br>
> just to name a few options. The most brutal way to do this is
like what<br>
> HARestarter is doing today -- delete followed by a create.<br>
</font></tt>
<br><tt><font size=2>We could get into arbitrary subtlety, and maybe eventually
will do better, but I think we can start with a simple solution that is
widely applicable. The simple solution is that once the decision
has been made to do convergence on a member (note that this is distinct
from merely detecting and noting a divergence), it will be done regardless
of whether the doomed member later appears to have recovered. The convergence
action for a scaling group member is to create a replacement and delete
the old member, in that order.<br>
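</font></tt>
<br>
<br><tt><font size=2>A minimal sketch of that policy, using a toy in-memory
group rather than real Heat code. The class and method names (ScalingGroup,
note_divergence, decide_convergence, member_recovered, converge) are hypothetical.
It shows the two points that matter: noting a divergence commits to nothing,
and once convergence is decided a late recovery is ignored, with the replacement
created before the old member is deleted.</font></tt>
<pre>
```python
import itertools


class ScalingGroup:
    """Toy in-memory group; the log entries stand in for real Heat operations."""

    def __init__(self, member_ids):
        self.members = set(member_ids)
        self.doomed = set()
        self.log = []
        self._ids = itertools.count(1)

    def note_divergence(self, member_id):
        """Detecting and noting a divergence does not commit to anything."""
        self.log.append(("noted", member_id))

    def decide_convergence(self, member_id):
        """The decision is final: the member is doomed from here on."""
        if member_id in self.members:
            self.doomed.add(member_id)

    def member_recovered(self, member_id):
        """A late 'online' event does not rescue a doomed member."""
        if member_id in self.doomed:
            self.log.append(("ignored_recovery", member_id))

    def converge(self):
        """Create each replacement first, then delete the doomed member."""
        for old in sorted(self.doomed):
            new = "replacement-%d" % next(self._ids)
            self.members.add(new)
            self.log.append(("create", new))
            self.members.discard(old)
            self.log.append(("delete", old))
        self.doomed.clear()


group = ScalingGroup(["a", "b"])
group.note_divergence("a")
group.decide_convergence("a")
group.member_recovered("a")  # too late, the decision stands
group.converge()
```
</pre>
<tt><font size=2>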
<br>
> > When the member is a nested stack and Ceilometer exists, it could
be the <br>
> > member stack's responsibility to include a Ceilometer alarm that
detects <br>
> > the member stack's death and hit the member stack's deletion
webhook. <br>
> <br>
> This is difficult. A '(nested) stack' is a Heat specific abstraction
--<br>
> recall that we have to annotate a nova server resource in its metadata<br>
> to which stack this server belongs. Besides the 'visible' resources<br>
> specified in a template, Heat may create internal data structures
and/or<br>
> resources (e.g. users) for a stack. I am not quite sure a stack's
death<br>
> can be easily detected from outside Heat. It would be at least<br>
> cumbersome to have Heat notify Ceilometer that a stack is dead, and
then<br>
> have Ceilometer send back a signal.</font></tt>
<br>
<br><tt><font size=2>A (nested) stack is not only a Heat-specific abstraction
but its semantics and failure modes are specific to the stack (at least,
its template). I think we have no practical choice but to let the
template author declare how failure is detected. It could be as simple
as creating Ceilometer alarms that detect the death of one or more resources
in the nested stack; it could be more complicated Ceilometer stuff; it
could be based on something other than, or in addition to, Ceilometer.
If today there are not enough sensors to detect failures of all kinds
of resources, I consider that a gap in telemetry (and think it is small
enough that we can proceed usefully today, and should plan on filling that
gap over time).</font></tt>
<br><tt><font size=2><br>
> > There is a small matter of how the author of the template used
to create <br>
> > the member stack writes some template snippet that creates a
Ceilometer <br>
> > alarm that is specific to a member stack that does not exist
yet. <br>
> <br>
> How about just one signal responder per ScalingGroup? A SG is
supposed<br>
> to be in a better position to make the judgement: do I have to recreate<br>
> a failed member? am I recreating it right now or wait a few seconds?<br>
> maybe I should recreate the member on some specific AZs?</font></tt>
<br>
<br><tt><font size=2>That conflates two issues. The thing that
is new here is making the scaling group recognize member failure; the primary
reaction is to update its accounting of members (which, in the current
code, must be done by making sure the failed member is deleted). Recovery
of other scaling group aspects is fairly old hat: it is analogous to the
problems that the scaling group already solves when asked to increase its
size.<br>
<br>
> ...<br>
> > I suppose we could stipulate that if the member template includes
a <br>
> > parameter with name "member_name" and type "string"
then the OS OG takes <br>
> > care of supplying the correct value of that parameter; as illustrated
in <br>
> > the asg_of_stacks.yaml of </font></tt><a href=https://review.openstack.org/#/c/97366/><tt><font size=2>https://review.openstack.org/#/c/97366/</font></tt></a><tt><font size=2>
, a <br>
> > member template can use a template parameter to tag Ceilometer
data for <br>
> > querying. The URL of the member stack's deletion webhook
could be passed <br>
> > to the member template via the same sort of convention. <br>
> <br>
> I am not in favor of the per-member webhook design. But I vote
for an<br>
> additional *implicit* parameter to a nested stack of any groups. It<br>
> could be an index or a name.</font></tt>
<br>
<br><tt><font size=2>Right, I was elaborating on a particular formulation
of "implicit parameter". In particular, I suggested an
"implicit parameter value" for an optional explicit parameter.
We could make the parameter declaration implicit, but that (1) is
a bit irregular (reminiscent of "modes") if we only do it for
stacks that are scaling group members and (2) is equivalent to the existing
concept of pseudo-parameters if we do it for all stacks. I would
be content with adding a pseudo-parameter for all stacks that is the UUID
of the stack. The index of the member in the group could be problematic,
as those are re-used; the UUID is not re-used. Names also have issues
with uniqueness.</font></tt>
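<br>
<br><tt><font size=2>A small illustration of why I worry about the index.
This is a toy model, not Heat code: the Group class is hypothetical, and
the allocation rule (hand out the lowest free index) is an assumption made
only to show the reuse. Data tagged with the index would mix two different
members; data tagged with the UUID would not.</font></tt>
<pre>
```python
import uuid


class Group:
    """Toy group giving each new member the lowest free index and a fresh UUID."""

    def __init__(self):
        self.by_index = {}

    def add_member(self):
        index = 0
        while index in self.by_index:
            index += 1
        self.by_index[index] = str(uuid.uuid4())
        return index, self.by_index[index]

    def remove_member(self, index):
        return self.by_index.pop(index)


group = Group()
first_index, first_uuid = group.add_member()
group.add_member()
group.remove_member(first_index)
# The replacement lands on the freed index but gets a new UUID.
reused_index, new_uuid = group.add_member()
```
</pre>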
<br><tt><font size=2><br>
> > When Ceilometer <br>
> > does not exist, it is less obvious to me what could usefully
be done. Are <br>
> > there any useful SG member types besides Compute instances and
nested <br>
> > stacks? Note that a nested stack could also pass its member
deletion <br>
> > webhook to a load balancer (that is willing to accept such a
thing, of <br>
> > course), so we get a lot of unity of mechanism between the case
of <br>
> > detection by infrastructure vs. application level detection.<br>
> > <br>
> <br>
> I'm a little bit concerned about passing the member deletion webhook
to<br>
> LB. Maybe we need to rethink about this: do we really want to
bring<br>
> application level design considerations down to the infrastructure
level?</font></tt>
<br>
<br><tt><font size=2>I look at it this way: do we want two completely independent
loops of detection and response, or shall we share a common response mechanism
with two different levels of detection? I think both want the same
response, and so recommend a shared response mechanism.</font></tt>
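<br>
<br><tt><font size=2>Roughly what I have in mind, as a sketch with hypothetical
names (MemberDeletionHook, infrastructure_detector, application_detector).
The two detectors stand in for a Ceilometer alarm and a load balancer health
check; neither is real Ceilometer or LBaaS code. Both call the same hook,
and the hook tolerates the same member being reported twice.</font></tt>
<pre>
```python
class MemberDeletionHook:
    """Single response mechanism: drop a failed member, whoever reports it."""

    def __init__(self, members):
        self.members = set(members)
        self.reports = []

    def __call__(self, member_id, source):
        self.reports.append((member_id, source))
        self.members.discard(member_id)  # idempotent for duplicate reports


def infrastructure_detector(hook, dead_servers):
    """Stand-in for an alarm firing on infrastructure-level data."""
    for member_id in dead_servers:
        hook(member_id, source="infrastructure")


def application_detector(hook, failed_health_checks):
    """Stand-in for a load balancer reporting failed health checks."""
    for member_id in failed_health_checks:
        hook(member_id, source="application")


hook = MemberDeletionHook(["m1", "m2", "m3"])
infrastructure_detector(hook, ["m1"])
application_detector(hook, ["m1", "m3"])
```
</pre>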
<br><tt><font size=2><br>
> Some of the detection work might be covered by the observer engine
specs<br>
> that is under review. My doubt about it is about how to make
it "listen<br>
> only to what need to know while ignore everything else".</font></tt>
<br>
<br><tt><font size=2>I am not sure what you mean by that. If this
is about the case of the group members being nested stacks, I go back to
the idea that it must be up to the nested template author to define failure
(via declaring how to detect it).<br>
<br>
> > I am not entirely happy with the idea of a webhook per member.
If I <br>
> > understand correctly, generating webhooks is a somewhat expensive
and <br>
> > problematic process. What would be the alternative?<br>
> <br>
> My understanding is that the webhooks' problem is not about cost,
it is<br>
> more about authentication and flexibility. Steve Hardy and Thomas
Herve<br>
> are already looking into the authentication problem.<br>
</font></tt>
<br><tt><font size=2>I was not disagreeing; I was including those in "problematic".</font></tt>
<br>
<br><tt><font size=2>Thanks,</font></tt>
<br><tt><font size=2>Mike</font></tt>
<br>
<br>