<font size=2 face="sans-serif">Background: Health maintenance is very

important to users, and I have users who want to do it now and into the

future.  Today a Heat user can write a template that maintains the

health of a resource R.  The detection of a health problem can be

done by anything that hits a webhook.  That generality is important;

it is not sufficient to determine health by looking at what physical and/or

virtual resources exist; it is also highly desirable to test whether these

things are functioning well (e.g., the URL-based health checking possible

through an OS::Neutron::Pool, or an external system of the user's own

that detects health problems).  The webhook is provided by an OS::Heat::HARestarter

(note the misleading name: such a thing does not restart anything; rather, it deletes

and re-creates a given resource and all its dependents), which here deletes and

re-creates R and its health detection/recovery wiring.  For a more

specific example, consider the case of detection using the services of

an OS::Neutron::Pool.  Note that it is not even necessary for there

to be workload traffic through the associated OS::Neutron::LoadBalancer;

all we are using here is the monitoring prescribed by the Pool's OS::Neutron::HealthMonitor.

 The user's template has, in addition to R, three things: (1) an OS::Neutron::PoolMember

that puts R in the Pool, (2) an OS::Heat::HARestarter that deletes and

re-creates R and all its dependents, and (3) a Ceilometer alarm that detects

when Neutron is reporting that the PoolMember is unhealthy and responds

by hitting the HARestarter's webhook.  Note that all three of those

are dependent on R, and thus are deleted and re-created when the HARestarter's

webhook is hit; this avoids most of the noted issues with HARestarter.

 R can be a stack that includes both a Nova server and an OS::Neutron::Port,

to work around a Nova bug with implicit ports.</font>
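<br>
<br><font size=2 face="sans-serif">To make the Pool-based example concrete,
here is a rough, untested sketch of such a template.  R is simplified
to a single OS::Nova::Server rather than a nested stack.  The parameter
names, the meter name, the threshold, and the matching_metadata key in the
alarm are my assumptions for illustration only; the right values depend on
how Ceilometer reports PoolMember health in a given deployment.</font>
<pre>
heat_template_version: 2013-05-23

parameters:
  image: {type: string}
  flavor: {type: string}
  pool_id: {type: string}        # ID of an existing OS::Neutron::Pool
  app_port: {type: number, default: 80}

resources:
  # R: the resource whose health is maintained
  server:
    type: OS::Nova::Server
    properties:
      image: {get_param: image}
      flavor: {get_param: flavor}

  # (1) puts R in the Pool; depends on R through its address
  member:
    type: OS::Neutron::PoolMember
    properties:
      pool_id: {get_param: pool_id}
      address: {get_attr: [server, first_address]}
      protocol_port: {get_param: app_port}

  # (2) deletes and re-creates R and all its dependents when signalled
  restarter:
    type: OS::Heat::HARestarter
    properties:
      InstanceId: {get_resource: server}

  # (3) fires the restarter's webhook when the member looks unhealthy
  unhealthy_alarm:
    type: OS::Ceilometer::Alarm
    properties:
      meter_name: network.services.lb.member   # assumed meter name
      statistic: avg
      period: 60
      evaluation_periods: 1
      comparison_operator: lt
      threshold: 1                             # placeholder value
      matching_metadata:
        resource_id: {get_resource: member}    # assumed matching key
      alarm_actions:
        - {get_attr: [restarter, AlarmUrl]}
</pre>
<font size=2 face="sans-serif">The point of the sketch is the dependency
structure: the member, the restarter, and the alarm all refer to the server
directly or indirectly, so all of them are replaced along with it when the
webhook is hit.</font>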

<br>

<br><font size=2 face="sans-serif">There is a movement afoot to remove

HARestarter.  My concern is what users can do, now and into the future.

 The first and most basic issue is this: at every step in the roadmap,

it must be possible for users to accomplish health maintenance.  The

second issue is easing the impact on what users write.  It would be

pretty bad if the roadmap looked like this: before point X, users can only

accomplish health maintenance as I outlined above, and from point X onward

the user has to do something different.  That is, there should be

a transition period during which users can do things either the old way

or the new way.  It would be even better if we, or a cloud provider,

could provide an abstraction that will be usable throughout the roadmap

(once that abstraction becomes available).  For example, suppose there

were a resource type OS::Heat::ReliableAutoScalingGroup that adds health

maintenance functionality (with detection by an OS::Neutron::Pool and exposure

of per-member webhooks usable by anything) to OS::Heat::AutoScalingGroup.

 Once some other way to do that maintenance becomes available, the

implementation of OS::Heat::ReliableAutoScalingGroup could switch to that

without requiring any changes to users' templates.  If at some point

in the future OS::Heat::ReliableAutoScalingGroup becomes exactly equivalent

to OS::Heat::AutoScalingGroup then we could deprecate OS::Heat::ReliableAutoScalingGroup

and, at a later time, remove it.  Even better: since health maintenance

is not logically connected to scaling group membership, make the abstraction

be simply OS::Heat::HealthyResource (i.e., it is about a single resource

regardless of whether it is a member of a scaling group) rather than OS::Heat::ReliableAutoScalingGroup.

 Question: would that abstraction (including the higher-level detection

and exposure of a re-creation webhook) be implementable (or a no-op) in the

planned future?</font>

<br>

<br><font size=2 face="sans-serif">To aid in understanding: while it may

be distasteful for a resource like HARestarter to tweak its containing

stack, the critical question is whether it will remain *possible* throughout

a transition period.  Is there an issue with such hacks being *possible*

throughout a reasonable transition period?</font>

<br>

<br><font size=2 face="sans-serif">Thanks,<br>

Mike</font>