<font size=2 face="sans-serif">Background: Health maintenance is very
important to users, and I have users who want to do it now and into the
future. Today a Heat user can write a template that maintains the
health of a resource R. The detection of a health problem can be
done by anything that hits a webhook. That generality is important:
it is not sufficient to determine health by looking at what physical and/or
virtual resources exist; it is also highly desirable to test whether these
things are functioning well (e.g., the URL-based health checking possible
through an OS::Neutron::Pool, or an external system of the user's own
that detects health problems). The webhook is provided by an OS::Heat::HARestarter
(note the name bug: such a thing does not restart anything; rather, it deletes
and re-creates a given resource and all its dependents), which here deletes and
re-creates R and its health detection/recovery wiring. For a more
specific example, consider the case of detection using the services of
an OS::Neutron::Pool. Note that it is not even necessary for there
to be workload traffic through the associated OS::Neutron::LoadBalancer;
all we are using here is the monitoring prescribed by the Pool's OS::Neutron::HealthMonitor.
The user's template has, in addition to R, three things: (1) an OS::Neutron::PoolMember
that puts R in the Pool, (2) an OS::Heat::HARestarter that deletes and
re-creates R and all its dependents, and (3) a Ceilometer alarm that detects
when Neutron is reporting that the PoolMember is unhealthy and responds
by hitting the HARestarter's webhook. Note that all three of those
are dependent on R, and thus are deleted and re-created when the HARestarter's
webhook is hit; this avoids most of the noted issues with HARestarter.
R can be a stack that includes both a Nova server and an OS::Neutron::Port,
to work around a Nova bug with implicit ports.</font>
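<br>
<br><font size=2 face="sans-serif">To make the wiring concrete, here is a minimal sketch of such a template, not a tested one. It assumes R is a plain OS::Nova::Server and that the Pool (with its HealthMonitor) already exists and is passed in by ID. The parameter names, the alarm numbers, and in particular the meter used for detection are placeholders of mine; which Ceilometer meter actually reflects the PoolMember's health depends on the deployment.</font>

```yaml
heat_template_version: 2013-05-23

description: >
  Sketch: health maintenance for one resource R (a Nova server), with
  detection through an existing Neutron LBaaS pool.

parameters:
  image:
    type: string
  flavor:
    type: string
  pool_id:
    type: string
    description: Existing OS::Neutron::Pool that has a HealthMonitor attached
  member_port:
    type: number
    default: 80
  unhealthy_meter:
    type: string
    description: >
      ASSUMPTION: name of a Ceilometer meter that reflects the PoolMember
      health reported by Neutron. Deployment-specific placeholder.

resources:
  # R: the resource whose health is maintained.
  server:
    type: OS::Nova::Server
    properties:
      image: {get_param: image}
      flavor: {get_param: flavor}

  # (1) Puts R in the Pool, so the Pool's HealthMonitor probes it.
  member:
    type: OS::Neutron::PoolMember
    properties:
      pool_id: {get_param: pool_id}
      address: {get_attr: [server, first_address]}
      protocol_port: {get_param: member_port}

  # (2) Provides the webhook; deletes and re-creates R and its dependents.
  restarter:
    type: OS::Heat::HARestarter
    properties:
      InstanceId: {get_resource: server}

  # (3) Hits the webhook when the member is reported unhealthy.
  unhealthy_alarm:
    type: OS::Ceilometer::Alarm
    properties:
      meter_name: {get_param: unhealthy_meter}
      # ASSUMPTION: statistic, period, threshold and operator are
      # placeholders; they depend on how the chosen meter encodes health.
      statistic: avg
      period: 60
      evaluation_periods: 1
      threshold: 1
      comparison_operator: lt
      # ASSUMPTION: scoping the alarm to this member via resource_id.
      matching_metadata: {resource_id: {get_resource: member}}
      alarm_actions:
        - {get_attr: [restarter, AlarmUrl]}
```

<font size=2 face="sans-serif">The member, the restarter, and the alarm all reference the server directly or indirectly, so all three depend on R and are re-created along with it when the webhook is hit.</font>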
<br>
<br><font size=2 face="sans-serif">There is a movement afoot to remove
HARestarter. My concern is what users can do, now and in the future.
The first and most basic issue is this: at every step in the roadmap,
it must be possible for users to accomplish health maintenance. The
second issue is easing the impact on what users write. It would be
pretty bad if the roadmap looked like this: before point X, users can only
accomplish health maintenance as I outlined above, and from point X onward
the user has to do something different. That is, there should be
a transition period during which users can do things either the old way
or the new way. It would be even better if we, or a cloud provider,
could provide an abstraction that will be usable throughout the roadmap
(once that abstraction becomes available). For example, suppose there
were a resource type OS::Heat::ReliableAutoScalingGroup that adds health
maintenance functionality (with detection by an OS::Neutron::Pool and exposure
of per-member webhooks usable by anything) to OS::Heat::AutoScalingGroup.
Once some other way to do that maintenance becomes available, the
implementation of OS::Heat::ReliableAutoScalingGroup could switch to that
without requiring any changes to users' templates. If at some point
in the future OS::Heat::ReliableAutoScalingGroup becomes exactly equivalent
to OS::Heat::AutoScalingGroup then we could deprecate OS::Heat::ReliableAutoScalingGroup
and, at a later time, remove it. Even better: since health maintenance
is not logically connected to scaling group membership, make the abstraction
be simply OS::Heat::HealthyResource (i.e., it is about a single resource
regardless of whether it is a member of a scaling group) rather than OS::Heat::ReliableAutoScalingGroup.
Question: would that abstraction (including the higher level detection
and exposure of re-creation webhook) be implementable (or a no-op) in the
planned future?</font>
<br>
<br><font size=2 face="sans-serif">To aid in understanding: while it may
be distasteful for a resource like HARestarter to tweak its containing
stack, the critical question is whether it will remain *possible* throughout
a transition period. Is there an issue with such hacks being *possible*
throughout a reasonable transition period?</font>
<br>
<br><font size=2 face="sans-serif">Thanks,<br>
Mike</font>