[openstack-dev] [heat] Confused about the future of health maintenance and OS::Heat::HARestarter
Mike Spreitzer
mspreitz at us.ibm.com
Wed Sep 17 13:57:24 UTC 2014
Background: Health maintenance is very important to users, and I have
users who want to do it now and into the future. Today a Heat user can
write a template that maintains the health of a resource R. The detection
of a health problem can be done by anything that hits a webhook. That
generality is important: it is not sufficient to determine health by
looking at which physical and/or virtual resources exist; it is also highly
desirable to test whether those resources are functioning well (e.g., the
URL-based health checking possible through an OS::Neutron::Pool, or the
user's own external system that detects health problems). The
webhook is provided by an OS::Heat::HARestarter, which deletes and
re-creates R and its health detection/recovery wiring (note the name bug:
such a thing does not restart anything; rather, it deletes and re-creates a
given resource and all its dependents). For a more specific example, consider
the case of detection using the services of an OS::Neutron::Pool. Note
that it is not even necessary for there to be workload traffic through the
associated OS::Neutron::LoadBalancer; all we are using here is the
monitoring prescribed by the Pool's OS::Neutron::HealthMonitor. The
user's template has, in addition to R, three things: (1) an
OS::Neutron::PoolMember that puts R in the Pool, (2) an
OS::Heat::HARestarter that deletes and re-creates R and all its
dependents, and (3) a Ceilometer alarm that detects when Neutron is
reporting that the PoolMember is unhealthy and responds by hitting the
HARestarter's webhook. Note that all three of those are dependent on R,
and thus are deleted and re-created when the HARestarter's webhook is hit;
this avoids most of the noted issues with HARestarter. R can be a stack
that includes both a Nova server and an OS::Neutron::Port, to work around
a Nova bug with implicit ports.
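
To make the "everything dependent on R is re-created" point concrete, here is a minimal Python sketch of the dependency reasoning. It is not Heat code. The resource names and the exact edges are assumptions for illustration (in particular, that the alarm depends on R through the PoolMember and the HARestarter); the only claim taken from the description above is that all three extra resources depend on R.

```python
from collections import defaultdict

# Hypothetical resource names. Each entry reads "key depends on values".
DEPENDS_ON = {
    "R": [],
    "pool_member": ["R"],                  # puts R in the Pool
    "restarter": ["R"],                    # provides the webhook
    "alarm": ["pool_member", "restarter"],  # assumed edges; watches the member, hits the webhook
}


def recreated_by_webhook(target, depends_on):
    """Return the target plus everything that transitively depends on it."""
    # Invert the mapping so we can walk from a resource to its dependents.
    dependents = defaultdict(set)
    for resource, requirements in depends_on.items():
        for requirement in requirements:
            dependents[requirement].add(resource)

    affected, stack = set(), [target]
    while stack:
        current = stack.pop()
        if current not in affected:
            affected.add(current)
            stack.extend(dependents[current])
    return affected


print(sorted(recreated_by_webhook("R", DEPENDS_ON)))
```

Under these assumed edges the result contains R, the PoolMember, the HARestarter, and the alarm, which is the property that lets the health wiring be rebuilt together with R.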
There is a movement afoot to remove HARestarter. My concern is what
users can do, now and into the future. The first and most basic issue is
this: at every step in the roadmap, it must be possible for users to
accomplish health maintenance. The second issue is easing the impact on
what users write. It would be pretty bad if the roadmap looks like this:
before point X, users can only accomplish health maintenance as I outlined
above, and from point X onward the user has to do something different.
That is, there should be a transition period during which users can do
things either the old way or the new way. It would be even better if we,
or a cloud provider, could provide an abstraction that will be usable
throughout the roadmap (once that abstraction becomes available). For
example, there could be a resource type OS::Heat::ReliableAutoScalingGroup
that adds health maintenance functionality (with detection by an
OS::Neutron::Pool and exposure of per-member webhooks usable by anything)
to OS::Heat::AutoScalingGroup. Once some other way to do that maintenance
becomes available, the implementation of
OS::Heat::ReliableAutoScalingGroup could switch to that without requiring
any changes to users' templates. If at some point in the future
OS::Heat::ReliableAutoScalingGroup becomes exactly equivalent to
OS::Heat::AutoScalingGroup then we could deprecate
OS::Heat::ReliableAutoScalingGroup and, at a later time, remove it. Even
better: since health maintenance is not logically connected to scaling
group membership, make the abstraction be simply OS::Heat::HealthyResource
(i.e., it is about a single resource regardless of whether it is a member
of a scaling group) rather than OS::Heat::ReliableAutoScalingGroup.
Question: would that abstraction (including the higher-level detection and
exposure of a re-creation webhook) be implementable (or a no-op) in the
planned future?
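
As a rough illustration of what "usable throughout the roadmap" could mean, the following Python sketch keeps the user-facing webhook fixed while the recovery mechanism behind it is swapped. Everything here is hypothetical: the class name simply mirrors the proposed OS::Heat::HealthyResource, and `future_mechanism` is a placeholder, since the replacement for HARestarter is not specified in this message.

```python
from typing import Callable, List

# A recovery strategy takes a resource name and returns the steps it performed.
RecoveryStrategy = Callable[[str], List[str]]


def delete_and_recreate(resource: str) -> List[str]:
    """Stand-in for today's HARestarter-style behaviour."""
    return [f"delete {resource}", f"create {resource}"]


def future_mechanism(resource: str) -> List[str]:
    """Placeholder for whatever replaces HARestarter; details unknown."""
    return [f"recover {resource}"]


class HealthyResource:
    """Hypothetical wrapper: a stable webhook in front of a swappable strategy."""

    def __init__(self, resource: str, strategy: RecoveryStrategy) -> None:
        self.resource = resource
        self._strategy = strategy

    def webhook(self) -> List[str]:
        # Any detector (alarm, external monitor) calls this the same way,
        # regardless of which strategy is configured underneath.
        return self._strategy(self.resource)


old_way = HealthyResource("R", delete_and_recreate)
new_way = HealthyResource("R", future_mechanism)
print(old_way.webhook())
print(new_way.webhook())
```

The point of the sketch is only that callers of `webhook()` do not change when the strategy does, which is the property asked for of the template-level abstraction.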
To aid in understanding: while it may be distasteful for a resource like
HARestarter to tweak its containing stack, the critical question is
whether it will remain *possible* throughout a transition period. Is
there an issue with such hacks being *possible* throughout a reasonable
transition period?
Thanks,
Mike