[all][healthcheck]
Ben Nemec
openstack at nemebean.com
Wed Nov 25 17:51:24 UTC 2020
I finally found where I wrote down my notes from the discussion we had
about this with Nova last cycle [0].
"""
Another longstanding topic that has recently come up is a standard
healthcheck endpoint for OpenStack services. In the process of enabling
the existing healthcheck middleware there was some question of how the
healthchecks should work. Currently it's a very simple check: if the API
process is running, it returns success. There is also an option to
suppress the healthcheck based on the existence of a file, which allows
a deployer to signal to a load balancer that the API will be going down
for maintenance.
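
For reference, wiring the middleware into a WSGI pipeline looks roughly
like this (a sketch: the disable_by_file backend and its option names
are from the oslo.middleware docs, but the file path is illustrative):

    from oslo_middleware import healthcheck

    def wrap_with_healthcheck(app):
        # Responds at /healthcheck; starts returning 503 once the
        # sentinel file exists, so a load balancer can drain the API
        # before maintenance.
        return healthcheck.Healthcheck(app, {
            'path': '/healthcheck',
            'backends': 'disable_by_file',
            'disable_by_file_path': '/var/run/myservice/healthcheck_disable',
        })
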
However, there is obviously a lot more that goes into a given service's
health. We've been discussing how to make the healthcheck more
comprehensive since at least the Dublin PTG, but so far no one has been
able to commit the time to make any of these plans happen. At the Denver
PTG ~a year ago we agreed that the first step was to enable the
healthcheck middleware by default in all services. Some progress has
been made on that front, but when the change was proposed to Nova, they
asked a number of questions about the planned future improvements.
We revisited some of those questions at this PTG and came up with a plan
to move forward that everyone seemed happy with. One concern was that we
don't want to trigger resource-intensive healthchecks on unauthenticated
calls to an API. In the original discussions the plan was to have
healthchecks running in the background, and then the API call would just
return the latest results of the async checks. A small modification to
that was made in this discussion. Instead of having explicit async
processes to gather this data, it will be collected on regular
authenticated API calls. In this way, regularly used functionality will
be healthchecked more frequently, whereas less-used areas of the service
will not. In addition, only authenticated users will be able to trigger
potentially resource-intensive healthchecks.
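
As a rough sketch of the idea (all names here are made up, not an
agreed interface): checks piggyback on authenticated requests, and the
endpoint only reports cached results:

    import threading
    import time

    _results = {}   # check name -> (healthy, timestamp)
    _lock = threading.Lock()

    def record_check(name, healthy):
        # Called from regular authenticated API requests, e.g. right
        # after a successful (or failed) DB or MQ interaction.
        with _lock:
            _results[name] = (healthy, time.time())

    def healthcheck_view(max_age=300):
        # The /healthcheck endpoint never runs expensive checks itself;
        # it only reports what authenticated traffic has observed.
        now = time.time()
        with _lock:
            stale = [n for n, (_, ts) in _results.items()
                     if now - ts > max_age]
            bad = [n for n, (ok, _) in _results.items() if not ok]
        return (not bad and not stale), {'unhealthy': bad, 'stale': stale}
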
Each project will be responsible for implementing these checks. Since
each project has a different architecture only they can say what
constitutes "healthy" for their service. It's possible we could provide
some common code for things like messaging and database that are used in
many services, but it's likely that many projects will also need some
custom checks.
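
For example, a shared database check might look something like this
(the plugin interface is approximated from oslo.middleware's pluginbase
module, and the 'connection' option is hypothetical; treat this as a
sketch, not settled code):

    import sqlalchemy
    from oslo_middleware.healthcheck import pluginbase

    class DatabaseHealthcheck(pluginbase.HealthcheckBaseExtension):
        """Common check: can we connect and run a trivial query?"""

        def healthcheck(self, server_port):
            try:
                # 'connection' is a hypothetical option for this sketch.
                engine = sqlalchemy.create_engine(
                    self.conf.get('connection'))
                with engine.connect() as conn:
                    conn.execute(sqlalchemy.text('SELECT 1'))
            except Exception as exc:
                return pluginbase.HealthcheckResult(
                    available=False, reason='DB unreachable: %s' % exc)
            return pluginbase.HealthcheckResult(
                available=True, reason='OK')
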
I think that covers the major outcomes of this discussion, but we have
no notes from this session so if I forgot something let me know. ;-)
"""
0: http://blog.nemebean.com/content/oslo-virtual-ptg-victoria
On 11/23/20 3:28 AM, Lajos Katona wrote:
> Hi Erno,
> Thanks for the details, we will consider these.
>
> Regards
> Lajos
>
> Erno Kuvaja <ekuvaja at redhat.com> wrote (on Thu, 19 Nov 2020 at
> 18:14):
>
> On Mon, Nov 16, 2020 at 10:21 AM Lajos Katona
> <katonalala at gmail.com> wrote:
>
> Hi,
>
> I am sending this mail to summarize the discussion around the
> healthcheck API at the Neutron PTG, and to start a discussion on
> how we can make it most valuable to operators.
>
> On the Neutron PTG etherpad this topic starts at line 114:
> https://etherpad.opendev.org/p/neutron-wallaby-ptg
>
> Background: oslo_middleware provides the /healthcheck API path (see
> [1]), which can be polled by services like HAProxy, and offers a
> plugin mechanism to add more complicated checks, which can be
> switched on/off from config.
>
> The main questions:
>
> * Some common guidance on what to present to operators (if
> you check [2] and [3], there are really good
> questions/concerns in the comments)
> o Perhaps the API SIG has something about healthchecks;
> I just can't find it.
> * What to present with and without authentication (after
> checking again, I am not sure that it is possible to use
> authentication for the healthcheck)
> o A way forward could be to make it configurable,
> defaulting to authenticated, and leave the decision to
> the admin.
> * During the discussion the agreement was to separate the
> frontend health from the backend health and use direct
> indicators (like working DB and MQ connectivity) instead
> of indirect indicators (like agents' health); see the
> sketch after this list.
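>
> A direct MQ indicator, for example, could be as simple as checking
> that the broker accepts TCP connections. A minimal sketch (hostname
> and port are illustrative, not an agreed implementation):
>
>     import socket
>
>     def mq_reachable(host='controller', port=5672, timeout=2.0):
>         # Direct indicator: can we open a TCP connection to the
>         # message broker? Unlike agent status, this says nothing
>         # about consumers, only about our own path to the MQ.
>         try:
>             with socket.create_connection((host, port),
>                                           timeout=timeout):
>                 return True
>         except OSError:
>             return False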
>
> Thanks in advance for the feedback.
>
> [1]
> https://docs.openstack.org/oslo.middleware/latest/reference/healthcheck_plugins.html
> [2] https://review.opendev.org/731396
> [3] https://review.opendev.org/731554
>
> Regards
> Lajos Katona (lajoskatona)
>
>
> Hi Lajos,
>
> Bit of background in case you don't know: the oslo healthcheck
> middleware is basically a combination of the healthcheck middlewares
> carried within a few projects ages ago, bloated with a plugin
> framework that I don't know of anyone ever adopting. The main point
> of those middlewares, carried by Swift (I think), definitely Glance,
> and possibly some other projects before they moved to oslo, was to
> give load balancers a place to ping that does not necessarily need
> to be logged every few seconds nor generate excessive amounts of
> auth calls to Keystone. If I recall correctly you can already place
> it after the keystone middleware if you prefer it being authed; I
> don't know of anyone who does.
>
> The main purpose was to provide a way to detect if the service is
> not responding, or, by using the disable-by-file option, to bleed
> off the in-flight connections for maintenance and drop the service
> out of the pool for new requests. I think the original
> implementations were somewhere around 10-20 lines of code and did
> just that job pretty reliably.
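>
> Something in the spirit of those originals, from memory rather than
> any project's actual code:
>
>     import os
>
>     class SimpleHealthcheck(object):
>         """Answer /healthcheck with no auth and no logging; return
>         503 once the disable file appears so the load balancer
>         drops us from the pool."""
>
>         def __init__(self, app, disable_path='/var/run/svc/disable'):
>             self.app = app
>             self.disable_path = disable_path
>
>         def __call__(self, environ, start_response):
>             if environ.get('PATH_INFO') != '/healthcheck':
>                 return self.app(environ, start_response)
>             if os.path.exists(self.disable_path):
>                 start_response('503 Service Unavailable',
>                                [('Content-Type', 'text/plain')])
>                 return [b'DISABLED BY FILE']
>             start_response('200 OK', [('Content-Type', 'text/plain')])
>             return [b'OK']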
>
> Based on the plugin model, it's indeed very easy to leak information
> out of that middleware, and I think the plugins used need to take
> that into account by careful design. I'd very much prefer not
> breaking the current healthcheck and its very well stabilized API,
> which has been in use for years, just because someone feels like
> it's a good idea to make leaky plugins for it. Realizing that agent
> status might not be the right thing to check is a good start: what
> you really want is an indication of whether the API service is able
> to take in new requests, not whether every permutation of those
> requests will succeed given the current system status. Now, there
> are ways to chain multiples of these middlewares with different
> configs (on different endpoints), and it might be worth considering
> having your plugins with detailed failure conditions on an admin
> endpoint that is not exposed to the public, and just a very simple
> yea/nay on your public endpoint. Good luck, and I hope you find the
> correct balance of detail from the API and usability.
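>
> For illustration, that chaining could look roughly like this (a
> sketch; the 'path' and 'detailed' options are as I remember them
> from oslo.middleware, so double-check before relying on this):
>
>     from oslo_middleware import healthcheck
>
>     def wrap(app):
>         # Inner: detailed results, published only on the
>         # admin/internal endpoint, never on the public VIP.
>         app = healthcheck.Healthcheck(app, {'path': '/healthcheck/admin',
>                                             'detailed': True})
>         # Outer: plain yea/nay for the public endpoint and the LB.
>         return healthcheck.Healthcheck(app, {'path': '/healthcheck'})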
>
> Best,
> Erno "jokke" Kuvaja
>