[all][healthcheck]

Ben Nemec openstack at nemebean.com
Wed Nov 25 17:51:24 UTC 2020


I finally found where I wrote down my notes from the discussion we had 
about this with nova last cycle[0].

"""
Another longstanding topic that has recently come up is a standard 
healthcheck endpoint for OpenStack services. In the process of enabling 
the existing healthcheck middleware there was some question of how the 
healthchecks should work. Currently it's a very simple check: if the api 
process is running it returns success. There is also an option to 
suppress the healthcheck based on the existence of a file. This allows a 
deployer to signal a loadbalancer that the api will be going down for 
maintenance.
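
For reference, the middleware is typically wired up in the service's 
paste config. A minimal sketch (the app and file names here are 
illustrative, not taken from any particular project):

    [composite:api]
    use = egg:Paste#urlmap
    /: apiapp
    /healthcheck: healthcheck

    [app:healthcheck]
    paste.app_factory = oslo_middleware:Healthcheck.app_factory
    backends = disable_by_file
    disable_by_file_path = /etc/myservice/healthcheck_disable

Touching the disable_by_file_path file flips /healthcheck from 200 to 
503, which is the signal a load balancer watches for to drain the node 
before maintenance.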

However, there is obviously a lot more that goes into a given service's 
health. We've been discussing how to make the healthcheck more 
comprehensive since at least the Dublin PTG, but so far no one has been 
able to commit the time to make any of these plans happen. At the Denver 
PTG ~a year ago we agreed that the first step was to enable the 
healthcheck middleware by default in all services. Some progress has 
been made on that front, but when the change was proposed to Nova, they 
asked a number of questions about the planned future improvements.

We revisited some of those questions at this PTG and came up with a plan 
to move forward that everyone seemed happy with. One concern was that we 
don't want to trigger resource-intensive healthchecks on unauthenticated 
calls to an API. In the original discussions the plan was to have 
healthchecks running in the background, and then the API call would just 
return the latest results of the async checks. A small modification to 
that was made in this discussion. Instead of having explicit async 
processes to gather this data, it will be collected on regular 
authenticated API calls. In this way, regularly used functionality will 
be healthchecked more frequently, whereas less used areas of the service 
will not. In addition, only authenticated users will be able to trigger 
potentially resource-intensive healthchecks.
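
To make that concrete, here is a hypothetical sketch of the idea: 
checks piggyback on authenticated calls, and the healthcheck endpoint 
only ever reports cached results. All names are illustrative; nothing 
below is an agreed implementation.

    import threading
    import time

    class HealthCache:
        """Latest result of each check, recorded as a side effect
        of regular authenticated API calls."""

        def __init__(self, max_age=300):
            self._lock = threading.Lock()
            self._results = {}  # check name -> (healthy, timestamp)
            self.max_age = max_age

        def record(self, name, healthy):
            with self._lock:
                self._results[name] = (healthy, time.time())

        def snapshot(self):
            # Stale entries are flagged rather than dropped, so a
            # rarely exercised code path reads as "stale", not
            # "healthy".
            now = time.time()
            with self._lock:
                return {name: {"healthy": healthy,
                               "stale": now - ts > self.max_age}
                        for name, (healthy, ts) in self._results.items()}

    CACHE = HealthCache()

    def records_db_health(func):
        """Wrap an authenticated handler that touches the database."""
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
            except Exception:
                # A real implementation would only catch the
                # project's DB-connectivity exceptions here.
                CACHE.record("database", False)
                raise
            CACHE.record("database", True)
            return result
        return wrapper

The unauthenticated /healthcheck handler would then just serialize 
CACHE.snapshot(), so it never triggers any expensive work itself.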

Each project will be responsible for implementing these checks. Since 
each project has a different architecture only they can say what 
constitutes "healthy" for their service. It's possible we could provide 
some common code for things like messaging and database that are used in 
many services, but it's likely that many projects will also need some 
custom checks.
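
For the common pieces, the existing plugin interface in oslo.middleware 
could host such shared checks. As a minimal sketch, a database check 
might look like this (get_engine() is a hypothetical stand-in for each 
project's own DB access):

    from oslo_middleware.healthcheck import pluginbase

    class DatabaseHealthcheck(pluginbase.HealthcheckBaseExtension):
        def healthcheck(self, server_port):
            try:
                # get_engine() is hypothetical; each project would
                # plug in its own engine facade here.
                get_engine().connect().close()
            except Exception as exc:
                return pluginbase.HealthcheckResult(
                    available=False,
                    reason="database unreachable: %s" % exc)
            return pluginbase.HealthcheckResult(
                available=True, reason="OK")

A plugin like this would be registered as a stevedore entry point in 
the oslo.middleware.healthcheck namespace and enabled through the 
middleware's "backends" option.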

I think that covers the major outcomes of this discussion, but we have 
no notes from this session so if I forgot something let me know. ;-)
"""

0: http://blog.nemebean.com/content/oslo-virtual-ptg-victoria

On 11/23/20 3:28 AM, Lajos Katona wrote:
> Hi Erno,
> Thanks for the details, we will consider these.
> 
> Regards
> Lajos
> 
> Erno Kuvaja <ekuvaja at redhat.com <mailto:ekuvaja at redhat.com>> wrote 
> (on Thu, 19 Nov 2020, 18:14):
> 
>     On Mon, Nov 16, 2020 at 10:21 AM Lajos Katona <katonalala at gmail.com
>     <mailto:katonalala at gmail.com>> wrote:
> 
>         Hi,
> 
>         I am sending this mail to summarize the discussion around the
>         healthcheck API at the Neutron PTG, and to start a discussion
>         on how we can make it most valuable to operators.
> 
>         On the Neutron PTG etherpad this topic is from L114:
>         https://etherpad.opendev.org/p/neutron-wallaby-ptg .
> 
>         Background: oslo_middleware provides the /healthcheck API path
>         (see [1]), which can be polled by services like haproxy, and
>         offers a plugin mechanism to add more complicated checks,
>         which can be switched on/off from config.
> 
>         The main questions:
> 
>           * Some common guidance on what to present to the operators
>             (if you check [2] and [3], there are really good
>             questions/concerns in the comments)
>               o Perhaps the API SIG has something about healthchecks,
>                 but I can't find it.
>           * What to present with and without authentication (after
>             checking again, I am not sure that it is possible to use
>             authentication for the healthcheck)
>               o A way forward could be to make it configurable,
>                 defaulting to authenticated, and leaving the decision
>                 to the admin.
>           * During the discussion the agreement was to separate the
>             frontend health from the backend health and use direct
>             indicators (like working db and mq connectivity) instead
>             of indirect indicators (like agents' health); a sketch of
>             such direct probes follows below.
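> 
>         A minimal sketch of such direct probes (connection URLs and
>         timeouts are illustrative; this uses SQLAlchemy and kombu
>         directly, though the real checks would presumably go through
>         oslo.db and oslo.messaging):
> 
>             import kombu
>             import sqlalchemy
> 
>             def db_available(db_url):
>                 """Direct indicator: can we open a db connection?"""
>                 try:
>                     engine = sqlalchemy.create_engine(db_url)
>                     with engine.connect() as conn:
>                         conn.execute(sqlalchemy.text("SELECT 1"))
>                     return True
>                 except Exception:
>                     return False
> 
>             def mq_available(mq_url):
>                 """Direct indicator: can we reach the message bus?"""
>                 try:
>                     with kombu.Connection(mq_url, connect_timeout=2) as conn:
>                         conn.connect()
>                     return True
>                 except Exception:
>                     return False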
> 
>         Thanks in advance for the feedback.
> 
>         [1]
>         https://docs.openstack.org/oslo.middleware/latest/reference/healthcheck_plugins.html
>         [2] https://review.opendev.org/731396
>         [3] https://review.opendev.org/731554
> 
>         Regards
>         Lajos Katona (lajoskatona)
> 
> 
>     Hi Lajos,
> 
>     Bit of background in case you don't know. The oslo healthcheck
>     middleware is basically a combination of the healthcheck
>     middlewares a few projects carried ages ago, bloated with a plugin
>     framework that I don't know of anyone ever adopting. The main
>     point of those middlewares, carried by Swift (I think), definitely
>     Glance, and possibly some other projects before they were moved
>     into oslo, was to give load balancers a place to ping that does
>     not necessarily need to be logged every few seconds, nor needs to
>     send excessive amounts of auth calls to keystone. If I recall
>     correctly you can already place it after the keystone middleware
>     if you prefer it being authed; I don't know of anyone who does.
> 
>     The main purpose was to provide a way to detect whether the
>     service is responding, and, via the disable-by-file option, to
>     bleed off the in-flight connections before maintenance and drop
>     the node out of the pool for new requests. I think the original
>     implementations were somewhere around 10-20 lines of code and did
>     just that job pretty reliably, something along the lines of the
>     sketch below.
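> 
>     (A from-memory sketch of that style of middleware, not any
>     project's actual code; the disable path is illustrative:)
> 
>         import os
> 
>         class Healthcheck(object):
>             """Tiny WSGI middleware: 200 on /healthcheck, or 503
>             once the operator touches the disable file."""
> 
>             def __init__(self, app,
>                          disable_path='/etc/myservice/healthcheck_disable'):
>                 self.app = app
>                 self.disable_path = disable_path
> 
>             def __call__(self, environ, start_response):
>                 if environ.get('PATH_INFO') != '/healthcheck':
>                     return self.app(environ, start_response)
>                 if os.path.exists(self.disable_path):
>                     start_response('503 Service Unavailable',
>                                    [('Content-Type', 'text/plain')])
>                     return [b'DISABLED BY FILE']
>                 start_response('200 OK',
>                                [('Content-Type', 'text/plain')])
>                 return [b'OK']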
> 
>     Given the plugin model, it's indeed very easy to leak information
>     out of that middleware, and I think the plugins used need to take
>     that into account by careful design. I'd very much prefer not
>     breaking the current healthcheck, and its very well stabilized API
>     that has been in use for years, just because someone feels like
>     it's a good idea to make leaky plugins for it. Realizing that
>     agent status might not be the right thing to check is a good
>     start: what you really want is an indication of whether the API
>     service is able to take in new requests, not whether every
>     permutation of those requests would succeed given the current
>     system status. Now, there are ways to chain multiples of these
>     middlewares with different configs (on different endpoints), and
>     it might be worth considering having plugins with detailed failure
>     conditions on an admin-side endpoint that is not exposed to the
>     public, and just a very simple yes/no on your public endpoint;
>     something like the config sketched below. Good luck and I hope you
>     find the right balance of detail and usability in the API.
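> 
>     (A sketch of that chaining with illustrative names; "detailed" is
>     the existing oslo.middleware option for verbose output, and the
>     admin path would still need to be locked down at the proxy or
>     firewall level:)
> 
>         [composite:api]
>         use = egg:Paste#urlmap
>         /: apiapp
>         /healthcheck: healthcheck_public
>         /healthcheck_admin: healthcheck_admin
> 
>         [app:healthcheck_public]
>         paste.app_factory = oslo_middleware:Healthcheck.app_factory
>         backends = disable_by_file
>         disable_by_file_path = /etc/myservice/healthcheck_disable
> 
>         [app:healthcheck_admin]
>         paste.app_factory = oslo_middleware:Healthcheck.app_factory
>         detailed = true
>         backends = disable_by_file
>         disable_by_file_path = /etc/myservice/healthcheck_disable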
> 
>     Best,
>     Erno "jokke" Kuvaja
> 


