[all][healthcheck]

Lajos Katona katonalala at gmail.com
Thu Nov 26 12:35:42 UTC 2020


Thanks Ben,

This was again really helpful to understand the history of this feature,
and have a glimpse of the future direction.

I think with this, the Neutron community (and others, of course) can now plan.

Regards
Lajos Katona
(lajoskatona)

Ben Nemec <openstack at nemebean.com> wrote (on Wed, Nov 25, 2020,
18:51):

> I finally found where I wrote down my notes from the discussion we had
> about this with nova last cycle[0].
>
> """
> Another longstanding topic that has recently come up is a standard
> healthcheck endpoint for OpenStack services. In the process of enabling
> the existing healthcheck middleware there was some question of how the
> healthchecks should work. Currently it's a very simple check: if the api
> process is running it returns success. There is also an option to
> suppress the healthcheck based on the existence of a file. This allows a
> deployer to signal a loadbalancer that the api will be going down for
> maintenance.
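
To make the "process is up, unless a marker file exists" behaviour concrete, here is a minimal sketch using only the Python standard library. It is not the oslo.middleware implementation (which provides this as its disable_by_file backend); the names make_healthcheck_app and call, the marker file name, and the response bodies are all made up for illustration.

```python
import os
import tempfile
from wsgiref.util import setup_testing_defaults


def make_healthcheck_app(disable_file_path):
    """Return a tiny WSGI app: 200 while running, 503 if the marker file exists."""

    def app(environ, start_response):
        # The only "check" is whether the deployer has dropped the marker file.
        if os.path.exists(disable_file_path):
            status, body = "503 Service Unavailable", b"DISABLED BY FILE"
        else:
            status, body = "200 OK", b"OK"
        start_response(status, [("Content-Type", "text/plain"),
                                ("Content-Length", str(len(body)))])
        return [body]

    return app


def call(app):
    """Invoke the WSGI app once and return (status, body). Hypothetical helper."""
    environ = {}
    setup_testing_defaults(environ)
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body


# Demo: healthy first, then "going down for maintenance".
with tempfile.TemporaryDirectory() as tmp:
    marker = os.path.join(tmp, "healthcheck_disable")  # assumed file name
    app = make_healthcheck_app(marker)
    print(call(app))
    open(marker, "w").close()  # deployer signals the load balancer
    print(call(app))
```

A load balancer polling this endpoint would stop sending new requests once it sees the 503, while requests already in flight can finish.
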
>
> However, there is obviously a lot more that goes into a given service's
> health. We've been discussing how to make the healthcheck more
> comprehensive since at least the Dublin PTG, but so far no one has been
> able to commit the time to make any of these plans happen. At the Denver
> PTG ~a year ago we agreed that the first step was to enable the
> healthcheck middleware by default in all services. Some progress has
> been made on that front, but when the change was proposed to Nova, they
> asked a number of questions related to the future improvements.
>
> We revisited some of those questions at this PTG and came up with a plan
> to move forward that everyone seemed happy with. One concern was that we
> don't want to trigger resource-intensive healthchecks on unauthenticated
> calls to an API. In the original discussions the plan was to have
> healthchecks running in the background, and then the API call would just
> return the latest results of the async checks. A small modification to
> that was made in this discussion. Instead of having explicit async
> processes to gather this data, it will be collected on regular
> authenticated API calls. In this way, regularly used functionality will
> be healthchecked more frequently, whereas less used areas of the service
> will not. In addition, only authenticated users will be able to trigger
> potentially resource intensive healthchecks.
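
The idea of collecting health data on regular authenticated calls and only reporting the cached result on the unauthenticated endpoint could look roughly like this. It is a sketch under my own assumptions, not code from any project: HealthRegistry, handle_authenticated_request and handle_healthcheck are hypothetical names, and treating "no data yet" as healthy is just one possible choice.

```python
import threading
import time


class HealthRegistry:
    """Keeps the latest outcome of each named check (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._results = {}  # name -> (ok, timestamp)

    def record(self, name, ok):
        with self._lock:
            self._results[name] = (bool(ok), time.time())

    def snapshot(self):
        with self._lock:
            return dict(self._results)

    def healthy(self):
        # Assumption: with no recorded results yet, the service counts as healthy.
        return all(ok for ok, _ in self.snapshot().values())


def handle_authenticated_request(registry, check_name, do_work):
    """Run a normal API operation and record whether it succeeded."""
    try:
        result = do_work()
    except Exception:
        registry.record(check_name, False)
        raise
    registry.record(check_name, True)
    return result


def handle_healthcheck(registry):
    """Unauthenticated endpoint: reports cached state, never runs a check."""
    return (200, "OK") if registry.healthy() else (503, "UNHEALTHY")
```

Since the healthcheck handler only reads the registry, an unauthenticated caller cannot trigger expensive work; frequently used code paths refresh their entries often, rarely used ones may be stale, which matches the trade-off described above.
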
>
> Each project will be responsible for implementing these checks. Since
> each project has a different architecture only they can say what
> constitutes "healthy" for their service. It's possible we could provide
> some common code for things like messaging and database that are used in
> many services, but it's likely that many projects will also need some
> custom checks.
>
> I think that covers the major outcomes of this discussion, but we have
> no notes from this session so if I forgot something let me know. ;-)
> """
>
> 0: http://blog.nemebean.com/content/oslo-virtual-ptg-victoria
>
> On 11/23/20 3:28 AM, Lajos Katona wrote:
> > Hi Erno,
> > Thanks for the details, we will consider these.
> >
> > Regards
> > Lajos
> >
> > Erno Kuvaja <ekuvaja at redhat.com <mailto:ekuvaja at redhat.com>> wrote
> > (on Thu, Nov 19, 2020, 18:14):
> >
> >     On Mon, Nov 16, 2020 at 10:21 AM Lajos Katona <katonalala at gmail.com
> >     <mailto:katonalala at gmail.com>> wrote:
> >
> >         Hi,
> >
> >         I am sending this mail to summarize the discussion about the
> >         Healthcheck API at the Neutron PTG, and to start a discussion on
> >         how we can make it most valuable to operators.
> >
> >         On the Neutron PTG etherpad this topic is from L114:
> >         https://etherpad.opendev.org/p/neutron-wallaby-ptg .
> >
> >         Background: oslo_middleware provides a /healthcheck API path
> >         (see [1]), which services like haproxy can poll, and a plugin
> >         mechanism for adding more complicated checks, which can be
> >         switched on/off from config.
> >
> >         The main questions:
> >
> >           * Some common guidance on what to present to the operators
> >             (the comments on [2] and [3] contain really good
> >             questions/concerns)
> >               o Perhaps the API SIG has something about healthcheck,
> >                 but I can't find it.
> >           * What to present with and without authentication (after
> >             checking again, I am not sure that it is possible to use
> >             authentication for the healthcheck)
> >               o A way forward could be to make it configurable,
> >                 defaulting to authenticated, and leave the decision to
> >                 the admin.
> >           * During the discussion the agreement was to separate the
> >             frontend health from the backend health and use direct
> >             indicators (like working db connectivity, and mq
> >             connectivity) instead of indirect indicators (like agents'
> >             health).
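
As an illustration of "direct indicators", the sketch below runs a trivial query against a database connection and treats any exception as a failed check. It uses sqlite3 only so that it is self-contained; check_db, run_checks and broken_mq are hypothetical names, and the failing mq check merely stands in for a real broker connectivity test.

```python
import sqlite3


def check_db(conn):
    """Direct indicator: can a trivial query run on this connection?"""
    conn.execute("SELECT 1").fetchone()


def run_checks(checks):
    """Run each named check; a check passes if it returns without raising."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = True
        except Exception:
            results[name] = False
    return results


def broken_mq():
    # Stand-in for a message queue connectivity test that fails.
    raise ConnectionError("broker unreachable")


conn = sqlite3.connect(":memory:")
print(run_checks({"db": lambda: check_db(conn), "mq": broken_mq}))
# prints {'db': True, 'mq': False}
```

Agent health would be an indirect indicator in this picture: it can be bad while the API itself is still able to reach its database and message queue.
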
> >
> >         Thanks in advance for the feedback.
> >
> >         [1] https://docs.openstack.org/oslo.middleware/latest/reference/healthcheck_plugins.html
> >         [2] https://review.opendev.org/731396
> >         [3] https://review.opendev.org/731554
> >
> >         Regards
> >         Lajos Katona (lajoskatona)
> >
> >
> >     Hi Lajos,
> >
> >     A bit of background in case you don't know. The oslo healthcheck
> >     middleware is basically a combination of the healthcheck
> >     middlewares that a few projects carried ages ago, bloated with a
> >     plugin framework that I don't know of anyone ever adopting. The
> >     main point of those middlewares, carried by Swift (I think),
> >     Glance definitely, and possibly some other projects before they
> >     were moved into oslo, was to give load balancers a place to ping
> >     that does not need to be logged every few seconds and does not send
> >     excessive amounts of auth calls to keystone. If I recall correctly,
> >     you can already place it after the keystone middleware if you
> >     prefer it authenticated; I don't know of anyone who does.
> >
> >     The main purpose was to provide a way to detect that the service is
> >     not responding or, by using the disable-by-file option, to drain the
> >     in-flight connections for maintenance and take the service out of
> >     the pool for new requests. I think the original implementations
> >     were somewhere around 10-20 lines of code and did just that job
> >     pretty reliably.
> >
> >     Based on the plugin model, it's indeed very easy to leak information
> >     out of that middleware and I think the plugins used need to take
> >     that into account by careful design. I'd very much prefer not
> >     breaking the current healthcheck and the very well stabilized API of
> >     it that has been in use for years just because someone feels like
> >     it's a good idea to make leaky plugins for it. Realizing that agent
> >     status might not be the right thing to check is a good start. What
> >     you really want is an indication of whether the API service is able
> >     to take in new requests, not whether all permutations of those
> >     requests will succeed under the current system status. There are
> >     ways to chain several of these middlewares with different configs
> >     (on different endpoints), and it might be worth considering keeping
> >     your plugins with detailed failure conditions on the admin side,
> >     which is not exposed to the public, and offering just a very simple
> >     yea/nay on your public endpoint. Good luck, and I hope you find the
> >     correct balance between detail from the API and usability.
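
The split between a detailed admin-side view and a bare yes/no public view could be sketched as two renderers over the same check results. This is only an illustration with hypothetical names (render_public, render_admin) and made-up response bodies; it does not show how oslo.middleware instances are actually chained or configured.

```python
import json


def render_public(results):
    """Public endpoint: status code and a fixed word, no details leaked."""
    ok = all(results.values())
    return (200 if ok else 503), ("OK" if ok else "UNAVAILABLE")


def render_admin(results):
    """Admin-side endpoint: same verdict plus the per-check breakdown."""
    ok = all(results.values())
    body = json.dumps({"healthy": ok, "checks": results}, sort_keys=True)
    return (200 if ok else 503), body


sample = {"db": True, "mq": False}  # assumed check names
print(render_public(sample))
print(render_admin(sample))
```

Both endpoints agree on the status code, so a load balancer can use either; only the admin one reveals which check failed.
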
> >
> >     Best,
> >     Erno "jokke" Kuvaja
> >
>