[all][healthcheck]
Lajos Katona
katonalala at gmail.com
Thu Nov 26 12:35:42 UTC 2020
Thanks Ben,
This was again really helpful for understanding the history of this feature
and getting a glimpse of the future direction.
I think with this the Neutron community (and others, of course) can now plan.
Regards
Lajos Katona
(lajoskatona)
Ben Nemec <openstack at nemebean.com> wrote (on Wed, 25 Nov 2020, 18:51):
> I finally found where I wrote down my notes from the discussion we had
> about this with nova last cycle[0].
>
> """
> Another longstanding topic that has recently come up is a standard
> healthcheck endpoint for OpenStack services. In the process of enabling
> the existing healthcheck middleware there was some question of how the
> healthchecks should work. Currently it's a very simple check: if the API
> process is running it returns success. There is also an option to
> suppress the healthcheck based on the existence of a file, which allows a
> deployer to signal a load balancer that the API will be going down for
> maintenance.
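>
> As a rough, untested sketch of what that behaviour looks like today,
> wired around a toy WSGI app (the app, option values and sentinel path
> below are illustrative only, so double-check them against the
> oslo.middleware docs):
>
>     from oslo_middleware import healthcheck
>
>     def api(environ, start_response):
>         start_response('200 OK', [('Content-Type', 'text/plain')])
>         return [b'hello\n']
>
>     # GET /healthcheck answers 200 while the process is up, and reports
>     # the service as unavailable as soon as the sentinel file exists,
>     # so a load balancer can drain the node before maintenance.
>     application = healthcheck.Healthcheck(api, {
>         'backends': 'disable_by_file',
>         'disable_by_file_path': '/var/run/myservice/healthcheck_disable',
>     })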
>
> However, there is obviously a lot more that goes into a given service's
> health. We've been discussing how to make the healthcheck more
> comprehensive since at least the Dublin PTG, but so far no one has been
> able to commit the time to make any of these plans happen. At the Denver
> PTG ~a year ago we agreed that the first step was to enable the
> healthcheck middleware by default in all services. Some progress has
> been made on that front, but when the change was proposed to Nova, they
> asked a number of questions related to the future improvements.
>
> We revisited some of those questions at this PTG and came up with a plan
> to move forward that everyone seemed happy with. One concern was that we
> don't want to trigger resource-intensive healthchecks on unauthenticated
> calls to an API. In the original discussions the plan was to have
> healthchecks running in the background, and then the API call would just
> return the latest results of the async checks. A small modification to
> that was made in this discussion. Instead of having explicit async
> processes to gather this data, it will be collected on regular
> authenticated API calls. In this way, regularly used functionality will
> be healthchecked more frequently, whereas less used areas of the service
> will not. In addition, only authenticated users will be able to trigger
> potentially resource-intensive healthchecks.
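>
> A hypothetical sketch of that idea (none of this is an agreed interface;
> the names are made up): authenticated handlers record the outcome of the
> resources they touch, and the healthcheck endpoint only reads that
> cache, so unauthenticated calls never trigger expensive checks.
>
>     import time
>
>     # Hypothetical in-process cache of the last observed health per
>     # subsystem (e.g. 'database', 'messaging'). A real implementation
>     # would need locking and per-worker considerations.
>     _last_seen = {}
>
>     def record_health(subsystem, healthy):
>         # Called from authenticated API code paths, e.g. right after a
>         # DB query or an RPC call succeeds or fails.
>         _last_seen[subsystem] = {'healthy': healthy, 'at': time.time()}
>
>     def healthcheck_snapshot():
>         # Called by the /healthcheck endpoint; cheap and read-only.
>         return dict(_last_seen)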
>
> Each project will be responsible for implementing these checks. Since
> each project has a different architecture only they can say what
> constitutes "healthy" for their service. It's possible we could provide
> some common code for things like messaging and database that are used in
> many services, but it's likely that many projects will also need some
> custom checks.
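>
> If the existing plugin mechanism were reused for that, a shared check
> might look roughly like the untested sketch below (the class name and
> the probe are made up; the base class and result type should be
> verified against the oslo.middleware plugin docs):
>
>     from oslo_middleware.healthcheck import pluginbase
>
>     class DatabaseHealthcheck(pluginbase.HealthcheckBaseExtension):
>         # A common check a project could enable via the 'backends'
>         # option; only the probe itself would be project-specific.
>
>         def healthcheck(self, server_port):
>             ok = self._can_reach_database()
>             return pluginbase.HealthcheckResult(
>                 available=ok,
>                 reason='OK' if ok else 'database unreachable')
>
>         def _can_reach_database(self):
>             # Placeholder for the real, project-specific probe.
>             return True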
>
> I think that covers the major outcomes of this discussion, but we have
> no notes from this session so if I forgot something let me know. ;-)
> """
>
> 0: http://blog.nemebean.com/content/oslo-virtual-ptg-victoria
>
> On 11/23/20 3:28 AM, Lajos Katona wrote:
> > Hi Erno,
> > Thanks for the details, we will consider these.
> >
> > Regards
> > Lajos
> >
> > Erno Kuvaja <ekuvaja at redhat.com> wrote (on Thu, 19 Nov 2020, 18:14):
> >
> > On Mon, Nov 16, 2020 at 10:21 AM Lajos Katona <katonalala at gmail.com> wrote:
> >
> > Hi,
> >
> > I am sending this mail out to summarize the discussion around the
> > healthcheck API at the Neutron PTG, and to start a discussion on how
> > we can make it most valuable to operators.
> >
> > On the Neutron PTG etherpad this topic starts at line 114:
> > https://etherpad.opendev.org/p/neutron-wallaby-ptg
> >
> > Background: oslo_middleware provides the /healthcheck API path (see
> > [1]), which can be polled by services like haproxy, and offers a
> > plugin mechanism to add more complicated checks, which can be
> > switched on/off from config.
> >
> > The main questions:
> >
> >     * Some common guidance on what to present to the operators (if
> >       you check [2] and [3], there are really good questions/concerns
> >       in the comments)
> >         o Perhaps the API SIG has something about healthchecks; I
> >           just can't find it.
> >     * What to present with and without authentication (after checking
> >       again, I am not sure that it is possible to use authentication
> >       for the healthcheck)
> >         o A way forward could be to make it configurable, defaulting
> >           to authenticated, and leave the decision to the admin.
> >     * During the discussion the agreement was to separate the
> >       frontend health from the backend health and use direct
> >       indicators (like working DB connectivity and MQ connectivity)
> >       instead of indirect indicators (like agents' health); a sketch
> >       of such a direct check follows below.
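> >
> > A hypothetical illustration of such a direct indicator (the helper
> > name is made up and the snippet is untested); the same shape would
> > apply to an MQ connectivity probe:
> >
> >     import sqlalchemy
> >     from sqlalchemy import exc
> >
> >     def db_is_reachable(engine):
> >         # Direct indicator: can we actually run a trivial query right
> >         # now, rather than inferring health from agent heartbeats?
> >         try:
> >             with engine.connect() as conn:
> >                 conn.execute(sqlalchemy.text('SELECT 1'))
> >             return True
> >         except exc.SQLAlchemyError:
> >             return False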
> >
> > Thanks in advance for the feedback.
> >
> > [1] https://docs.openstack.org/oslo.middleware/latest/reference/healthcheck_plugins.html
> > [2] https://review.opendev.org/731396
> > [3] https://review.opendev.org/731554
> >
> > Regards
> > Lajos Katona (lajoskatona)
> >
> >
> > Hi Lajos,
> >
> > A bit of background in case you don't know it. The oslo healthcheck
> > middleware is basically a combination of the healthcheck middlewares
> > that a few projects carried ages ago, extended with a plugin
> > framework that I don't know of anyone ever adopting. The main point
> > of those middlewares, carried by Swift (I think), definitely Glance,
> > and possibly some other projects before they were moved to oslo, was
> > to give load balancers a place to ping that does not need to be
> > logged every few seconds nor send excessive amounts of auth calls to
> > keystone. If I recall correctly, you can already place it after the
> > keystone middleware if you prefer it being authenticated; I don't
> > know of anyone who does.
> >
> > Its main purpose was to provide a way to detect whether the service
> > is responding and, by using the disable-by-file option, to bleed off
> > the in-flight connections for maintenance and drop the node out of
> > the pool for new requests. I think the original implementations were
> > somewhere around 10-20 lines of code and did just that job pretty
> > reliably.
> >
> > With the plugin model it is indeed very easy to leak information out
> > of that middleware, and I think any plugins used need to take that
> > into account through careful design. I'd very much prefer not
> > breaking the current healthcheck, and its very well stabilized API
> > that has been in use for years, just because someone feels like it's
> > a good idea to make leaky plugins for it. Realizing that agent
> > status might not be the right thing to check is a good start: what
> > you really want is an indication of whether the API service is able
> > to take in new requests, not whether every permutation of those
> > requests would succeed given the current system status. There are
> > ways to chain multiples of these middlewares with different configs
> > (on different endpoints), so it might be worth considering exposing
> > your plugins with detailed failure conditions only on an admin-side
> > endpoint that is not public, and keeping a very simple yes/no on
> > your public endpoint. Good luck, and I hope you find the right
> > balance between detail from the API and usability.
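> >
> > Purely as an illustration of that split (this is not oslo.middleware
> > code; the functions and data shape are made up): the public endpoint
> > answers with a bare pass/fail, while a separately deployed,
> > admin-only endpoint returns the detailed failure conditions.
> >
> >     def public_healthcheck(checks):
> >         # Public endpoint: pass/fail only, no details leaked.
> >         ok = all(c['healthy'] for c in checks.values())
> >         return (200, 'OK') if ok else (503, 'unavailable')
> >
> >     def admin_healthcheck(checks):
> >         # Admin-only endpoint: full per-check detail for operators,
> >         # e.g. {'database': {'healthy': False, 'reason': '...'}}.
> >         return 200, checks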
> >
> > Best,
> > Erno "jokke" Kuvaja
> >
>