<div dir="ltr">Thanks Ben,<div><br></div><div>This was again really helpful for understanding the history of this feature and getting a glimpse of the future direction.</div><div><br></div><div>I think with this the Neutron community (and others, of course) can plan now.</div><div><br></div><div>Regards</div><div>Lajos Katona</div><div>(lajoskatona)</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Ben Nemec <<a href="mailto:openstack@nemebean.com">openstack@nemebean.com</a>> wrote (on Wed, Nov 25, 2020, 18:51):<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I finally found where I wrote down my notes from the discussion we had <br>

about this with nova last cycle[0].<br>

<br>

"""<br>

Another longstanding topic that has recently come up is a standard <br>

healthcheck endpoint for OpenStack services. In the process of enabling <br>

the existing healthcheck middleware there was some question of how the <br>

healthchecks should work. Currently it's a very simple check: if the api <br>

process is running it returns success. There is also an option to <br>

suppress the healthcheck based on the existence of a file. This allows a <br>

deployer to signal a loadbalancer that the api will be going down for <br>

maintenance.<br>
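<br>
To make the current behaviour concrete, here is a minimal sketch of that kind of check as plain WSGI middleware.<br>
It is not the oslo.middleware implementation; the class name, the disable_file parameter and the response bodies are assumptions made for illustration.<br>

```python
import os
import tempfile


class SimpleHealthcheck:
    """WSGI middleware: answer /healthcheck itself, pass everything else on."""

    def __init__(self, app, path="/healthcheck", disable_file=None):
        self.app = app
        self.path = path
        # Hypothetical flag file a deployer touches before maintenance.
        self.disable_file = disable_file

    def __call__(self, environ, start_response):
        if environ.get("PATH_INFO") != self.path:
            return self.app(environ, start_response)
        if self.disable_file and os.path.exists(self.disable_file):
            # Tell the load balancer to stop sending new requests here.
            status, body = "503 Service Unavailable", b"DISABLED BY FILE"
        else:
            # Reaching this point only proves the API process is running.
            status, body = "200 OK", b"OK"
        start_response(status, [("Content-Type", "text/plain"),
                                ("Content-Length", str(len(body)))])
        return [body]


def api_app(environ, start_response):
    """Stand-in for the real API application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"api"]


def call(app, path):
    """Drive a WSGI app once and return (status, body)."""
    seen = {}

    def start_response(status, headers):
        seen["status"] = status

    body = b"".join(app({"PATH_INFO": path}, start_response))
    return seen["status"], body
```

Creating the flag file flips the answer to 503 without touching the API process, which is what lets a load balancer drain the node.<br>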

<br>

However, there is obviously a lot more that goes into a given service's <br>

health. We've been discussing how to make the healthcheck more <br>

comprehensive since at least the Dublin PTG, but so far no one has been <br>

able to commit the time to make any of these plans happen. At the Denver <br>

PTG ~a year ago we agreed that the first step was to enable the <br>

healthcheck middleware by default in all services. Some progress has <br>

been made on that front, but when the change was proposed to Nova, they <br>

asked a number of questions related to the future improvements.<br>

<br>

We revisited some of those questions at this PTG and came up with a plan <br>

to move forward that everyone seemed happy with. One concern was that we <br>

don't want to trigger resource-intensive healthchecks on unauthenticated <br>

calls to an API. In the original discussions the plan was to have <br>

healthchecks running in the background, and then the API call would just <br>

return the latest results of the async checks. A small modification to <br>

that was made in this discussion. Instead of having explicit async <br>

processes to gather this data, it will be collected on regular <br>

authenticated API calls. In this way, regularly used functionality will <br>

be healthchecked more frequently, whereas less used areas of the service <br>

will not. In addition, only authenticated users will be able to trigger <br>

potentially resource intensive healthchecks.<br>
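<br>
A rough sketch of that idea, under stated assumptions: authenticated handlers record the outcome of the work they were doing anyway, and the healthcheck endpoint only reads the recorded results.<br>
HealthRegistry, list_networks, the max_age staleness window and the choice to report stale entries as "unknown" (without failing the check) are all hypothetical, not part of any existing OpenStack API.<br>

```python
import time


class HealthRegistry:
    """Holds the latest outcome per component, recorded during normal work."""

    def __init__(self, max_age=60.0, clock=time.monotonic):
        self.max_age = max_age  # seconds before a result counts as stale
        self.clock = clock
        self._results = {}

    def record(self, component, ok, detail=""):
        self._results[component] = (ok, detail, self.clock())

    def summary(self):
        """Cheap read for the healthcheck endpoint; runs no checks itself."""
        now = self.clock()
        report = {}
        for name, (ok, _detail, stamp) in self._results.items():
            if now - stamp > self.max_age:
                report[name] = "unknown"  # not exercised recently
            else:
                report[name] = "ok" if ok else "failed"
        healthy = "failed" not in report.values()
        return healthy, report


def list_networks(registry, db_query):
    """Stand-in for an authenticated API handler that touches the database."""
    try:
        rows = db_query()
    except Exception as exc:
        registry.record("database", False, str(exc))
        raise
    registry.record("database", True)
    return rows
```

Frequently used code paths refresh their entries often, while rarely used ones drift to "unknown", which matches the trade-off described above.<br>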

<br>

Each project will be responsible for implementing these checks. Since <br>

each project has a different architecture only they can say what <br>

constitutes "healthy" for their service. It's possible we could provide <br>

some common code for things like messaging and database that are used in <br>

many services, but it's likely that many projects will also need some <br>

custom checks.<br>

<br>

I think that covers the major outcomes of this discussion, but we have <br>

no notes from this session so if I forgot something let me know. ;-)<br>

"""<br>

<br>

0: <a href="http://blog.nemebean.com/content/oslo-virtual-ptg-victoria" rel="noreferrer" target="_blank">http://blog.nemebean.com/content/oslo-virtual-ptg-victoria</a><br>

<br>

On 11/23/20 3:28 AM, Lajos Katona wrote:<br>

> Hi Erno,<br>

> Thanks for the details, we will consider these.<br>

> <br>

> Regards<br>

> Lajos<br>

> <br>

> Erno Kuvaja <<a href="mailto:ekuvaja@redhat.com" target="_blank">ekuvaja@redhat.com</a> <mailto:<a href="mailto:ekuvaja@redhat.com" target="_blank">ekuvaja@redhat.com</a>>> wrote <br>

> (on Thu, Nov 19, 2020, 18:14):<br>

> <br>

>     On Mon, Nov 16, 2020 at 10:21 AM Lajos Katona <<a href="mailto:katonalala@gmail.com" target="_blank">katonalala@gmail.com</a><br>

>     <mailto:<a href="mailto:katonalala@gmail.com" target="_blank">katonalala@gmail.com</a>>> wrote:<br>

> <br>

>         Hi,<br>

> <br>

>         I am sending this mail to summarize the discussion around the<br>

>         Healthcheck API at the Neutron PTG, and to start a discussion on how<br>

>         we can make it most valuable to operators.<br>

> <br>

>         On the Neutron PTG etherpad this topic is from L114:<br>

>         <a href="https://etherpad.opendev.org/p/neutron-wallaby-ptg" rel="noreferrer" target="_blank">https://etherpad.opendev.org/p/neutron-wallaby-ptg</a> .<br>

> <br>

>         Background: oslo_middleware provides a /healthcheck API path (see<br>

>         [1]), which services like haproxy can poll, and a plugin<br>

>         mechanism for adding more complicated checks, which can be<br>

>         switched on/off from config.<br>

> <br>

>         The main questions:<br>

> <br>

>           * Some common guidance on what to present to the operators (if<br>

>             you check the comments in [2] and [3], there are really good<br>

>             questions/concerns)<br>

>               o Perhaps the API SIG has something about healthcheck, I<br>

>                 just can't find it.<br>

>           * What to present with and without authentication (after<br>

>             checking again, I am not sure that it is possible to use<br>

>             authentication for the healthcheck)<br>

>               o A way forward can be to make it configurable with<br>

>                 default to authenticated, and give the decision to the<br>

>                 admin.<br>

>           * During the discussion the agreement was to separate the<br>

>             frontend health from the backend health and use direct<br>

>             indicators (like working db connectivity, and mq<br>

>             connectivity) instead of indirect indicators (like agents'<br>

>             health).<br>

> <br>

>         Thanks in advance for the feedback.<br>

> <br>

>         [1]<br>

>         <a href="https://docs.openstack.org/oslo.middleware/latest/reference/healthcheck_plugins.html" rel="noreferrer" target="_blank">https://docs.openstack.org/oslo.middleware/latest/reference/healthcheck_plugins.html</a><br>

>         [2] <a href="https://review.opendev.org/731396" rel="noreferrer" target="_blank">https://review.opendev.org/731396</a><br>

>         [3] <a href="https://review.opendev.org/731554" rel="noreferrer" target="_blank">https://review.opendev.org/731554</a><br>

> <br>

>         Regards<br>

>         Lajos Katona (lajoskatona)<br>

> <br>

> <br>

>     Hi Lajos,<br>

> <br>

>     Bit of background in case you don't know. The oslo healthcheck<br>

>     middleware is basically a combination of healthcheck middlewares<br>

>     carried within a few projects ages ago, bloated with a plugin<br>

>     framework that I don't know of anyone ever adopting. The main point<br>

>     of those middlewares, carried by Swift (I think), Glance definitely<br>

>     and possibly some other projects before osloing it, was to give<br>

>     load balancers a place to ping that does not necessarily need to<br>

>     be logged every few seconds nor send excessive amounts<br>

>     of auth calls to keystone. If I recall correctly you can already<br>

>     place it after the keystone middleware if you prefer it being authed; I<br>

>     don't know of anyone who does.<br>

> <br>

>     The main purpose was to provide a way to detect if the service is not<br>

>     responding or, by using disable-by-file, to bleed off the in-flight<br>

>     connections for maintenance and drop the service from the pool for new<br>

>     requests. I think the original implementations were somewhere around<br>

>     10-20 lines of code and did just that job pretty reliably.<br>

> <br>

>     Based on the plugin model, it's indeed very easy to leak information<br>

>     out of that middleware and I think the plugins used need to take<br>

>     that into account by careful design. I'd very much prefer not<br>

>     breaking the current healthcheck and the very well stabilized API of<br>

>     it that has been in use for years just because someone feels like<br>

>     it's a good idea to make leaky plugins for it. Realizing that agent<br>

>     status might not be the right thing to check is a good start. What<br>

>     you really want is an indication of whether the API service is able to<br>

>     take in new requests or not, not whether all permutations of those<br>

>     requests will succeed on the current system status. Now, there are<br>

>     ways to chain multiples of these middlewares with different configs<br>

>     (with different endpoints), and it might be worth considering having<br>

>     your plugins with detailed failure conditions on the admin side that<br>

>     is not exposed to the public, and just a very simple yea/nay on your<br>

>     public endpoint. Good luck and I hope you find the correct balance<br>

>     between detail from the API and usability.<br>

> <br>

>     Best,<br>

>     Erno "jokke" Kuvaja<br>

> <br>

</blockquote></div>