Hi,

I am sending this mail to summarize the discussion around the healthcheck API at the Neutron PTG, and to start a discussion on how we can make it most valuable to operators. On the Neutron PTG etherpad this topic starts at line 114: https://etherpad.opendev.org/p/neutron-wallaby-ptg .

Background: oslo.middleware provides a /healthcheck API path (see [1]) which can be polled by services like haproxy, and it offers a plugin mechanism for adding more complicated checks that can be switched on/off from the config.

The main questions:

- Some common guidance on what to present to operators (the comments on [2] and [3] raise really good questions/concerns). Perhaps the API SIG has something about healthchecks; I just can't find it.
- What to present with and without authentication (after checking again, I am not sure it is even possible to use authentication for the healthcheck). A way forward could be to make it configurable, defaulting to authenticated, and leave the decision to the admin.
- During the discussion the agreement was to separate frontend health from backend health, and to use direct indicators (such as working DB and MQ connectivity) instead of indirect indicators (such as agents' health).

Thanks in advance for the feedback.

[1] https://docs.openstack.org/oslo.middleware/latest/reference/healthcheck_plug...
[2] https://review.opendev.org/731396
[3] https://review.opendev.org/731554

Regards
Lajos Katona (lajoskatona)
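To make the plugin mechanism mentioned above concrete: a custom check would hook into oslo.middleware roughly as in the sketch below. This is a minimal illustration against the pluginbase API documented in [1]; only disable_by_file ships with oslo.middleware today, so the class name, the "db_connectivity" idea, and the probe function are all hypothetical.

    # Hypothetical healthcheck plugin sketch (illustrative, not an
    # existing oslo.middleware plugin).
    from oslo_middleware.healthcheck import pluginbase


    def check_database():
        # Placeholder: a real plugin would run a cheap DB connectivity
        # probe here and raise on failure.
        pass


    class DBConnectivityHealthcheck(pluginbase.HealthcheckBaseExtension):
        def healthcheck(self, server_port):
            # Called by the healthcheck middleware for each poll; must
            # return a HealthcheckResult.
            try:
                check_database()
                return pluginbase.HealthcheckResult(
                    available=True, reason='database reachable')
            except Exception as exc:
                return pluginbase.HealthcheckResult(
                    available=False, reason='database unreachable: %s' % exc)

Plugins of this shape are discovered through the oslo.middleware.healthcheck stevedore entry-point namespace and switched on by listing them in the middleware's backends config option; haproxy then just polls GET /healthcheck and acts on the 200/503 status.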
Hi Lajos,

A bit of background in case you don't know it. The oslo healthcheck middleware is basically a combination of the healthcheck middlewares carried within a few projects ages ago, bloated with a plugin framework that I don't know of anyone ever adopting. The main point of those middlewares, carried by Swift (I think), definitely Glance, and possibly some other projects before being moved into oslo, was to give load balancers a place to ping that does not need to be logged every few seconds, nor to send excessive amounts of auth calls to keystone. If I recall correctly you can already place it after the keystone middleware if you prefer it being authed; I don't know of anyone who does.

The main purpose was to provide a way to detect that the service is not responding, or, by using the disable-by-file option, to bleed off the in-flight connections for maintenance and drop the node out of the pool for new requests. I think the original implementations were somewhere around 10-20 lines of code and did just that job pretty reliably.

Given the plugin model, it's indeed very easy to leak information out of that middleware, and I think the plugins used need to take that into account through careful design. I'd very much prefer not breaking the current healthcheck, and its very well stabilized API that has been in use for years, just because someone feels like it's a good idea to make leaky plugins for it. Realizing that agent status might not be the right thing to check is a good start: what you really want is an indication of whether the API service is able to take in new requests, not whether every permutation of those requests will succeed given the current system status. There are also ways to chain multiples of these middlewares with different configs (on different endpoints), so it might be worth considering having your plugins with detailed failure conditions on the admin side, not exposed to the public, and just a very simple yes/no on your public endpoint.

Good luck, and I hope you find the correct balance of detail and usability in the API.

Best,
Erno "jokke" Kuvaja
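To give a feel for what "around 10-20 lines" means in practice, those original middlewares were essentially a short-circuit in front of the WSGI app, roughly like the sketch below. This is a reconstruction from the description above, not the actual Swift/Glance code, and the disable-file path is made up:

    import os


    class Healthcheck(object):
        # Rough reconstruction of a pre-oslo healthcheck middleware.

        def __init__(self, app, disable_path='/etc/glance/healthcheck_disable'):
            self.app = app
            # Touching this file flips the check to 503 so the load
            # balancer drains the node ahead of maintenance.
            self.disable_path = disable_path

        def __call__(self, environ, start_response):
            if environ.get('PATH_INFO') == '/healthcheck':
                if os.path.exists(self.disable_path):
                    start_response('503 Service Unavailable',
                                   [('Content-Type', 'text/plain')])
                    return [b'DISABLED BY FILE']
                start_response('200 OK', [('Content-Type', 'text/plain')])
                return [b'OK']
            # Everything else falls through to the wrapped application.
            return self.app(environ, start_response)

Answering /healthcheck before the app (and before keystonemiddleware in the pipeline) is what keeps the poll traffic out of the logs and away from keystone.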
Hi Erno,

Thanks for the details, we will consider these.

Regards
Lajos
I finally found where I wrote down my notes from the discussion we had about this with Nova last cycle [0].

"""
Another longstanding topic that has recently come up is a standard healthcheck endpoint for OpenStack services. In the process of enabling the existing healthcheck middleware there was some question of how the healthchecks should work. Currently it's a very simple check: if the API process is running, it returns success. There is also an option to suppress the healthcheck based on the existence of a file. This allows a deployer to signal a load balancer that the API will be going down for maintenance.

However, there is obviously a lot more that goes into a given service's health. We've been discussing how to make the healthcheck more comprehensive since at least the Dublin PTG, but so far no one has been able to commit the time to make any of these plans happen. At the Denver PTG ~a year ago we agreed that the first step was to enable the healthcheck middleware by default in all services. Some progress has been made on that front, but when the change was proposed to Nova, they asked a number of questions related to the future improvements.

We revisited some of those questions at this PTG and came up with a plan to move forward that everyone seemed happy with. One concern was that we don't want to trigger resource-intensive healthchecks on unauthenticated calls to an API. In the original discussions the plan was to have healthchecks running in the background, with the API call just returning the latest results of the async checks. A small modification to that was made in this discussion: instead of having explicit async processes gather this data, it will be collected on regular authenticated API calls. In this way, regularly used functionality will be healthchecked more frequently, whereas less used areas of the service will not. In addition, only authenticated users will be able to trigger potentially resource-intensive healthchecks.

Each project will be responsible for implementing these checks. Since each project has a different architecture, only they can say what constitutes "healthy" for their service. It's possible we could provide some common code for things like messaging and database that are used in many services, but it's likely that many projects will also need some custom checks.

I think that covers the major outcomes of this discussion, but we have no notes from this session, so if I forgot something let me know. ;-)
"""

0: http://blog.nemebean.com/content/oslo-virtual-ptg-victoria
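The "collect on regular authenticated calls" idea boils down to caching the last observed backend status and serving the healthcheck response from that cache. A minimal sketch of the pattern follows; all of the names here are hypothetical (no such helper exists in oslo today):

    import threading
    import time


    class HealthState(object):
        # Last-known backend health, updated as a side effect of normal
        # authenticated API traffic rather than by the /healthcheck call.

        def __init__(self):
            self._lock = threading.Lock()
            self._results = {}  # e.g. {'db': (True, ts), 'mq': (False, ts)}

        def record(self, backend, ok):
            # Called from the normal request path, e.g. around DB/MQ access.
            with self._lock:
                self._results[backend] = (ok, time.time())

        def snapshot(self):
            # Called by the /healthcheck handler; never probes anything.
            with self._lock:
                return dict(self._results)


    state = HealthState()

    # On the authenticated request path, backend access would be wrapped:
    #     try:
    #         connection.execute(...)
    #         state.record('db', True)
    #     except Exception:
    #         state.record('db', False)
    #         raise
    #
    # The unauthenticated /healthcheck then only reads state.snapshot(),
    # so polling it stays cheap and cannot trigger expensive checks.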
Thanks Ben,

This was again really helpful for understanding the history of this feature, and it gives a glimpse of the future direction. I think with this the Neutron community (and others, of course) can now plan.

Regards
Lajos Katona (lajoskatona)
participants (3)
- Ben Nemec
- Erno Kuvaja
- Lajos Katona