Hi,
About a year and a half ago, I attempted to add /healthcheck support by default in all projects. For Nova, this resulted in this patch:
https://review.opendev.org/c/openstack/nova/+/724684
For other projects, it has been merged almost everywhere (I'd have to survey all projects to confirm that, or to see whether I still carry Debian-specific patches somewhere).
For Nova, though, the patch sparked a discussion in which the current implementation of /healthcheck was judged not good enough. This led to threads about how to do it better.
Unfortunately, this blocked my patch from being merged in Nova.
I think we should recognize a failure here. The /healthcheck URL was added in oslo.middleware so that one can use it with something like haproxy to verify that the API is up and responding. It was never designed to check, for example, whether nova-api has valid connectivity to MySQL and RabbitMQ. Yes, such checks would be welcome, but in the meantime, operators must tweak the default file to get a valid, usable /etc/nova/api-paste.ini.
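To illustrate how narrow that check is, here is a minimal Python sketch of the kind of probe a load balancer performs: a GET on /healthcheck, with success meaning an HTTP 200 answer. The function name api_is_up and the default timeout are made up for this example. It says nothing about MySQL or RabbitMQ connectivity.

```python
import http.client
import urllib.error
import urllib.request


def api_is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on the given healthcheck URL answers with HTTP 200.

    Hypothetical helper: it only shows that the API process accepts
    connections and responds, nothing about its database or message bus.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, malformed reply, or an HTTP error
        # status (urlopen raises HTTPError, a URLError subclass, for 4xx/5xx).
        return False
```

Assuming nova-api listens on its usual port 8774 on the local host (adjust to the deployment), a call would look like api_is_up("http://127.0.0.1:8774/healthcheck").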
So I am hereby asking the nova team:
Can we please move forward and agree that 1.5 years of waiting for such a minor patch is too long, and that such a patch should be approved before a better healthcheck mechanism exists? I don't think it's a good idea to ask Nova users to wait potentially several more development cycles for a good-by-default api-paste.ini file.
At the same time, I am wondering: is anyone even working on a better healthcheck system? I haven't heard of anyone doing so, though it would be more than welcome. Currently, to check that a daemon is alive and well, operators are stuck with:
- checking with ss whether the daemon is correctly connected to a given port
- checking the logs for RabbitMQ and MySQL errors (with something like Filebeat + Elasticsearch and alarming)
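As a rough sketch of the first check, the Python below runs `ss -tn` and looks for an established TCP connection whose peer port matches, for example 5672 for RabbitMQ or 3306 for MySQL. The function names are made up for this example, and it assumes the usual column layout of `ss -tn` (State, Recv-Q, Send-Q, local address:port, peer address:port). It does not tell which process owns the connection, which would need `ss -tnp` and usually root.

```python
import subprocess


def has_established_connection(ss_output: str, peer_port: int) -> bool:
    """Return True if any ESTAB line of `ss -tn` output has the given peer port."""
    for line in ss_output.splitlines():
        fields = line.split()
        # Assumed columns: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue
        # rsplit keeps working for bracketed IPv6 peers such as [::1]:5672
        port = fields[4].rsplit(":", 1)[-1]
        if port == str(peer_port):
            return True
    return False


def host_connected_to(peer_port: int) -> bool:
    """Run `ss -tn` on this host and search its output (hypothetical wrapper)."""
    result = subprocess.run(
        ["ss", "-tn"], capture_output=True, text=True, check=True
    )
    return has_established_connection(result.stdout, peer_port)
```

Every daemon on every host needs its own variant of this, plus the log scraping, which is part of why the approach is hard to maintain.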
Clearly, this doesn't scale. When running many large OpenStack clusters, it is not trivial to have a monitoring system that works and scales, and the effort to deploy such a system is not trivial either. So what was discussed at the time for improving monitoring would be very welcome, and not only for the API service: something to check the health of the other daemons would be just as useful.
I'd very much like to participate in a Yoga effort to improve the current situation, and contribute as best I can, though I'm not sure I'd be the best person to drive this... Is there anyone else willing to work on this?
Hoping this message is helpful,
Cheers,
Thomas Goirand (zigo)