[nova][all] Adding /healthcheck support in Nova, and better healthcheck in every project

Thomas Goirand zigo at debian.org
Wed Nov 17 09:22:51 UTC 2021


Hi,

About a year and a half ago, I attempted to add /healthcheck support by
default in all projects. For Nova, this resulted in this patch:

https://review.opendev.org/c/openstack/nova/+/724684

For other projects, it has been merged almost everywhere (I'd have to
survey all projects to see if that's the case, or if I still have
Debian-specific patches somewhere).

For Nova, though, this sparked a discussion in which it was said that
the current implementation of /healthcheck wasn't good enough. This
resulted in threads about how to do it better.

Unfortunately, this blocked my patch from being merged in Nova.

In my view, we should recognize this as a failure. The /healthcheck URL
was added in oslo.middleware so that one can use it with something like
haproxy to verify that the API is up and responds. It was never
designed to check, for example, whether nova-api has working
connectivity to MySQL and RabbitMQ. Yes, such a check would be welcome,
but in the meantime, operators must tweak the default file to have a
valid, usable /etc/nova/api-paste.ini.
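
To illustrate what "up and responds" means here, below is a minimal
sketch in Python of the kind of probe a load balancer performs: one GET
on /healthcheck, with an HTTP 200 treated as healthy. It is not part of
the patch. The function name api_is_up is hypothetical, and the host and
port in the usage line are assumptions to adapt to your deployment.

```python
import http.client
import urllib.error
import urllib.request


def api_is_up(url, timeout=2.0):
    """Return True if a GET on `url` answers HTTP 200 within `timeout` seconds.

    Hypothetical helper: it only tells us that the API process answers
    HTTP requests, not that its database or message bus connections work.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, DNS failure, malformed reply,
        # or an HTTP error status (4xx/5xx).
        return False


if __name__ == "__main__":
    # Assumed endpoint: adjust host and port to where nova-api listens.
    print(api_is_up("http://127.0.0.1:8774/healthcheck"))
```

This is deliberately shallow: it matches the original purpose of the
middleware and says nothing about MySQL or RabbitMQ.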

So I am hereby asking the nova team:

Can we please move forward and agree that 1.5 years of waiting for such
a minor patch is too long, and that such a patch should be approved
before we have a better healthcheck mechanism? I don't think it's a good
idea to ask Nova users to wait potentially several more development
cycles to have a good-by-default api-paste.ini file.

At the same time, I am wondering: is anyone even working on a better
healthcheck system? I haven't heard of anyone working on this, though it
would be more than welcome. Currently, to check that a daemon is alive
and well, operators are stuck with:

- checking with ss whether the daemon is correctly connected to a given
port
- checking the logs for RabbitMQ and MySQL errors (with something like
filebeat + Elasticsearch and alerting)
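
As a rough sketch of what the first check looks like once scripted, the
Python below counts established TCP connections to a given peer port in
the output of "ss -tn". The function names are hypothetical, and the
parsing assumes the usual column layout of ss (State, Recv-Q, Send-Q,
Local Address:Port, Peer Address:Port). It counts connections for the
whole host, not for one daemon.

```python
import subprocess


def count_established(ss_output, peer_port):
    """Count ESTAB lines in `ss -tn` output whose peer port is `peer_port`."""
    count = 0
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header, short lines, and sockets that are not established.
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue
        # The peer column is "address:port"; IPv6 looks like "[::1]:5672".
        _, _, port = fields[4].rpartition(":")
        if port == str(peer_port):
            count += 1
    return count


def established_to_port(peer_port):
    """Run `ss -tn` on this host and count connections to `peer_port`."""
    result = subprocess.run(
        ["ss", "-tn"], capture_output=True, text=True, check=True
    )
    return count_established(result.stdout, peer_port)


if __name__ == "__main__":
    # 5672 and 3306 are the default AMQP and MySQL ports; adjust as needed.
    for port in (5672, 3306):
        print(port, established_to_port(port))
```

Having to write and maintain this kind of glue for every daemon on every
host is exactly the burden described here.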

Clearly, this doesn't scale. When running many large OpenStack clusters,
it is not trivial to have a monitoring system that works and scales, and
the effort to deploy such a system is not trivial either. So what was
discussed at the time for improving the monitoring would be very much
welcome, and not only for the API service: something to check the
health of the other daemons is needed as well.

I would very much like to participate in a Yoga effort to improve the
current situation, and contribute as best I can, though I'm not sure
I'd be the best person to drive this... Is there anyone else willing to
work on this?

Hoping this message is helpful,
Cheers,

Thomas Goirand (zigo)


