[nova][all] Adding /healthcheck support in Nova, and better healthcheck in every projects

Sean Mooney smooney at redhat.com
Wed Nov 17 13:13:57 UTC 2021

On Wed, 2021-11-17 at 10:22 +0100, Thomas Goirand wrote:
> Hi,
> About a year and a half ago, I attempted to add /healthcheck support by
> default in all projects. For Nova, this resulted in this patch:
> https://review.opendev.org/c/openstack/nova/+/724684
> For other projects, it's been merged almost everywhere (I'd have to
> survey all projects to see if that's the case, or if I still have Debian
> specific patches somewhere).
> Though for Nova, this sparked a discussion where it's been said that the
> current implementation of /healthcheck wasn't good enough. This resulted
> in threads about how to better do it.
> Unfortunately, this blocked my patch from being merged in Nova.
> It is my point of view to recognize a failure here. The /healthcheck URL
> was added in oslo.middleware so one can use it with something like
> haproxy to verify that the API is up, and responds. It was never
> designed to check, for example, if nova-api has a valid connectivity to
> MySQL and RabbitMQ. Yes, this is welcome, but in the meantime,
> operators must tweak the default file to have a valid, usable
> /etc/nova/api-paste.ini.
> So I am hereby asking the nova team:
> Can we please move forward and agree that 1.5 years waiting for such a
> minor patch is too long, and that such patch should be approved, prior
> to having a better healthcheck mechanism? I don't think it's a good idea
> to ask Nova users to wait potentially more development cycles to have a
> good-by-default api-paste.ini file.
I am currently working on an alternative solution for this cycle.
I still believe it would be incorrect to add the healthcheck provided by oslo.middleware to nova.
We discussed this at the PTG this cycle and still did not think it was the correct way to approach this,
but we did agree to work on adding an alternative form of health checks this cycle.
I fundamentally believe bad healthchecks are worse than no healthchecks, and the oslo middleware provides bad healthchecks.

Since the /healthcheck endpoint can be added via api-paste.ini manually, I don't think we should add it to our
default, or that packagers should either.

One open question in my draft spec is whether, for the nova API in particular, we should support /healthcheck on the normal API port instead of
the dedicated health check endpoint.

> At the same time, I am wondering: is anyone even working on a better
> healthcheck system? I haven't heard that anyone is working on this.

Yes. I need to push the spec for review; I'll see if I can do that today, or at a minimum this week.
The tl;dr is as follows.

Nova will be extended with 2 additional options to allow a health check endpoint to be exposed on a TCP port
and/or a unix socket. These health check endpoints will not be authenticated and will be disabled by default.
All nova binaries (nova-api, nova-scheduler, nova-compute, ...) will support exposing the endpoint.
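
A minimal sketch of what an unauthenticated, in-process endpoint on a TCP port could look like, using only the Python standard library. The names `HEALTH`, `HealthHandler`, `start_health_server` and `fetch_health` are hypothetical and not taken from the spec; the unix socket variant and the options that would enable the endpoint are left out.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical in-process state; a real service would update it as it works.
HEALTH = {"status": "OK"}


class HealthHandler(BaseHTTPRequestHandler):
    """Serve the current health state as JSON, with no authentication."""

    def do_GET(self):
        body = json.dumps(HEALTH).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep the sketch quiet; frequent polling should not flood the logs.
        pass


def start_health_server(host="127.0.0.1", port=0):
    """Start the endpoint in a daemon thread (port 0 picks a free port)."""
    server = ThreadingHTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_health(server):
    """Query the endpoint the way a local monitoring tool might."""
    host, port = server.server_address[:2]
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/")
        return json.loads(conn.getresponse().read().decode("utf-8"))
    finally:
        conn.close()
```

Binding to 127.0.0.1 here is an assumption that keeps the unauthenticated endpoint local to the host.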

The processes will internally update a healthcheck data structure whenever they perform specific operations that
can be used as a proxy for the health of the binary (db query, rpc ping, request to libvirt); these will be binary specific.

The overall health will be summarised with a status enum, exact values to be determined, but I'm working with (OK, DEGRADED, FAULT)
for now. In the degraded and fault states there will also be a message and likely a details field in the response.
The message would be human readable, with the details being the actual content of the health check data structure.
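
As a rough illustration of the response shape described above, here is a sketch in Python. The field names `status`, `message` and `details`, the helper `summarise`, and the "worst status wins" rule are assumptions for illustration only; the spec has not fixed any of them.

```python
import enum
import json


class Status(enum.Enum):
    OK = "OK"
    DEGRADED = "DEGRADED"
    FAULT = "FAULT"


# Assumed severity order, used to pick the worst status across all checks.
_SEVERITY = {Status.OK: 0, Status.DEGRADED: 1, Status.FAULT: 2}


def summarise(checks):
    """Build a JSON-serialisable response from {check name: Status}."""
    overall = max(checks.values(), key=_SEVERITY.get, default=Status.OK)
    response = {"status": overall.value}
    if overall is not Status.OK:
        # Message and details are only included when something is wrong.
        failing = sorted(n for n, s in checks.items() if s is not Status.OK)
        response["message"] = "unhealthy checks: " + ", ".join(failing)
        response["details"] = {n: s.value for n, s in checks.items()}
    return response


if __name__ == "__main__":
    print(json.dumps(summarise({"db": Status.OK, "rpc": Status.FAULT})))
```

A consumer would then only need to parse the JSON body and compare the `status` field.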

I have not decided if I should use HTTP status codes as part of the way to signal the status. My instincts are saying no:
parsing the JSON response should be simple if you just need to check the status field for ok|degraded|fault, and using a 5XX error code
in the degraded or fault case would not be semantically correct.

The current set of use cases I am using to drive the design of the spec is as follows.

Use Cases

As an operator I want a simple health-check I can consume to know
if a nova process is OK, Degraded or Faulty.

As an operator I want this health-check to not impact the performance of the
service so it can be queried frequently at short intervals.

As a deployment tool implementer I want the health check to be local, with no
dependencies on other hosts or services to function, so I can integrate it with
service managers such as systemd or container runtimes like docker.

As a packager I would like health-checks to not require special clients or
packages to consume them. curl, socat or netcat should be all that is required to
connect to the health check and retrieve the service status.

As an operator I would like to be able to use the health-check of the nova api and
metadata services to manage the membership of endpoints in my load-balancer
or reverse proxy automatically.

> Though it would be more than welcome. Currently, to check that a daemon
> is alive and well, operators are stuck with:
> - checking with ss if the daemon is correctly connected to a given port
> - check the logs for rabbitmq and mysql errors (with something like
> filebeat + elastic search and alarming)
> Clearly, this doesn't scale. When running many large OpenStack clusters,
> it is not trivial to have a monitoring system that works and scales. The
> effort to deploy such a monitoring system is also not trivial at all. So
> what's been discussed at the time for improving the monitoring would be
> very much welcome, though not only for the API service: something to
> check the health of other daemons would be very much welcome.
> I'd very much like to participate in a Yoga effort to improve the
> current situation, and contribute the best I can, though I'm not sure
> I'd be the best person to drive this... Is there anyone else willing to
> work on this?

Yep, I am. Feel free to ping me on IRC: sean-k-mooney, in case you're wondering, but we have talked before.
I have not configured my default channels since the change to OFTC, but I'm always in at least #openstack-nova.

After discussing this in the nova PTG session, the design took a hard right turn from being based on an RPC-like protocol
exposed over a unix socket, with OVOs as the data format and active probes, to an HTTP-based endpoint, available over TCP and/or a unix socket,
with JSON as the response format and a semi-global data structure with a TTL for the data.
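
To make the "data structure with TTL" idea concrete, here is a small sketch. The class name `HealthData`, its methods, and the choice to report expired entries as `"stale"` are all assumptions, since the thread does not say how expired data should be treated.

```python
import time


class HealthData:
    """Store the last outcome of each check together with a timestamp."""

    def __init__(self, ttl=60.0, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock  # injectable so tests can control time
        self._records = {}

    def record(self, name, ok):
        """Called by the service whenever it performs a tracked operation."""
        self._records[name] = (bool(ok), self._clock())

    def snapshot(self):
        """Return the state of every check, marking expired entries stale."""
        now = self._clock()
        result = {}
        for name, (ok, stamp) in self._records.items():
            if now - stamp > self._ttl:
                result[name] = "stale"
            else:
                result[name] = "ok" if ok else "failed"
        return result
```

Because the endpoint only reads this structure, answering a health query does not trigger any database, RPC or libvirt call, which is what lets it be polled at short intervals.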

As a result I have had to rethink and rework most of the draft spec I had prepared.
The main point of design that we need to agree on is exactly how that data structure is accessed and where it is stored.

In the original design I proposed, there was no need to store any kind of state or to modify existing functions to add healthchecks.
Each nova service manager would just implement a new healthcheck function that would be passed as a callback to the healthcheck manager, which exposed the endpoint.
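
The original callback-based design could be sketched roughly as follows. `HealthCheckManager`, `register`, `probe` and `FakeComputeManager` are hypothetical names used only for illustration and are not nova classes.

```python
class HealthCheckManager:
    """Hold one callback per service and run them on demand (active probe)."""

    def __init__(self):
        self._callbacks = {}

    def register(self, name, callback):
        self._callbacks[name] = callback

    def probe(self):
        """Call every registered callback; an exception counts as unhealthy."""
        results = {}
        for name, callback in self._callbacks.items():
            try:
                results[name] = bool(callback())
            except Exception:
                results[name] = False
        return results


class FakeComputeManager:
    """Stand-in for a service manager that implements a healthcheck method."""

    def __init__(self, libvirt_reachable=True):
        self.libvirt_reachable = libvirt_reachable

    def healthcheck(self):
        return self.libvirt_reachable


if __name__ == "__main__":
    manager = HealthCheckManager()
    manager.register("compute", FakeComputeManager().healthcheck)
    print(manager.probe())
```

Nothing is stored between probes here, which matches the "no state" property of that design, at the cost of doing work on every query.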

With the new approach we will likely add decorators to important functions that will update the healthchecks based on whether that function completes correctly.
If we take the decorator approach, then because of how decorators work, the decorator can only access module-level variables, class methods/members, or the parameters of the function it is decorating.
What that effectively means is that either the health check manager needs to be stored in a module-level "global" variable, it needs to be a singleton accessible via a class method,
or it needs to be stored in a data structure that is passed to almost every function, specifically the context object.

I am leaning towards the context object, but I need to understand how that will interact with RPC calls, so it might end up being a global/singleton, which sucks from a unit/functional testing
perspective, but we can make it work via fixtures.
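
A minimal sketch of the module-level "global" variant of the decorator approach. The names `_HEALTH`, `updates_health`, `health_snapshot`, `reset_health` and `db_query` are hypothetical; a context-object variant would instead look the registry up from an argument of the decorated function.

```python
import functools

# Module-level registry: the "global" option discussed above.
_HEALTH = {}


def updates_health(name):
    """Record whether the decorated function completed without raising."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
            except Exception:
                _HEALTH[name] = False
                raise  # the caller still sees the original error
            _HEALTH[name] = True
            return result
        return wrapper
    return decorator


def health_snapshot():
    """Return a copy so callers cannot mutate the registry."""
    return dict(_HEALTH)


def reset_health():
    """What a test fixture would call to isolate tests from each other."""
    _HEALTH.clear()


@updates_health("db")
def db_query(fail=False):
    # Stand-in for a real operation such as a database query.
    if fail:
        raise RuntimeError("database unreachable")
    return "rows"
```

The `reset_health` helper shows why global state is awkward for testing: every test has to clear it, which is the kind of job a fixture would take over.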

Hopefully this sounds like good news to you, but feel free to give feedback.
> Hoping this message is helpful,
> Cheers,
> Thomas Goirand (zigo)
