[nova][all] Adding /healthcheck support in Nova, and better healthcheck in every projects
zigo at debian.org
Wed Nov 17 16:03:20 UTC 2021
Hi Sean, thanks for your reply!
On 11/17/21 2:13 PM, Sean Mooney wrote:
> I am currently working on an alternative solution for this cycle.
> I still believe it would be incorrect to add the healthcheck provided by oslo.middleware to nova.
> We discussed this at the PTG this cycle and still did not think it was the correct way to approach this,
> but we did agree to work on adding an alternative form of health checks this cycle.
> I fundamentally believe bad healthchecks are worse than no health checks, and the oslo middleware provides bad healthchecks.
The current implementation is only useful for plugging haproxy to APIs,
nothing more, nothing less.
> Since the /healthcheck endpoint can be added via api-paste.ini manually, I don't think we should add it to our
> defaults, or that packagers should either.
Like it or not, the current state of things is:
- /healthcheck is activated everywhere (I patched that myself)
- The nova package at least in Debian has it activated by default (as
this is the only project that refused the patch, I carry it in the package).
Also, many operators already use the /healthcheck in production, so you
really want to keep it. IMO, your implementation should switch to a
different endpoint if you wish to not retain compatibility with the
For this reason, I strongly believe that the Nova team should
revise the view it took a year and a half ago, and accept the imperfect
currently implemented /healthcheck. This is not mutually exclusive with a
better implementation bound to some other URL.
> One open question in my draft spec is whether, for the nova api in particular, we should support /healthcheck on the normal api port instead of
> the dedicated health check endpoint.
You should absolutely not break backward compatibility!!!
> Yes, so I need to push the spec for review. I'll see if I can do that today, or at a minimum this week.
> The tl;dr is as follows.
> Nova will be extended with 2 additional options to allow a health check endpoint to be exposed on a tcp port
> and/or a unix socket. These health check endpoints will not be authenticated and will be disabled by default.
> All nova binaries (nova-api, nova-scheduler, nova-compute, ...) will support exposing the endpoint.
> The processes will internally update a healthcheck data structure whenever they perform specific operations that
> can be used as a proxy for the health of the binary (db query, rpc ping, request to libvirt); these will be binary specific.
> The overall health will be summarised with a status enum, exact values to be determined, but I'm working with (OK, DEGRADED, FAULT)
> for now. In the degraded and fault states there will also be a message and likely a details field in the response.
> The message would be human readable, with details being the actual content of the health check data structure.
> I have not decided if I should use http status codes as part of the way to signal the status. My instinct is saying no:
> parsing the json response should be simple if you just need to check the status field for ok|degraded|fault, and using a 5XX error code
> in the degraded or fault case would not be semantically correct.
All you wrote above is great. For the http status codes, please
implement it, because it's cheap, and that's how Zabbix (and probably
other monitoring systems) work, plus everyone understands them.
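
To make the two positions concrete, here is a minimal sketch of a response
builder that returns both a JSON body and an HTTP status code. The status names
come from the draft described above; the field names, the `build_response`
helper and the code mapping (200 for OK and DEGRADED, 503 for FAULT) are only
assumptions for illustration, not something the spec has settled.

```python
import enum
import json


class HealthStatus(enum.Enum):
    """Status values mentioned in the draft (exact values still to be decided)."""
    OK = "OK"
    DEGRADED = "DEGRADED"
    FAULT = "FAULT"


# Hypothetical mapping: the thread has not agreed on any HTTP codes.
HTTP_CODES = {
    HealthStatus.OK: 200,
    HealthStatus.DEGRADED: 200,
    HealthStatus.FAULT: 503,
}


def build_response(status, message=None, details=None):
    """Return (http_code, json_body) for a given health status."""
    payload = {"status": status.value}
    if status is not HealthStatus.OK:
        # Message and details are only included when something is wrong.
        payload["message"] = message or ""
        payload["details"] = details or {}
    return HTTP_CODES[status], json.dumps(payload, sort_keys=True)
```

With something like this, a monitoring system that only looks at the status
code and a client that parses the JSON body would both be served by the same
endpoint.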
> Use Cases
> As an operator I want a simple health-check I can consume to know
> if a nova process is OK, Degraded or Faulty.
> As an operator I want this health-check to not impact performance of the
> service so it can be queried frequently at short intervals.
> As a deployment tool implementer I want the health check to be local with no
> dependencies on other hosts or services to function so I can integrate it with
> service managers such as systemd or container runtimes like docker.
> As a packager I would like health-checks to not require special clients or
> packages to consume them. curl, socat or netcat should be all that is required to
> connect to the health check and retrieve the service status.
> As an operator I would like to be able to use health-checks of the nova api and
> metadata services to manage the membership of endpoints in my load-balancer
> or reverse proxy automatically.
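
As a rough illustration of the "local, unauthenticated, curl-able" use cases
above, a health endpoint can be served with nothing but the Python standard
library. This is only a sketch: the `/health` path, the `make_handler` and
`make_server` helpers and the loopback default are assumptions of mine, not
part of the draft spec.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_handler(get_health):
    """Build a handler class that serves whatever get_health() returns."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Hypothetical path; the real endpoint name is not decided here.
            if self.path != "/health":
                self.send_error(404)
                return
            body = json.dumps(get_health()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            # Stay quiet so frequent polling does not flood the logs.
            pass

    return HealthHandler


def make_server(get_health, host="127.0.0.1", port=0):
    """Bind to loopback by default; port 0 lets the OS pick a free port."""
    return ThreadingHTTPServer((host, port), make_handler(get_health))
```

Calling `make_server(lambda: {"status": "OK"}).serve_forever()` in a thread
would then let `curl http://127.0.0.1:<port>/health` retrieve the status
without any special client.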
>> Though it would be more than welcome. Currently, to check that a daemon
>> is alive and well, operators are stuck with:
>> - checking with ss if the daemon is correctly connected to a given port
>> - checking the logs for rabbitmq and mysql errors (with something like
>> filebeat + Elasticsearch and alarming)
>> Clearly, this doesn't scale. When running many large OpenStack clusters,
>> it is not trivial to have a monitoring system that works and scales. The
>> effort to deploy such a monitoring system is also not trivial at all. So
>> what's been discussed at the time for improving the monitoring would be
>> very much welcome, though not only for the API service: something to
>> check the health of other daemons would be very much welcome.
>> I'd very much like to participate in a Yoga effort to improve the
>> current situation, and contribute the best I can, though I'm not sure
>> I'd be the best person to drive this... Is there anyone else willing to
>> work on this?
> Yep, I am. Feel free to ping me on irc (sean-k-mooney, in case you're wondering), but we have talked before.
Yes. Feel free to ping me as well, I'll enjoy contributing where I can
(though I know you're more skilled than I am in OpenStack's Python
code... I'll still do what I can).
> I have not configured my default channels since the change to OFTC, but I'm always in at least #openstack-nova.
> After discussing this in the nova PTG session, the design took a hard right turn from being based on an rpc-like protocol
> exposed over a unix socket, with ovos as the data format and active probes, to an http-based endpoint, available over tcp and/or unix socket,
> with json as the response format and a semi-global data structure with a TTL for the data.
> As a result I have had to rethink and rework most of the draft spec I had prepared.
> The main point of design that we need to agree on is exactly how that data structure is accessed and where it is stored.
> In the original design I proposed, there was no need to store any kind of state or modify existing functions to add healthchecks.
> Each nova service manager would just implement a new healthcheck function that would be passed as a callback to the healthcheck manager, which exposed the endpoint.
> With the new approach we will likely add decorators to important functions that will update the healthchecks based on whether that function completes correctly.
> If we take the decorator approach, because of how decorators work, it can only access module-level variables, class methods/members or the parameters of the function it is decorating.
> What that effectively means is either the health check manager needs to be stored in a module-level "global" variable, it needs to be a singleton accessible via a class method,
> or it needs to be stored in a data structure that is passed to almost every function, specifically the context object.
> I am leaning towards the context object, but I need to understand how that will interact with RPC calls, so it might end up being a global/singleton, which sucks from a unit/functional testing
> perspective, but we can make it work via fixtures.
> Hopefully this sounds like good news to you, but feel free to give feedback.
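
For readers trying to picture the decorator approach with a module-level
"global", here is a minimal sketch. Everything in it is hypothetical
(`HealthRegistry`, `health_probe`, the TTL default, and the rule used to
summarise results); it only illustrates the trade-off discussed above, where
the decorator reaches a module-level object but tests can inject their own
registry instead.

```python
import functools
import threading
import time


class HealthRegistry:
    """Semi-global store of recent probe results, each with a timestamp."""

    def __init__(self, ttl=60.0, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._lock = threading.Lock()
        self._results = {}  # name -> (ok, error text, timestamp)

    def record(self, name, ok, error=None):
        with self._lock:
            self._results[name] = (ok, error, self._clock())

    def summary(self):
        now = self._clock()
        with self._lock:
            fresh = {n: r for n, r in self._results.items()
                     if now - r[2] <= self._ttl}
        if not fresh:
            # Assumption: no data within the TTL is reported as DEGRADED.
            return "DEGRADED"
        failed = [n for n, r in fresh.items() if not r[0]]
        if not failed:
            return "OK"
        return "FAULT" if len(failed) == len(fresh) else "DEGRADED"


# Module-level "global" that decorated functions fall back to.
REGISTRY = HealthRegistry()


def health_probe(name, registry=None):
    """Record whether the decorated function completed without raising."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            reg = registry if registry is not None else REGISTRY
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                reg.record(name, False, str(exc))
                raise
            reg.record(name, True)
            return result

        return wrapper

    return decorator
```

Passing an explicit `registry` (with a fake `clock`) is one way a test fixture
could avoid touching the global, which is the testing concern raised above.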
I don't like the fact that we're still having this discussion 1.5 years
after the proposed patch, and that this still delays Nova following
what all the other projects have approved.
Again, what you're doing should not be mutually exclusive with adding
what already works, and what is already in production. It's been said a
year and a half ago, and it's still true. A year and a half ago, we
even discussed the fact it would be a shame if it took more than a year...
So can we move forward?
Anyways, I'm excited that this goes forward, so thanks again for leading
Thomas Goirand (zigo)