[nova][all] Adding /healthcheck support in Nova, and better healthchecks in every project

Mohammed Naser mnaser at vexxhost.com
Wed Nov 17 20:53:22 UTC 2021


I don't think we rely on /healthcheck -- there's nothing healthy about
an API endpoint blindly returning a 200 OK.

You might as well just hit / and accept 300 as a status code; that's
exactly the same behaviour.  I support what Sean is bringing up here,
and I don't think it makes sense to have a no-op /healthcheck that
always gives a 200 OK... it seems a bit useless, IMHO.

On Wed, Nov 17, 2021 at 11:09 AM Thomas Goirand <zigo at debian.org> wrote:
>
> Hi Sean, thanks for your reply!
>
> On 11/17/21 2:13 PM, Sean Mooney wrote:
> > I am currently working on an alternative solution for this cycle.
>
> gr8!
>
> > I still believe it would be incorrect to add the healthcheck provided by oslo.middleware to Nova.
> > We discussed this at the PTG this cycle and still did not think it was the correct way to approach this,
> > but we did agree to work on adding an alternative form of health checks this cycle.
> > I fundamentally believe bad health checks are worse than no health checks, and the oslo.middleware one provides bad health checks.
>
> The current implementation is only useful for plugging haproxy to APIs,
> nothing more, nothing less.
>
> > Since the /healthcheck endpoint can be added via api-paste.ini manually, I don't think we should add it to our
> > defaults, or that packagers should either.
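For reference, wiring the oslo.middleware healthcheck in by hand looks roughly like the fragment below. The section names, the urlmap factory path, and the disable-by-file path are illustrative and vary between projects and releases, so treat this as a sketch rather than a drop-in config:

```ini
; Hypothetical api-paste.ini fragment (names and paths are examples)
[composite:osapi_compute]
use = call:nova.api.openstack.urlmap:urlmap_factory
/healthcheck: healthcheck

[app:healthcheck]
paste.app_factory = oslo_middleware:Healthcheck.app_factory
backends = disable_by_file
disable_by_file_path = /etc/nova/healthcheck_disable
```

With the `disable_by_file` backend, touching the configured file makes the endpoint return 503, which is how operators drain a node out of haproxy without stopping the service.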
>
> Like it or not, the current state of things is:
> - /healthcheck is activated everywhere (I patched that myself)
> - The Nova package, at least in Debian, has it activated by default (as
> Nova is the only project that refused the patch, I carry it in the package).
>
> Also, many operators already use the /healthcheck in production, so you
> really want to keep it. IMO, your implementation should switch to a
> different endpoint if you do not wish to retain compatibility with the
> older system.
>
> For this reason, I strongly believe that the Nova team should revise
> its position from a year and a half ago, and accept the imperfect
> /healthcheck as currently implemented. This is not mutually exclusive
> with a better implementation bound to some other URL.
>
> > One open question in my draft spec is, for the nova-api in particular, whether we should support /healthcheck on the normal API port instead of
> > the dedicated health check endpoint.
>
> You should absolutely not break backward compatibility!!!
>
> > Yes, I need to push the spec for review; I'll see if I can do that today, or at a minimum this week.
> > The tl;dr is as follows.
> >
> > Nova will be extended with two additional options to allow a health check endpoint to be exposed on a TCP port
> > and/or a unix socket. These health check endpoints will not be authenticated and will be disabled by default.
> > All Nova binaries (nova-api, nova-scheduler, nova-compute, ...) will support exposing the endpoint.
> >
> > The processes will internally update a healthcheck data structure whenever they perform specific operations that
> > can be used as a proxy for the health of the binary (DB query, RPC ping, request to libvirt); these will be binary-specific.
> >
> > The overall health will be summarised with a status enum, exact values to be determined, but I am working with (OK, DEGRADED, FAULT)
> > for now. In the DEGRADED and FAULT states there will also be a message, and likely a details field, in the response.
> > The message would be human-readable, with details being the actual content of the health check data structure.
> >
> > I have not decided if I should use HTTP status codes as part of the way to signal the status; my instinct says no.
> > Parsing the JSON response should be simple if you just need to check the status field for OK|DEGRADED|FAULT, and using a 5XX error code
> > in the DEGRADED or FAULT case would not be semantically correct.
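The response shape described above — a status enum of OK/DEGRADED/FAULT, plus message and details fields when unhealthy, with an optional mapping to HTTP status codes — could be sketched as follows. All names here are illustrative, not the actual spec:

```python
# Sketch of the proposed healthcheck response format; every identifier
# here is an assumption for illustration, not Nova code.
import enum
import json


class HealthStatus(enum.Enum):
    OK = "OK"
    DEGRADED = "DEGRADED"
    FAULT = "FAULT"


def build_response(status, checks):
    """Serialise the healthcheck data structure into the JSON body."""
    body = {"status": status.value}
    if status is not HealthStatus.OK:
        # Human-readable summary plus the raw check data, as proposed.
        failing = [name for name, ok in checks.items() if not ok]
        body["message"] = "failing checks: %s" % ", ".join(sorted(failing))
        body["details"] = checks
    return json.dumps(body)


def http_code_for(status):
    # The open question in the thread: always 200 with the state in the
    # body, or 5XX for DEGRADED/FAULT so dumb pollers (haproxy, Zabbix)
    # can key off the status code alone.  This sketch does the latter.
    return 200 if status is HealthStatus.OK else 503


checks = {"db": True, "rpc": False, "libvirt": True}
status = HealthStatus.OK if all(checks.values()) else HealthStatus.DEGRADED
print(http_code_for(status), build_response(status, checks))
```

A consumer that only needs a boolean can look at the HTTP code; a richer monitor can parse the JSON and inspect `details` per check.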
>
> All you wrote above is great. As for the HTTP status codes, please
> implement them, because it's cheap, and that's how Zabbix (and probably
> other monitoring systems) works; plus, everyone understands them.
>
> > Use Cases
> > ---------
> >
> > As an operator I want a simple health check I can consume to know
> > if a Nova process is OK, DEGRADED, or FAULT.
> >
> > As an operator I want this health check to not impact the performance of the
> > service, so it can be queried frequently at short intervals.
> >
> > As a deployment tool implementer I want the health check to be local, with no
> > dependencies on other hosts or services to function, so I can integrate it with
> > service managers such as systemd or container runtimes like Docker.
> >
> > As a packager I would like health checks to not require special clients or
> > packages to consume them. curl, socat, or netcat should be all that is required to
> > connect to the health check and retrieve the service status.
> >
> > As an operator I would like to be able to use the health checks of the nova-api and
> > metadata services to manage the membership of endpoints in my load balancer
> > or reverse proxy automatically.
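A minimal sketch of those use cases taken together: an unauthenticated, dependency-free HTTP health endpoint served from the process itself, answering the same question a `curl http://127.0.0.1:<port>/healthcheck` would ask. The port, path, and payload are illustrative assumptions; a self-test client stands in for curl so the example is self-contained:

```python
# Sketch only: a stdlib HTTP health endpoint like the one proposed.
# Port, URL path, and payload are assumptions for illustration.
import http.client
import http.server
import json
import threading


class HealthHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # A real implementation would report the process's actual state;
        # this sketch always answers OK.
        body = json.dumps({"status": "OK"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


# Port 0 asks the OS for a free ephemeral port.
server = http.server.HTTPServer(("127.0.0.1", 0), HealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# What "curl http://127.0.0.1:<port>/healthcheck" would do:
conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
conn.request("GET", "/healthcheck")
resp = conn.getresponse()
payload = json.loads(resp.read())
print(resp.status, payload)
server.shutdown()
```

Because it speaks plain HTTP and JSON, curl, socat, or netcat can consume it, and a systemd watchdog or container liveness probe needs nothing beyond what is already on the host.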
> >
> >
> >> Though it would be more than welcome. Currently, to check that a daemon
> >> is alive and well, operators are stuck with:
> >>
> >> - checking with ss if the daemon is correctly connected to a given port
> >> - checking the logs for RabbitMQ and MySQL errors (with something like
> >> Filebeat + Elasticsearch and alarming)
> >>
> >> Clearly, this doesn't scale. When running many large OpenStack clusters,
> >> it is not trivial to have a monitoring system that works and scales. The
> >> effort to deploy such a monitoring system is also not trivial at all. So
> >> what's been discussed at the time for improving the monitoring would be
> >> very much welcome, though not only for the API service: something to
> >> check the health of other daemons would be very much welcome.
> >>
> >> I'd very much like to participate in a Yoga effort to improve the
> >> current situation, and contribute the best I can, though I'm not sure
> >> I'd be the best person to drive this... Is there anyone else willing to
> >> work on this?
> >
> > Yep, I am. Feel free to ping me on IRC: sean-k-mooney, in case you're wondering, but we have talked before.
>
> Yes. Feel free to ping me as well, I'll enjoy contributing where I can
> (though I know you're more skilled than I am in OpenStack's Python
> code... I'll still do what I can).
>
> > I have not configured my default channels since the change to OFTC, but I am always in at least #openstack-nova.
> > After discussing this in the Nova PTG session, the design took a hard right turn: from an RPC-like protocol
> > exposed over a unix socket, with OVOs as the data format and active probes, to an HTTP-based endpoint, available over TCP and/or a unix socket,
> > with JSON as the response format and a semi-global data structure with a TTL for the data.
> >
> > As a result I have had to rethink and rework most of the draft spec I had prepared.
> > The main point of design that we need to agree on is exactly how that data structure is accessed and where it is stored.
> >
> > In the original design I proposed, there was no need to store any kind of state or to modify existing functions to add health checks.
> > Each Nova service manager would just implement a new healthcheck function that would be passed as a callback to the healthcheck manager which exposes the endpoint.
> >
> > With the new approach we will likely add decorators to important functions that will update the health checks based on whether that function completes correctly.
> > If we take the decorator route, then because of how decorators work, it can only access module-level variables, class methods/members, or the parameters of the function it is decorating.
> > What that effectively means is that either the healthcheck manager needs to be stored in a module-level "global" variable, it needs to be a singleton accessible via a class method,
> > or it needs to be stored in a data structure that is passed to almost every function, specifically the context object.
> >
> > I am leaning towards the context object, but I need to understand how that will interact with RPC calls, so it might end up being a global/singleton, which sucks from a unit/functional testing
> > perspective, but we can make it work via fixtures.
> >
> > Hopefully this sounds like good news to you, but feel free to give feedback.
>
> I don't like the fact that we're still having this discussion 1.5 years
> after the proposed patch, and that it still delays Nova following
> what all the other projects have approved.
>
> Again, what you're doing should not be mutually exclusive with adding
> what already works, and what is already in production. That was said a
> year and a half ago, and it's still true. A year and a half ago, we
> even discussed the fact that it would be a shame if it took more than a
> year... So can we move forward?
>
> Anyway, I'm excited that this is moving forward, so thanks again for
> leading this initiative.
>
> Cheers,
>
> Thomas Goirand (zigo)
>


-- 
Mohammed Naser
VEXXHOST, Inc.


