[tc][all] Wallaby Cycle Community Goals

Sean Mooney smooney at redhat.com
Wed Sep 30 22:13:42 UTC 2020


On Wed, 2020-09-30 at 22:08 +0200, Thomas Goirand wrote:
> Hi Sean,
> 
> Thanks for your reply and sharing your concerns. I still don't agree
> with you, and here's why.
> 
> On 9/30/20 12:29 PM, Sean Mooney wrote:
> > On Thu, 2020-09-24 at 22:39 +0200, Thomas Goirand wrote:
> > > I'm still available to attempt the /healthcheck thingy; I kind of
> > > succeeded in all major projects but ... nova. Unfortunately, it was decided
> > > in the project that we should put this on hold until the /healthcheck
> > > can implement more checks than just knowing whether the API is alive. 5 months
> > > on, I believe my original patch [1] should have been approved first
> > > as a first approach. Nova team: any reaction?
> > 
> > I actually don't think the healthcheck endpoint is useful in its current form for any project
> > that has distributed processes like nova, neutron or cinder.
> 
> With my operator's hat on, I assure you that it is very useful to
> wire-up haproxy. This is btw exactly what it was designed for.
> 
> The way haproxy works is that it actually has to perform an HTTP
> connection to check if the web service is alive. Without a specific URL,
> we can't filter out the /healthcheck URL from the saved logs in
> Elasticsearch, which is very annoying.
Can't you just hit the root of the service? That is unauthenticated for microversion
discovery, so haproxy could simply use / for an HTTP check if it's just being used
to test whether the REST API is running.
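
For example, something roughly along these lines in the haproxy config would give you a real
HTTP check against / instead of a bare TCP check (backend name and addresses are made up):

    backend nova_api
        mode http
        # poll the unauthenticated version-discovery document at /
        option httpchk GET /
        http-check expect status 200
        server nova-api-1 192.0.2.10:8774 check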
> 
> Not doing a real HTTP check means one falls back to a TCP check, which
> means that your logs are polluted with so many "client disconnected
> unexpectedly" (that's not the actual message, but that's what it is
> doing) since haproxy otherwise does a TCP connection, then closes it
> before doing what a normal HTTP query would do.
> 
> I've been told to use other URLs, which isn't the way to go. I very much
> think the /healthcheck URL was well designed, and should be used.
> 
> > That was not the only concern raised
> > either, as the default content of the detailed response, which includes package information, was considered
> > a possible security vulnerability. So without agreeing on what kind of info can be returned, its format, and
> > whether this would be an admin-only endpoint or a public endpoint tenants can check, potentially without auth,
> > I don't think we should be proceeding with this as a goal.
> 
> Are you aware that one could simply use "/" of most projects without
> auth, and get the version of the project? Example with Nova:
> 
> curl -s https://api.example.com/compute/ | jq .
> 
> {
>   "versions": [
>     {
>       "id": "v2.0",
>       "status": "SUPPORTED",
>       "version": "",
>       "min_version": "",
>       "updated": "2011-01-21T11:33:21Z",
>       "links": [
>         {
>           "rel": "self",
>           "href": "https://api.example.com/v2/"
>         }
>       ]
>     },
>     {
>       "id": "v2.1",
>       "status": "CURRENT",
>       "version": "2.79",
>       "min_version": "2.1",
>       "updated": "2013-07-23T11:33:21Z",
>       "links": [
>         {
>           "rel": "self",
>           "href": "https://clint1-api.cloud.infomaniak.ch/v2.1/"
>         }
>       ]
>     }
>   ]
> }
> 
Yes, so this is the endpoint I would expect people to use as an alternative to /healthcheck.
I'm not suggesting we do not
> I believe "version": "2.79" is the microversion of the Nova API, which
> therefore, exposes what version of Nova (here: Train). Am I correct?
No, you are not.
It does not expose the package information; it tells you the max microversion the API supports, but
that is a different thing. We don't always bump the microversion in a release.
Ussuri and Victoria share the same microversion:
https://docs.openstack.org/nova/latest/reference/api-microversion-history.html#maximum-in-ussuri-and-victoria

The microversion also won't change on a stable branch, no matter what bugs exist or have been patched.

>  I
> believe we also must leave this, because clients must be able to
> discover the micro-version of the API, right?
Yes, without this no client can determine what API version is supported by a specific cloud.
This is intended to be a public endpoint with no auth for that reason.
> 
> Then, if I'm right, isn't this a much more annoying problem than having
> a /healthcheck URL which could anyway be filtered by an HAProxy in front?
I'm not following; this was intended to be public from the start and does not have the same issue as the
healthcheck API.
> (note: I don't think the above is really a problem, since the
> micro-version doesn't tell if Nova has been patched for security or not...)
Correct, it's not, and it is the endpoint that I would suggest using in the absence of a /healthcheck endpoint until
we can develop one that actually reports the health of the service and not just the API.
> 
> > https://github.com/openstack/oslo.middleware/blob/master/oslo_middleware/healthcheck/__init__.py#L150-L152
> > ^ is concerning from a security point of view.
> 
> This needs to be explicitly enabled in the api-paste.ini, and is
> otherwise not displayed. Here's what I get in my Train deployment with
> the query suggested in the example you gave:
> 
> {
>     "detailed": false,
>     "reasons": []
> }
> 
> The code in here:
> https://github.com/openstack/oslo.middleware/blob/master/oslo_middleware/healthcheck/__init__.py#L387
> 
> shows I'm right. That's also well documented here:
> https://docs.openstack.org/oslo.middleware/latest/reference/healthcheck_plugins.html
> 
> See where it says this:
> 
> # set to True to enable detailed output, False is the default
> detailed = False
> 
> in the api-paste.ini example.
> 
> So here, the point you are making isn't IMO valid.
> 
> > nova still supports configurable middleware through the api-paste.ini file,
> > so until we have a useful health check
> 
> As we discussed earlier, what is *not* useful is any healthcheck that
> would do more than what oslo.middleware does right now without caching
> things, because that /healthcheck URL is typically called multiple times
> per second (in our deployment: at least 6 times per second), so it
> needs to reply fast. So I hope that the super-nice-healthcheck thingy
> won't fry my server CPUs ... :)
What we were thinking was basically that, from the API service that is handling the request, we would
confirm the DB and message bus were accessible, and that the API instance could reach the conductor and maybe the scheduler
services, or assert they were active via a DB check.

If the generic oslo ping RPC was added we could use that, but I think dansmith had a simpler proposal: cache it
based on whether we were able to connect during normal operation, and just have the API check look at the in-memory value.
I.e. if the last attempt to read from the DB failed to connect we would set a global variable, e.g. DB_ACCESSIBLE=False,
and then the next time it succeeded we would set it back to True. The health check would just read the global, so there
should be little to no overhead vs what oslo does.

This would basically cache the last known state, and the health check is just doing the equivalent of
return DB_ACCESSIBLE and RPC_ACCESSIBLE

If detailed=true was set it could do a more advanced check and check the service status etc., which would be a DB query,
but detailed=false is just a memory read of two variables.
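
A rough sketch of that caching idea (purely illustrative, not actual nova code; the flag names
and helper functions are made up just to show the shape of it):

    # module-level flags updated from the normal request-handling code paths
    DB_ACCESSIBLE = True
    RPC_ACCESSIBLE = True

    def record_db_access(succeeded):
        # called after each normal attempt to reach the database:
        # remember whether the last connection actually worked
        global DB_ACCESSIBLE
        DB_ACCESSIBLE = succeeded

    def record_rpc_access(succeeded):
        # same idea for the message bus connection
        global RPC_ACCESSIBLE
        RPC_ACCESSIBLE = succeeded

    def is_healthy(detailed=False):
        if not detailed:
            # detailed=False is just a read of two in-memory flags, so
            # polling it several times a second costs essentially nothing
            return DB_ACCESSIBLE and RPC_ACCESSIBLE
        # detailed=True could additionally query the service records in
        # the DB to assert the conductor and scheduler are up
        return DB_ACCESSIBLE and RPC_ACCESSIBLE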

> 
> What is also *not* useful is delaying such a trivial patch for more
> than 6 months, just in the hope that in a distant future we may have
> something better.

But as you yourself pointed out, almost every service has a / endpoint that is used for microversion discovery and is
public, so not implementing /healthcheck in nova does not block you from using / as the healthcheck URL, and you can
enable the oslo endpoint if you choose to by enabling the middleware in your local deployment.
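
For reference, the oslo.middleware healthcheck docs linked earlier in this thread show roughly this
kind of api-paste.ini configuration (the disable-by-file path here is just illustrative):

    [filter:healthcheck]
    paste.filter_factory = oslo_middleware:Healthcheck.factory
    path = /healthcheck
    backends = disable_by_file
    disable_by_file_path = /etc/nova/healthcheck_disable
    # detailed output stays off unless you explicitly opt in
    detailed = False

The filter then has to be referenced from the appropriate pipeline in the same file.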

> 
> Sure, take your time, get something implemented that does a nice
> healthcheck with db access and rabbitmq connectivity checks. But that
> should in no way get in the path of having a configuration which works
> for everyone by default.
There is nothing stopping install tools from providing that experience by default today.
At least as long as nova supports configurable middleware, they can enable the /healthcheck endpoint, even by default,
without requiring a nova code change. I have looked at enough customer bugs to know that network partitions
are common in real environments, where someone trying to use the /healthcheck endpoint to know if nova is healthy would
be severely disappointed when it says it's healthy and they can't boot any VMs because rabbitmq is not reachable. For
use cases outside of haproxy failover, a bad health check is arguably worse than no health check.

I'm not unsympathetic to your request, but with what oslo does by default we would basically have to document that this
should not be used to monitor the health of the nova service, to preempt the bug reports we would get from customers
related to this. We have already got several bug reports about the status of VMs not matching reality when connectivity to
the cell is down, e.g. when we can't connect to the cell database: if the VM is stopped, say via a power off over SSH, then
its state will not be reflected in a nova show.

If we were willing to add a big warning and clearly call out that this is just saying the API is accessible but not
necessarily functional, then I would be more OK with what oslo provides, but it does not tell you anything about the
health of nova or whether any other API request will actually work.

I would suggest adding this to the nova PTG etherpad if you want to move this forward in nova in particular.
> 
> Cheers,
> 
> Thomas Goirand (zigo)
> 



