[nova][all] Adding /healthcheck support in Nova, and better healthchecks in every project
Hi,

About a year and a half ago, I attempted to add /healthcheck support by default in all projects. For Nova, this resulted in this patch:

https://review.opendev.org/c/openstack/nova/+/724684

For other projects, it's been merged almost everywhere (I'd have to survey all projects to see whether that's the case, or whether I still have Debian-specific patches somewhere).

For Nova, though, this sparked a discussion in which it was said that the current implementation of /healthcheck wasn't good enough. This resulted in threads about how to do it better. Unfortunately, this blocked my patch from being merged in Nova.

In my view, we should recognize a failure here. The /healthcheck URL was added in oslo.middleware so one can use it with something like haproxy to verify that the API is up and responds. It was never designed to check, for example, whether nova-api has valid connectivity to MySQL and RabbitMQ. Yes, that would be welcome, but in the meantime, operators must tweak the default file to have a valid, usable /etc/nova/api-paste.ini.

So I am hereby asking the nova team: can we please move forward and agree that 1.5 years waiting for such a minor patch is too long, and that such a patch should be approved prior to having a better healthcheck mechanism? I don't think it's a good idea to ask Nova users to wait potentially more development cycles to have a good-by-default api-paste.ini file.

At the same time, I am wondering: is anyone even working on a better healthcheck system? I haven't heard that anyone is, though it would be more than welcome. Currently, to check that a daemon is alive and well, operators are stuck with:

- checking with ss if the daemon is correctly connected to a given port
- checking the logs for RabbitMQ and MySQL errors (with something like filebeat + elasticsearch and alarming)

Clearly, this doesn't scale. When running many large OpenStack clusters, it is not trivial to have a monitoring system that works and scales; the effort to deploy such a monitoring system is not trivial at all. So what was discussed at the time for improving the monitoring would be very much welcome, though not only for the API service: something to check the health of the other daemons would be very much welcome too.

I'd very much like to participate in a Yoga effort to improve the current situation and contribute the best I can, though I'm not sure I'd be the best person to drive this... Is there anyone else willing to work on this?

Hoping this message is helpful,
Cheers,

Thomas Goirand (zigo)
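For reference, the tweak being discussed is the usual oslo.middleware wiring in /etc/nova/api-paste.ini. A minimal sketch is below; the disable-file path is an example, and the composite section is abbreviated to the healthcheck-related lines:

    [app:healthcheck]
    paste.app_factory = oslo_middleware:Healthcheck.app_factory
    backends = disable_by_file
    disable_by_file_path = /etc/nova/healthcheck_disable

    [composite:osapi_compute]
    use = call:nova.api.openstack.urlmap:urlmap_factory
    # ... the existing /v2 and /v2.1 mappings stay as they are ...
    /healthcheck: healthcheck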
On Wed, 2021-11-17 at 10:22 +0100, Thomas Goirand wrote:
Hi,
About a year and a half ago, I attempted to add /healthcheck support by default in all projects. For Nova, this resulted in this patch:
https://review.opendev.org/c/openstack/nova/+/724684
For other projects, it's been merged almost everywhere (I'd have to survey all projects to see whether that's the case, or whether I still have Debian-specific patches somewhere).
For Nova, though, this sparked a discussion in which it was said that the current implementation of /healthcheck wasn't good enough. This resulted in threads about how to do it better.
Unfortunately, this blocked my patch from being merged in Nova.
In my view, we should recognize a failure here. The /healthcheck URL was added in oslo.middleware so one can use it with something like haproxy to verify that the API is up and responds. It was never designed to check, for example, whether nova-api has valid connectivity to MySQL and RabbitMQ. Yes, that would be welcome, but in the meantime, operators must tweak the default file to have a valid, usable /etc/nova/api-paste.ini.
So I am hereby asking the nova team:
Can we please move forward and agree that 1.5 years waiting for such a minor patch is too long, and that such a patch should be approved prior to having a better healthcheck mechanism? I don't think it's a good idea to ask Nova users to wait potentially more development cycles to have a good-by-default api-paste.ini file.

I am currently working on an alternative solution for this cycle. I still believe it would be incorrect to add the healthcheck provided by oslo.middleware to Nova. We discussed this at the PTG this cycle and still did not think it was the correct way to approach this, but we did agree to work on adding an alternative form of health checks this cycle. I fundamentally believe that bad healthchecks are worse than no healthchecks, and the oslo.middleware provides bad healthchecks.
Since the /healthcheck endpoint can be added via api-paste.ini manually, I don't think we should add it to our defaults, or that packagers should either. One open question in my draft spec is whether, for the Nova API in particular, we should support /healthcheck on the normal API port instead of the dedicated health check endpoint.
At the same time, I am wondering: is anyone even working on a better healthcheck system? I haven't heard that anyone is working on this.
Yes. I need to push the spec for review; I'll see if I can do that today, or at a minimum this week. The tl;dr is as follows:

Nova will be extended with two additional options to allow a health check endpoint to be exposed on a TCP port and/or a unix socket. These health check endpoints will not be authenticated and will be disabled by default. All Nova binaries (nova-api, nova-scheduler, nova-compute, ...) will support exposing the endpoint.

The process will internally update a healthcheck data structure whenever it performs specific operations that can be used as a proxy for the health of the binary (DB query, RPC ping, request to libvirt); these will be binary-specific.

The overall health will be summarized with a status enum; the exact values are to be determined, but I'm working with (OK, DEGRADED, FAULT) for now. In the degraded and fault states there will also be a message and likely a details field in the response. The message would be human-readable, with details being the actual content of the health check data structure.

I have not decided if I should use HTTP status codes as part of the way to signal the status. My instincts say no: parsing the JSON response should be simple if you just need to check the status field for OK|DEGRADED|FAULT, and using a 5XX error code in the degraded or fault case would not be semantically correct.

The current set of use cases I am using to drive the design of the spec is as follows.

Use Cases
---------

As an operator, I want a simple health check I can consume to know if a Nova process is OK, Degraded or Faulty.

As an operator, I want this health check to not impact performance of the service, so it can be queried frequently at short intervals.

As a deployment tool implementer, I want the health check to be local, with no dependencies on other hosts or services to function, so I can integrate it with service managers such as systemd or container runtimes like Docker.

As a packager, I would like the health check to not require special clients or packages to consume it; curl, socat or netcat should be all that is required to connect to the health check and retrieve the service status.

As an operator, I would like to be able to use the health check of the Nova API and metadata services to manage the membership of endpoints in my load balancer or reverse proxy automatically.
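As an illustration of the response shape described above (purely hypothetical; the field names and check names are assumptions based on this description, not from the spec), a degraded service might answer:

    {
        "status": "DEGRADED",
        "message": "RPC ping has not succeeded recently",
        "details": {
            "db": {"status": "OK", "last_success": "2021-11-17T13:01:02Z"},
            "rpc": {"status": "DEGRADED", "last_success": "2021-11-17T12:55:40Z"}
        }
    }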
It would be more than welcome, though. Currently, to check that a daemon is alive and well, operators are stuck with:

- checking with ss if the daemon is correctly connected to a given port
- checking the logs for RabbitMQ and MySQL errors (with something like filebeat + elasticsearch and alarming)

Clearly, this doesn't scale. When running many large OpenStack clusters, it is not trivial to have a monitoring system that works and scales; the effort to deploy such a monitoring system is not trivial at all. So what was discussed at the time for improving the monitoring would be very much welcome, though not only for the API service: something to check the health of the other daemons would be very much welcome too.

I'd very much like to participate in a Yoga effort to improve the current situation and contribute the best I can, though I'm not sure I'd be the best person to drive this... Is there anyone else willing to work on this?
Yep, I am. Feel free to ping me on IRC: sean-k-mooney, in case you're wondering, but we have talked before. I have not configured my default channels since the change to OFTC, but I'm always in at least #openstack-nova.

After discussing this in the Nova PTG session, the design took a hard right turn: from being based on an RPC-like protocol exposed over a unix socket, with OVOs as the data format and active probes, to an HTTP-based endpoint, available over TCP and/or a unix socket, with JSON as the response format and a semi-global data structure with a TTL for the data. As a result, I have had to rethink and rework most of the draft spec I had prepared. The main point of design that we need to agree on is exactly how that data structure is accessed and where it is stored.

In the original design I proposed, there was no need to store any kind of state or to modify existing functions to add healthchecks: each Nova service manager would just implement a new healthcheck function that would be passed as a callback to the healthcheck manager which exposed the endpoint. With the new approach, we will likely add decorators to important functions that will update the healthchecks based on whether that function completes correctly. If we take the decorator approach, then because of how decorators work, a decorator can only access module-level variables, class methods/members, or the parameters of the function it is decorating. What that effectively means is that the health check manager either needs to be stored in a module-level "global" variable, needs to be a singleton accessible via a class method, or needs to be stored in a data structure that is passed to almost every function, specifically the context object. I am leaning towards the context object, but I need to understand how that will interact with RPC calls, so it might end up being a global/singleton, which sucks from a unit/functional testing perspective, but we can make it work via fixtures.

Hopefully this sounds like good news to you, but feel free to give feedback.
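To make the trade-off concrete, here is a minimal sketch of the module-level "global" option described above. All names are illustrative assumptions, not from the draft spec:

    import functools
    import time

    # Hypothetical module-level health state shared by all decorated functions.
    _HEALTH = {}

    def updates_health(check_name):
        """Record success/failure of the wrapped call as a health proxy."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                entry = _HEALTH.setdefault(
                    check_name, {"status": "OK", "last_success": None})
                try:
                    result = func(*args, **kwargs)
                except Exception:
                    entry["status"] = "FAULT"
                    raise
                entry["status"] = "OK"
                entry["last_success"] = time.time()
                return result
            return wrapper
        return decorator

    @updates_health("db")
    def instance_get_all():
        ...  # a DB query whose success is used as a proxy for DB health

The context-object variant would instead thread the health structure through the decorated function's arguments, which avoids the global but touches far more call sites.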
Hoping this message is helpful, Cheers,
Thomas Goirand (zigo)
Hi Sean, thanks for your reply! On 11/17/21 2:13 PM, Sean Mooney wrote:
I am currently working on an alternative solution for this cycle.
gr8!
I still believe it would be incorrect to add the healthcheck provided by oslo.middleware to Nova. We discussed this at the PTG this cycle and still did not think it was the correct way to approach this, but we did agree to work on adding an alternative form of health checks this cycle. I fundamentally believe that bad healthchecks are worse than no healthchecks, and the oslo.middleware provides bad healthchecks.
The current implementation is only useful for plugging haproxy into APIs, nothing more, nothing less.
Since the /healthcheck endpoint can be added via api-paste.ini manually, I don't think we should add it to our defaults, or that packagers should either.
Like it or not, the current state of things is:

- /healthcheck is activated everywhere (I patched that myself)
- The nova package, at least in Debian, has it activated by default (as this is the only project that refused the patch, I carry it in the package).

Also, many operators already use /healthcheck in production, so you really want to keep it. IMO, your implementation should switch to a different endpoint if you wish not to retain compatibility with the older system.

For this reason, I strongly believe that the Nova team should revise its view from a year and a half ago and accept the imperfect, currently implemented /healthcheck. This is not mutually exclusive with a better implementation bound to some other URL.
One open question in my draft spec is whether, for the Nova API in particular, we should support /healthcheck on the normal API port instead of the dedicated health check endpoint.
You should absolutely not break backward compatibility!!!
Yes. I need to push the spec for review; I'll see if I can do that today, or at a minimum this week. The tl;dr is as follows:

Nova will be extended with two additional options to allow a health check endpoint to be exposed on a TCP port and/or a unix socket. These health check endpoints will not be authenticated and will be disabled by default. All Nova binaries (nova-api, nova-scheduler, nova-compute, ...) will support exposing the endpoint.

The process will internally update a healthcheck data structure whenever it performs specific operations that can be used as a proxy for the health of the binary (DB query, RPC ping, request to libvirt); these will be binary-specific.

The overall health will be summarized with a status enum; the exact values are to be determined, but I'm working with (OK, DEGRADED, FAULT) for now. In the degraded and fault states there will also be a message and likely a details field in the response. The message would be human-readable, with details being the actual content of the health check data structure.

I have not decided if I should use HTTP status codes as part of the way to signal the status. My instincts say no: parsing the JSON response should be simple if you just need to check the status field for OK|DEGRADED|FAULT, and using a 5XX error code in the degraded or fault case would not be semantically correct.
All you wrote above is great. As for the HTTP status codes, please implement them: it's cheap, it's how Zabbix (and probably other monitoring systems) works, and everyone understands them.
Use Cases
---------

As an operator, I want a simple health check I can consume to know if a Nova process is OK, Degraded or Faulty.

As an operator, I want this health check to not impact performance of the service, so it can be queried frequently at short intervals.

As a deployment tool implementer, I want the health check to be local, with no dependencies on other hosts or services to function, so I can integrate it with service managers such as systemd or container runtimes like Docker.

As a packager, I would like the health check to not require special clients or packages to consume it; curl, socat or netcat should be all that is required to connect to the health check and retrieve the service status (see the sketch after this list).

As an operator, I would like to be able to use the health check of the Nova API and metadata services to manage the membership of endpoints in my load balancer or reverse proxy automatically.
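Assuming an operator-chosen TCP port and unix socket path (both made up here, since the spec is still a draft), the packager use case boils down to something like:

    # over TCP
    curl -s http://127.0.0.1:8999/
    # over a unix socket (curl >= 7.40)
    curl -s --unix-socket /run/nova/nova-compute.health http://localhost/
    # with socat only, no HTTP client at all
    printf 'GET / HTTP/1.0\r\n\r\n' | socat - UNIX-CONNECT:/run/nova/nova-compute.health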
It would be more than welcome, though. Currently, to check that a daemon is alive and well, operators are stuck with:

- checking with ss if the daemon is correctly connected to a given port
- checking the logs for RabbitMQ and MySQL errors (with something like filebeat + elasticsearch and alarming)

Clearly, this doesn't scale. When running many large OpenStack clusters, it is not trivial to have a monitoring system that works and scales; the effort to deploy such a monitoring system is not trivial at all. So what was discussed at the time for improving the monitoring would be very much welcome, though not only for the API service: something to check the health of the other daemons would be very much welcome too.

I'd very much like to participate in a Yoga effort to improve the current situation and contribute the best I can, though I'm not sure I'd be the best person to drive this... Is there anyone else willing to work on this?
Yep, I am. Feel free to ping me on IRC: sean-k-mooney, in case you're wondering, but we have talked before.
Yes. Feel free to ping me as well; I'll enjoy contributing where I can (though I know you're more skilled than I am in OpenStack's Python code... I'll still do what I can).
I have not configured my default channels since the change to OFTC, but I'm always in at least #openstack-nova.

After discussing this in the Nova PTG session, the design took a hard right turn: from being based on an RPC-like protocol exposed over a unix socket, with OVOs as the data format and active probes, to an HTTP-based endpoint, available over TCP and/or a unix socket, with JSON as the response format and a semi-global data structure with a TTL for the data.

As a result, I have had to rethink and rework most of the draft spec I had prepared. The main point of design that we need to agree on is exactly how that data structure is accessed and where it is stored.

In the original design I proposed, there was no need to store any kind of state or to modify existing functions to add healthchecks: each Nova service manager would just implement a new healthcheck function that would be passed as a callback to the healthcheck manager which exposed the endpoint.

With the new approach, we will likely add decorators to important functions that will update the healthchecks based on whether that function completes correctly. If we take the decorator approach, then because of how decorators work, a decorator can only access module-level variables, class methods/members, or the parameters of the function it is decorating. What that effectively means is that the health check manager either needs to be stored in a module-level "global" variable, needs to be a singleton accessible via a class method, or needs to be stored in a data structure that is passed to almost every function, specifically the context object.

I am leaning towards the context object, but I need to understand how that will interact with RPC calls, so it might end up being a global/singleton, which sucks from a unit/functional testing perspective, but we can make it work via fixtures.

Hopefully this sounds like good news to you, but feel free to give feedback.
I don't like the fact that we're still having this discussion 1.5 years after the proposed patch, and that it still delays Nova following what all the other projects have approved. Again, what you're doing should not be mutually exclusive with adding what already works and what is already in production. That was said a year and a half ago, and it's still true. A year and a half ago, we even discussed the fact that it would be a shame if it took more than a year... So can we move forward?

Anyway, I'm excited that this is going forward, so thanks again for leading this initiative.

Cheers,

Thomas Goirand (zigo)
I don't think we rely on /healthcheck -- there's nothing healthy about an API endpoint blindly returning a 200 OK. You might as well just hit / and accept 300 as a code, and that's exactly the same behaviour. I support what Sean is bringing up here, and I don't think it makes sense to have a noop /healthcheck that always gives a 200 OK... seems a bit useless imho.
-- Mohammed Naser VEXXHOST, Inc.
I don't think we rely on /healthcheck -- there's nothing healthy about an API endpoint blindly returning a 200 OK.
You might as well just hit / and accept 300 as a code and that's exactly the same behaviour. I support what Sean is bringing up here and I don't think it makes sense to have a noop /healthcheck that always gives a 200 OK...seems a bit useless imho
Yup, totally agree. Our previous concern, that a healthcheck which checked all of Nova would return too much info to be useful (for something trying to figure out whether an individual worker is healthy), applies in reverse to one that returns too little to be useful. I agree that what Sean is working on is the right balance and that we should focus on it. --Dan
On 11/17/21 10:54 PM, Dan Smith wrote:
I don't think we rely on /healthcheck -- there's nothing healthy about an API endpoint blindly returning a 200 OK.
You might as well just hit / and accept 300 as a code and that's exactly the same behaviour. I support what Sean is bringing up here and I don't think it makes sense to have a noop /healthcheck that always gives a 200 OK...seems a bit useless imho
Yup, totally agree. Our previous concern, that a healthcheck which checked all of Nova would return too much info to be useful (for something trying to figure out whether an individual worker is healthy), applies in reverse to one that returns too little to be useful.

I agree that what Sean is working on is the right balance and that we should focus on it.
--Dan
That's not the only thing it does. It is also capable of being disabled, which is useful for maintenance: one can gracefully remove an API node from the pool this way, which one cannot do with the root URL.

Cheers,

Thomas Goirand (zigo)
On Wed, Nov 17, 2021 at 5:52 PM Thomas Goirand <zigo@debian.org> wrote:
That's not the only thing it does. It is also capable of being disabled, which is useful for maintenance: one can gracefully remove an API node from the pool this way, which one cannot do with the root URL.
I feel like this should be handled by whatever layer needs to drain requests for maintenance; otherwise it might just be the same as turning off the service, no?
-- Mohammed Naser VEXXHOST, Inc.
On 11/18/21 2:03 AM, Mohammed Naser wrote:
On Wed, Nov 17, 2021 at 5:52 PM Thomas Goirand <zigo@debian.org> wrote:
That's not the only thing it does. It is also capable of being disabled, which is useful for maintenance: one can gracefully remove an API node from the pool this way, which one cannot do with the root URL.
I feel like this should be handled by whatever layer needs to drain requests for maintenance; otherwise it might just be the same as turning off the service, no?
It's not the same. If you just turn off the service, there may well be some requests attempted against the API before it's seen as down. The idea here is to declare the API as down, so that haproxy can remove it from the pool *before* the service is really turned off. That's what the oslo.middleware disable file helps with, which the root URL cannot do.

Cheers,

Thomas Goirand (zigo)
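Concretely, the drain pattern zigo describes usually looks something like the sketch below (the backend name, address and disable-file path are illustrative examples, not from any real deployment):

    # haproxy.cfg: poll /healthcheck so a failing node leaves the pool
    backend nova_api
        option httpchk GET /healthcheck
        http-check expect status 200
        server api1 10.0.0.11:8774 check inter 2s fall 2 rise 3

    # on the API node, before stopping nova-api:
    touch /etc/nova/healthcheck_disable   # /healthcheck starts returning 503
    # wait for haproxy to mark api1 DOWN, then stop the service safely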
As Mohammed said, you can actually do the exact same thing in haproxy by setting the server in the backend to drain, which would be the same just the opposite way around. That is, "set server <backend>/<server> state drain" over the haproxy admin socket.

I really welcome Sean's proposal for a real healthcheck framework that would actually tell you that something is not working, instead of trying to find, for example, RabbitMQ connection issues in the logs; it really is a pain. I wouldn't want a "real" healthcheck that does all these things exposed on the public API, though, and I think Sean's proposal is correct and does not break backward compatibility, since the oslo.middleware healthcheck will still be there.

Best regards
Tobias
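For reference, the runtime command Tobias quotes can be issued like this (the admin socket path is deployment-specific):

    echo "set server nova_api/api1 state drain" | socat stdio /var/run/haproxy.sock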
---- On Wed, 17 Nov 2021 15:54:49 -0600 Dan Smith <dms@danplanet.com> wrote ----
Yup, totally agree. Our previous concern, that a healthcheck which checked all of Nova would return too much info to be useful (for something trying to figure out whether an individual worker is healthy), applies in reverse to one that returns too little to be useful.
True. We can see an example in this old patch: PS1 tried to implement all of the Nova DB, MQ, and services healthchecks and ended up with a lot of info and a time-consuming process - https://review.opendev.org/c/openstack/nova/+/731396/1 - and PS2 then based it on RPC call success - https://review.opendev.org/c/openstack/nova/+/731396/2

I agree on the point that healthchecks should be 'very confirmed things saying it is healthy'; otherwise it just solves the haproxy use case, and all the other use cases will consider this a bad healthcheck, which is the current state of the oslo middleware.

-gmann
I agree that what Sean is working on is the right balance and that we should focus on it.
--Dan
participants (6)
- Dan Smith
- Ghanshyam Mann
- Mohammed Naser
- Sean Mooney
- Thomas Goirand
- Tobias Urdin