[self-healing-sig] best practices for haproxy health checking
Hi,

Does anyone have a good pointer to good health checks to be used by the frontend API haproxy load balancer?

In one case I am looking at right now, the entry haproxy load balancer was not able to detect that a particular backend was not responding to API requests, so it flipped up and down repeatedly, causing intermittent spurious 503 errors.

The backend was able to respond to connections and to basic HTTP GET requests (e.g. / or even /v3 as path), but when it got a "real" query it hung. The reason, as it turned out, was that the caching backend (memcached) configured on that machine had locked up (due to some other bug).

I wonder if there is a better way to check whether a backend is "working", and what the best practices around this are. One thought I had was to do the backend check via a separate healthcheck-specific port running a custom daemon that performs more sophisticated checks, e.g. detecting node-wide errors such as memcached, the database, or rabbitmq being unavailable on that node, and refusing API traffic until that is resolved.

Any pointers to read up on / best practices appreciated.

Thanks,
Dirk
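[Editor's note: the setup Dirk describes maps onto haproxy's built-in HTTP health checks combined with an agent check against a side-channel daemon. The sketch below is illustrative only: the backend name, server addresses, ports, and the /healthcheck path are hypothetical, not taken from the thread.]

```
# Sketch: HTTP health check against a dedicated endpoint, plus an
# agent-check that consults a custom daemon on a separate port which
# can report up/down based on deeper dependency checks.
backend api
    option httpchk GET /healthcheck
    http-check expect status 200
    server node1 192.0.2.11:5000 check inter 2s fall 3 rise 2 agent-check agent-port 9700 agent-inter 5s
```

With `agent-check`, a "down" reply from the daemon takes the server out of rotation even while its HTTP socket still answers trivial requests, which addresses exactly the hung-memcached failure mode described above.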
On 1/11/19 11:11 AM, Dirk Müller wrote:
> Hi,
>
> Does anyone have a good pointer to good health checks to be used by the frontend API haproxy load balancer?
>
> In one case I am looking at right now, the entry haproxy load balancer was not able to detect that a particular backend was not responding to API requests, so it flipped up and down repeatedly, causing intermittent spurious 503 errors.
>
> The backend was able to respond to connections and to basic HTTP GET requests (e.g. / or even /v3 as path), but when it got a "real" query it hung. The reason, as it turned out, was that the caching backend (memcached) configured on that machine had locked up (due to some other bug).
>
> I wonder if there is a better way to check whether a backend is "working", and what the best practices around this are. One thought I had was to do the backend check via a separate healthcheck-specific port running a custom daemon that performs more sophisticated checks, e.g. detecting node-wide errors such as memcached, the database, or rabbitmq being unavailable on that node, and refusing API traffic until that is resolved.
A very similar thing has been proposed: https://review.openstack.org/#/c/531456/

It also came up as a possible community goal for Train: http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000558....

But to my knowledge no one has stepped forward to drive the work. It seems to be something people generally agree we need, but nobody has time to do. :-(
> Any pointers to read up on / best practices appreciated.
>
> Thanks,
> Dirk
Ben Nemec <openstack@nemebean.com> wrote:
> On 1/11/19 11:11 AM, Dirk Müller wrote:
>> Does anyone have a good pointer to good health checks to be used by the frontend API haproxy load balancer?
Great question, thanks ;-) This is exactly the kind of discussion I believe is worth encouraging within the self-healing SIG context.
>> In one case I am looking at right now, the entry haproxy load balancer was not able to detect that a particular backend was not responding to API requests, so it flipped up and down repeatedly, causing intermittent spurious 503 errors.
>>
>> The backend was able to respond to connections and to basic HTTP GET requests (e.g. / or even /v3 as path), but when it got a "real" query it hung. The reason, as it turned out, was that the caching backend (memcached) configured on that machine had locked up (due to some other bug).
>>
>> I wonder if there is a better way to check whether a backend is "working", and what the best practices around this are. One thought I had was to do the backend check via a separate healthcheck-specific port running a custom daemon that performs more sophisticated checks, e.g. detecting node-wide errors such as memcached, the database, or rabbitmq being unavailable on that node, and refusing API traffic until that is resolved.
> A very similar thing has been proposed: https://review.openstack.org/#/c/531456/
This is definitely relevant, although it's a slightly different approach to the same problem, where the backend API service itself would perform checks internally, rather than relying on something external to it evaluating its health. IMHO the former makes slightly more sense, because the API service knows exactly what its dependencies are and can easily check the health of things like a database connection. Having said that, of course there is also benefit to black-box monitoring.
> It also came up as a possible community goal for Train: http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000558....
Right. Here's the story: https://storyboard.openstack.org/#!/story/2001439

IIRC, the latest consensus reached in Denver included the following points:

- We should initially do the simplest thing which could possibly work.
- Each API should only perform shallow health checks on its dependencies (e.g. nova-api shouldn't perform extensive functional checks on other nova services), but deeper health checks on its internals are fine (e.g. that it can reach the database / message queue / memcached). Then we can use Vitrage for root cause analysis.

I would like to suggest one immediate concrete action we should take on this particular haproxy scenario, which is to submit a corresponding use case to the self-healing SIG doc repo. This should help share any existing best practices (or gaps thereof) across the whole community, as a starting point which anyone is welcome to jump on board. I'm happy to do this, or since I happen to be in the same office as Dirk for the rest of this week, maybe we can even co-author it together :-)
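[Editor's note: the "shallow dependency check" idea above can be sketched in a few lines of Python using only the standard library. The endpoint port and the dependency addresses are hypothetical placeholders, not from the thread; a real service would wire equivalent checks into its own API or middleware.]

```python
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical dependency endpoints for this node; a real deployment
# would read these from service configuration.
DEPENDENCIES = {
    "memcached": ("127.0.0.1", 11211),
    "database": ("127.0.0.1", 3306),
}


def check_tcp(host, port, timeout=1.0):
    """Shallow check: return True if a TCP connection succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class HealthHandler(BaseHTTPRequestHandler):
    """Respond 200 if every dependency is reachable, else 503."""

    def do_GET(self):
        failed = [name for name, (host, port) in DEPENDENCIES.items()
                  if not check_tcp(host, port)]
        status = 503 if failed else 200
        body = (", ".join(failed) + " unreachable" if failed else "OK").encode()
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Serve health checks on a dedicated side-channel port (placeholder 9700),
    # which a load balancer's HTTP check can then probe.
    HTTPServer(("0.0.0.0", 9700), HealthHandler).serve_forever()
```

Because the daemon only tests connectivity (not functional behaviour of other services), it stays cheap enough to run at load-balancer check intervals while still catching the locked-up-memcached case from the original post.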
> But to my knowledge no one has stepped forward to drive the work. It seems to be something people generally agree we need, but nobody has time to do. :-(
I'm actually very enthusiastic about the idea of taking this on myself, but cannot promise anything until I've had the relevant conversations with my employer this week ...
participants (3)

- Adam Spiers
- Ben Nemec
- Dirk Müller