[self-healing-sig][api-sig][train] best practices for haproxy health checking
Adam Spiers
aspiers at suse.com
Tue Jan 15 17:01:02 UTC 2019
Ben Nemec <openstack at nemebean.com> wrote:
>On 1/11/19 11:11 AM, Dirk Müller wrote:
>>Does anyone have a good pointer for good healthchecks to be used by
>>the frontend api haproxy loadbalancer?
Great question, thanks ;-) This is exactly the kind of discussion
I believe is worth encouraging within the self-healing SIG context.
>>In one case I am looking at right now, the front-end haproxy
>>load balancer was unable to detect that a particular backend had
>>stopped responding to API requests, so the backend flapped up and
>>down repeatedly, causing intermittent spurious 503 errors.
>>
>>The backend was able to respond to connections and to basic HTTP GET
>>requests (e.g. with / or even /v3 as the path), but when it got a
>>"real" query it hung. The reason, as it turned out, was that the
>>caching backend configured on that machine (memcached) had locked up
>>(due to some other bug).
>>
>>I wonder if there is a better way to check whether a backend is
>>"working", and what the best practices around this are. One thought
>>I had was to do the backend check via a separate healthcheck-specific
>>port served by a custom daemon which performs more sophisticated
>>checks, e.g. detecting node-wide failures such as memcached, the
>>database, or rabbitmq being unavailable on that node, so that the
>>node accepts no API traffic until those are resolved.
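For illustration, wiring haproxy up to such a dedicated check port
might look roughly like this (a sketch only: the backend name,
addresses, port, and thresholds are all made up, and the daemon on
port 8888 is hypothetical):

    backend keystone_api
        # Probe the hypothetical health daemon on port 8888 rather
        # than the API port itself; it would return 200 only when the
        # node's local dependencies (memcached, database, rabbitmq)
        # respond.
        option httpchk GET /healthcheck
        default-server inter 5s fall 3 rise 2
        server node1 192.0.2.11:5000 check port 8888
        server node2 192.0.2.12:5000 check port 8888

Tuning fall/rise like this also damps the kind of up/down flapping
described above, since a backend only changes state after several
consecutive checks agree.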
>
>A very similar thing has been proposed:
>https://review.openstack.org/#/c/531456/
This is definitely relevant, although it's a slightly different
approach to the same problem: there the backend API service itself
would perform checks internally, rather than relying on something
external to it evaluating its health. IMHO the internal approach
makes slightly more sense, because the API service knows exactly what
its dependencies are and can easily check the health of things like a
database connection. Having said that, there is of course also
benefit to black-box monitoring.
>It also came up as a possible community goal for Train: http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000558.html
Right. Here's the story:
https://storyboard.openstack.org/#!/story/2001439
IIRC, the latest consensus reached in Denver included the following
points:
- We should initially do the simplest thing that could possibly
work.
- Each API should only perform shallow health checks on its
dependencies (e.g. nova-api shouldn't perform extensive functional
checks on other nova services), but deeper health checks on its
internals are fine (e.g. that it can reach the database / message
queue / memcached). Then we can use Vitrage for root cause
analysis. (A rough sketch of such a shallow check follows below.)
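By way of illustration, such a shallow check endpoint could be as
simple as the following (purely a hypothetical sketch: the port and
the TCP-level reachability probes are my own assumptions, not what
any OpenStack service actually ships):

    # Minimal standalone healthcheck endpoint sketch (hypothetical).
    import socket
    from wsgiref.simple_server import make_server

    def _can_connect(host, port, timeout=1.0):
        """Shallow reachability check: can we open a TCP connection?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def healthcheck_app(environ, start_response):
        """WSGI app returning 200 only if local dependencies respond."""
        checks = {
            "memcached": _can_connect("127.0.0.1", 11211),
            "database": _can_connect("127.0.0.1", 3306),
            "rabbitmq": _can_connect("127.0.0.1", 5672),
        }
        status = ("200 OK" if all(checks.values())
                  else "503 Service Unavailable")
        body = "\n".join("%s: %s" % (name, "ok" if ok else "failed")
                         for name, ok in checks.items()).encode()
        start_response(status, [("Content-Type", "text/plain"),
                                ("Content-Length", str(len(body)))])
        return [body]

    if __name__ == "__main__":
        # Serve on the (made-up) dedicated check port from the haproxy
        # example above.
        make_server("", 8888, healthcheck_app).serve_forever()

Note that a bare TCP connect is deliberately shallow; it would likely
not have caught Dirk's locked-up memcached, so in practice the
memcached probe would want to issue an actual get/set with a short
timeout.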
I would like to suggest one immediate concrete action we could take
on this particular haproxy scenario: submit a corresponding use case
to the self-healing SIG doc repo. This should help share any existing
best practices (or highlight the gaps) across the whole community,
and provide a starting point that anyone is welcome to build on. I'm
happy to do this, or since I happen to be in the same office as Dirk
for the rest of this week, maybe we can even co-author it together
:-)
>But to my knowledge no one has stepped forward to drive the work. It
>seems to be something people generally agree we need, but nobody has
>time to do. :-(
I'm actually very enthusiastic about the idea of taking this on
myself, but cannot promise anything until I've had the relevant
conversations with my employer this week ...