[self-healing-sig][api-sig][train] best practices for haproxy health checking

Adam Spiers aspiers at suse.com
Tue Jan 15 17:01:02 UTC 2019


Ben Nemec <openstack at nemebean.com> wrote: 
>On 1/11/19 11:11 AM, Dirk Müller wrote: 
>>Does anyone have a good pointer to recommended healthchecks to be 
>>used by the frontend API haproxy loadbalancer? 

Great question, thanks ;-)  This is exactly the kind of discussion 
I believe is worth encouraging within the self-healing SIG context. 

>>In one case that I am looking at right now, the entry-point haproxy 
>>loadbalancer was not able 
>>to detect that a particular backend had stopped responding to API 
>>requests, so it flipped up and down repeatedly, causing intermittent 
>>spurious 503 errors. 
>>
>>The backend was able to respond to connections and to basic HTTP GET 
>>requests (e.g. with / or even /v3 as the path), but when it got a 
>>"real" query it hung. The reason, as it turned out, was that the 
>>configured caching backend (memcached) on that machine was locked 
>>up (due to some other bug). 
>>
>>I wonder if there is a better way to check whether a backend is 
>>"working", and what the best practices around this are. One thought 
>>I had was to do the backend check via a separate healthcheck-specific 
>>port served by a custom daemon that performs more sophisticated 
>>checks, e.g. detecting node-wide problems such as memcached, the 
>>database or rabbitmq being unavailable on that node, and refusing 
>>API traffic until those are resolved. 
>
>A very similar thing has been proposed: 
>https://review.openstack.org/#/c/531456/ 

This is definitely relevant, although it's a slightly different 
approach to the same problem: there the backend API service itself 
performs the checks internally, rather than relying on something 
external to it evaluating its health.  IMHO the internal approach 
makes slightly more sense, because the API service knows exactly what 
its dependencies are and can easily check the health of things like 
its database connection.  Having said that, there is of course also 
value in black-box monitoring. 
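
On the haproxy (black-box) side, here is a rough sketch of how Dirk's 
idea of a separate healthcheck port could be wired up.  To be clear, 
the backend name, addresses, the /healthcheck path and port 8080 are 
all made up for illustration and assume a hypothetical healthcheck 
daemon listening on each backend node: 

    backend keystone_api
        mode http
        # Probe a dedicated healthcheck URL instead of relying on a
        # bare TCP connect or "GET /" succeeding.
        option httpchk GET /healthcheck
        http-check expect status 200
        # Send the probe to the separate healthcheck port rather than
        # to the API workers themselves.
        server api1 192.0.2.11:5000 check port 8080 inter 5s fall 3 rise 2
        server api2 192.0.2.12:5000 check port 8080 inter 5s fall 3 rise 2

That way haproxy only keeps a backend in rotation while the healthcheck 
daemon reports the node as actually able to serve requests, not merely 
able to accept TCP connections. 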

>It also came up as a possible community goal for Train: http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000558.html 

Right.  Here's the story: 

    https://storyboard.openstack.org/#!/story/2001439

IIRC, the latest consensus reached in Denver included the following 
points: 

  - We should initially do the simplest thing which could possibly
    work.

  - Each API should only perform shallow health checks on its
    dependencies (e.g. nova-api shouldn't perform extensive functional
    checks on other nova services), but deeper health checks on its
    internals are fine (e.g. that it can reach the database / message
    queue / memcached).  Then we can use Vitrage for root cause
    analysis.  (A rough sketch of such a shallow check follows this
    list.)
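
Here is a rough, purely illustrative sketch (not taken from the review 
above or from any existing project) of the kind of healthcheck daemon 
Dirk described.  It answers GET with 200 only if the node's local 
dependencies accept TCP connections; the dependency list, addresses 
and port 8080 are assumptions: 

    import socket
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # (host, port) pairs for the services this API node depends on.
    # Purely illustrative; adjust to whatever the node actually uses.
    DEPENDENCIES = {
        "memcached": ("127.0.0.1", 11211),
        "database": ("127.0.0.1", 3306),
        "rabbitmq": ("127.0.0.1", 5672),
    }

    def tcp_ok(host, port, timeout=2.0):
        # Shallow check: does the service accept a TCP connection in time?
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            failed = [name for name, addr in DEPENDENCIES.items()
                      if not tcp_ok(*addr)]
            body = ("FAIL: " + ", ".join(failed) if failed else "OK").encode()
            # 503 makes haproxy take this backend out of rotation.
            self.send_response(503 if failed else 200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        # Listen on the dedicated healthcheck port that haproxy probes.
        HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()

A bare TCP connect obviously wouldn't have caught Dirk's locked-up 
memcached (which was presumably still accepting connections), so in 
practice each check would want to exercise the dependency a little 
harder (e.g. a memcached get or a trivial database query with a short 
timeout), but the shape would stay the same. 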

I would like to suggest one immediate concrete action we could take 
on this particular haproxy scenario: submit a corresponding use case 
to the self-healing SIG doc repo.  That should help share any existing 
best practices (or highlight the gaps) across the whole community, and 
provide a starting point which anyone is welcome to build on.  I'm 
happy to do this, or since I happen to be in the same office as Dirk 
for the rest of this week, maybe we can even co-author it together :-) 

>But to my knowledge no one has stepped forward to drive the work. It 
>seems to be something people generally agree we need, but nobody has 
>time to do. :-( 

I'm actually very enthusiastic about the idea of taking this on 
myself, but cannot promise anything until I've had the relevant 
conversations with my employer this week ... 


