[Openstack-operators] Should Healthcheck URLs for services be enabled by default?

Andy Botting andy at andybotting.com
Tue May 24 06:07:03 UTC 2016


Thanks to Simon, Josh and Kris who replied to my last email about the
healthcheck middleware - these checks are now working well for us.

I'm sure there are plenty of operators, like us, who didn't know this
existed.

Is there any reason why they're not enabled by default?

cheers,
Andy

On 30 April 2016 at 11:52, Joshua Harlow <harlowja at fastmail.com> wrote:

> This can help you more easily view what the healthcheck middleware can
> also show (especially in detailed mode); it can expose thread stacks and
> other internals, which can be useful for debugging stuck servers
> (similar in concept to apache mod_status).
>
> https://review.openstack.org/#/c/311482/
>
> Run the above review like:
>
> $ python oslo_middleware/healthcheck/ -p 8000
>
> Then open a browser to http://127.0.0.1:8000/ (or other port).
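>
> You can also poke it from the command line, for example (assuming the
> standalone server above is listening on port 8000):
>
> $ curl -i http://127.0.0.1:8000/
> $ curl -i -H "Accept: application/json" http://127.0.0.1:8000/
>
> A 200 response means the checks passed; a 503 means a backend reported
> the service as unavailable (e.g. the disable-by-file flag is set).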
>
> -Josh
>
>
> Joshua Harlow wrote:
>
>> Yup, that healthcheck middleware was made more advanced by me.
>>
>> If you need to do anything special with it, let me know and I can help
>> make that possible (or at least explain what might need to be changed
>> to do that).
>>
>> Simon Pasquier wrote:
>>
>>> Hi,
>>>
>>> On Thu, Apr 28, 2016 at 5:13 AM, Andy Botting <andy at andybotting.com> wrote:
>>>
>>> We're running our services clustered behind an F5 load balancer in
>>> production, and haproxy in our testing environment. This setup works
>>> quite well for us, but I'm not that happy with how we test the health
>>> of our endpoints.
>>>
>>> We're currently calling basic URLs like / or /v2, etc. Some services
>>> return a 200, some return other codes like 401, and our healthcheck
>>> test simply checks for whichever HTTP code each service is expected
>>> to return. This works OK and does catch basic service failure.
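>>>
>>> (For illustration, a minimal haproxy check along those lines - assuming
>>> the keystone public API on port 5000 and made-up server names - might
>>> look like:
>>>
>>> # inside the relevant backend/listen section
>>> option httpchk GET /v3
>>> http-check expect status 200
>>> server keystone-1 10.0.0.11:5000 check
>>>
>>> i.e. it only confirms that the endpoint answers with the expected
>>> status code, not that the service can do any real work.)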
>>>
>>> Our test environment is on flaky hardware and often fails in strange
>>> ways: sometimes the port is open and basic URLs work, but real API
>>> calls fail and time out, so our checks fall down here.
>>>
>>> In a previous role, the developers added a URL (e.g. /healthcheck) to
>>> each web application which went through and tested things like whether
>>> the DB connection was OK and memcached was accessible, and returned a
>>> 200. This worked out really well for operations. I haven't seen
>>> anything like this for OpenStack.
>>>
>>>
>>> There's a healthcheck oslo.middleware plugin [1] available, so you
>>> could configure the service pipeline to include it. It won't exercise
>>> the DB connection, RabbitMQ connection, and so on, but it would help
>>> if you want to kick a service instance out of the load-balancer
>>> without stopping the service completely [2].
>>>
>>> [1]
>>>
>>> http://docs.openstack.org/developer/oslo.middleware/healthcheck_plugins.html
>>>
>>> [2]
>>>
>>> http://docs.openstack.org/developer/oslo.middleware/healthcheck_plugins.html#disable-by-file
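>>>
>>> Roughly, enabling it in a service's api-paste.ini looks like this (a
>>> sketch only - the file path and names are illustrative, and [1] has
>>> the authoritative options):
>>>
>>> [filter:healthcheck]
>>> paste.filter_factory = oslo_middleware:Healthcheck.factory
>>> backends = disable_by_file
>>> disable_by_file_path = /etc/keystone/healthcheck_disable
>>>
>>> with "healthcheck" then added at the front of the service's pipeline.
>>> Touching the disable_by_file_path file makes the healthcheck URL
>>> return 503 so the load-balancer drains that node; removing the file
>>> brings it back [2].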
>>>
>>>
>>> I'm wondering how everyone else does healthchecking of their
>>> clustered services, and whether they think adding a dedicated
>>> healthcheck URL would be beneficial?
>>>
>>>
>>> From what I can tell, people are doing the same thing as you: check
>>> that a well-known location ('/', '/v2' or similar) returns the
>>> expected code and hope that it will work for real user requests too.
>>>
>>> Simon
>>>
>>>
>>> We do use scripts similar to the ones in osops-tools-monitoring, run
>>> from Nagios, which help with more complex testing, but I'm thinking of
>>> something more lightweight specifically for use on load balancers.
>>>
>>> cheers,
>>> Andy
>>>
>>