[Openstack-operators] [monitoring][messaging][rpc] When your OpenStack app is dead

Bogdan Dobrelya bdobrelia at mirantis.com
Tue Aug 11 14:20:22 UTC 2015


There is a known issue [0] in Oslo messaging, and it seems to be
resolved in Kilo. But the UX of this one is very sad. For example, each
time your AMQP cluster goes through a single-node failover and recovers,
there is a chance that some OpenStack apps, like Nova Compute, get stuck
in a broken state that only a restart can heal.

The typical log pattern for a service in this broken state is "Timed
out waiting for reply". Hence, it may be a good idea to implement
monitoring filters based on that pattern and automatically set an alert
status for the affected OpenStack services.
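
As a rough illustration of such a filter, here is a minimal Python
sketch. It only counts occurrences of the pattern in a batch of log
lines and maps the count to a status. The function names, the "OK" and
"ALERT" status strings, and the threshold of 3 are assumptions made for
this example; they do not come from Oslo messaging or from any
particular monitoring tool.

```python
import re

# The log pattern named above; everything else here is hypothetical.
TIMEOUT_PATTERN = re.compile(r"Timed out waiting for reply")


def count_timeouts(lines):
    """Count the log lines that contain the timeout pattern."""
    return sum(1 for line in lines if TIMEOUT_PATTERN.search(line))


def alert_status(lines, threshold=3):
    """Return "ALERT" if the pattern shows up at least `threshold` times.

    The threshold is an assumed tuning knob: a single timeout may be
    harmless, while repeated ones suggest the service is stuck.
    """
    return "ALERT" if count_timeouts(lines) >= threshold else "OK"
```

In practice you would feed it the most recent lines of a service log
(for example, the tail of the Nova Compute log) from whatever check
runner you already use, and tune the threshold to your environment.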

[0] https://bugs.launchpad.net/oslo.messaging/+bug/1338732

-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando


