[neutron][ops] API for viewing HA router states
Jonas Schäfer
jonas.schaefer at cloudandheat.com
Tue Aug 18 12:08:42 UTC 2020
Hi Mohammed and all,
On Monday, 17 August 2020 14:01:55 CEST Mohammed Naser wrote:
> Over the past few days, we were troubleshooting an issue that ended up
> having a root cause where keepalived has somehow ended up active in
> two different L3 agents. We've yet to find the root cause of how this
> happened but removing it and adding it resolved the issue for us.
We’ve also seen that behaviour occasionally. The root cause is unclear to us
as well (so we would love to hear about it). We have anecdotal evidence
that a RabbitMQ failure was involved, although that makes no sense to me
personally. Another possible cause is incorrectly cleaned-up namespaces. For
example, when you kill or hard-restart the l3 agent, the namespaces will stay
around, possibly with the IP address still assigned; the keepalived on the
other l3 agents will no longer see the VRRP advertisements and will ALSO
assign the IP address. This is not always rectified by a restart and may
require manual namespace cleanup with a tool, a node reboot, or an agent
disable/enable cycle.
> As we work on improving our monitoring, we wanted to implement
> something that gets us the info of # of active routers to check if
> there's a router that has >1 active L3 agent but it's hard because
> hitting the /l3-agents endpoint on _every_ single router hurts a lot
> on performance.
>
> Is there something else that we can watch which might be more
> productive? FYI -- this all goes in the open and will end up inside
> the openstack-exporter:
> https://github.com/openstack-exporter/openstack-exporter and the Helm
> charts will end up with the alerts:
> https://github.com/openstack-exporter/helm-charts
While I don’t think it fits in your openstack-exporter design, we are
currently using the attached script (which we also hereby publish under the
terms of the Apache 2.0 license [1]). (Sorry, I lack the time to cleanly
publish it somewhere right now.)
It checks the state files maintained by the L3 agent conglomerate and exports
the master-ness of each router as Prometheus metrics.
Note that this is slightly dangerous since the router IDs are high-cardinality
and using them as label values in Prometheus is discouraged; you may not want
to do this in a public cloud setting.
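
The attached script is not reproduced in this archive. As a rough illustration
of the idea only (not the script itself), a minimal sketch could look like the
following. It assumes the L3 agent writes one state file per HA router at
<state_path>/ha_confs/<router_id>/state, with /var/lib/neutron as the state
path; the metric name os_l3_router_is_master is hypothetical, and accepting
"primary" in addition to "master" is an assumption about newer terminology.

```python
from pathlib import Path

# Assumed location of the per-router keepalived state written by the L3 agent;
# adjust to the state_path used in your deployment.
HA_CONFS_DIR = Path("/var/lib/neutron/ha_confs")

# Hypothetical metric name chosen for this sketch.
METRIC = "os_l3_router_is_master"


def read_router_states(ha_confs_dir):
    """Return {router_id: state} for every router directory with a state file."""
    states = {}
    base = Path(ha_confs_dir)
    if not base.is_dir():
        return states
    for router_dir in sorted(base.iterdir()):
        state_file = router_dir / "state"
        if not state_file.is_file():
            continue
        states[router_dir.name] = state_file.read_text().strip().lower()
    return states


def render_metrics(states):
    """Render the states in the Prometheus text exposition format."""
    lines = [
        f"# HELP {METRIC} 1 if this node holds the master state for the router.",
        f"# TYPE {METRIC} gauge",
    ]
    for router_id, state in sorted(states.items()):
        # Assumption: treat "primary" like "master" in case the state file
        # uses the newer wording.
        value = 1 if state in ("master", "primary") else 0
        lines.append(f'{METRIC}{{router_id="{router_id}"}} {value}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics(read_router_states(HA_CONFS_DIR)), end="")
```

The output could, for example, be written to a file picked up by the
node_exporter textfile collector; how the metrics are actually served is left
open here.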
Either way: This allows us to alert on routers where there is not exactly one
master state. Downside is that this requires the thing to run locally on the
l3 agent nodes. Upside is that it is very efficient, and will also show the
master state in some cases where the router was not cleaned up properly (e.g.
because the l3 agent and its keepaliveds were killed).
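
In practice this check would most likely live in a Prometheus alert rule that
sums the per-node metric by router. The same logic, sketched in Python under
the assumption that the per-node states have already been collected into a
dictionary (the node names and router IDs below are made up):

```python
from collections import Counter


def routers_without_single_master(states_by_node):
    """Return {router_id: master_count} for routers whose count is not 1.

    states_by_node maps a node name to a {router_id: state} mapping as
    reported by that node.
    """
    masters = Counter()
    for states in states_by_node.values():
        for router_id, state in states.items():
            # Adding 0 still registers the router, so routers with no master
            # anywhere are reported as well.
            masters[router_id] += 1 if state == "master" else 0
    return {rid: n for rid, n in sorted(masters.items()) if n != 1}


if __name__ == "__main__":
    example = {
        "node-a": {"r1": "master", "r2": "master"},
        "node-b": {"r1": "master", "r2": "backup", "r3": "backup"},
    }
    # r1 is master on two nodes, r3 on none; r2 is healthy and not listed.
    print(routers_without_single_master(example))
```

A router that shows up with a count of 2 or more corresponds to the
split-brain case described at the top of the thread; a count of 0 means no
node currently claims the master state.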
kind regards,
Jonas
[1]: http://www.apache.org/licenses/LICENSE-2.0
-------------- next part --------------
A non-text attachment was scrubbed...
Name: os_l3_router_exporter.py
Type: text/x-python3
Size: 1780 bytes
Desc: not available
URL: <http://lists.openstack.org/pipermail/openstack-discuss/attachments/20200818/97f65ca1/attachment.bin>