[neutron][ops] API for viewing HA router states

Jonas Schäfer jonas.schaefer at cloudandheat.com
Tue Aug 18 12:08:42 UTC 2020


Hi Mohammed and all,

On Montag, 17. August 2020 14:01:55 CEST Mohammed Naser wrote:
> Over the past few days, we were troubleshooting an issue that ended up
> having a root cause where keepalived has somehow ended up active in
> two different L3 agents.  We've yet to find the root cause of how this
> happened but removing it and adding it resolved the issue for us.

We’ve also seen that behaviour occasionally. The root cause is unclear to us 
as well (so we would love to hear about it). We have anecdotal evidence 
that a rabbitmq failure was involved, although that makes no sense to me 
personally. Another possible cause is incorrectly cleaned-up namespaces. For 
example, when you kill or hard-restart the l3 agent, the namespaces stay 
around, possibly with the IP address still assigned; the keepalived instances 
on the other l3 agents no longer see the VRRP advertisements and ALSO assign 
the IP address. This is not always rectified by a restart and may require 
manual namespace cleanup with a tool, a node reboot or an agent disable/enable 
cycle.

> As we work on improving our monitoring, we wanted to implement
> something that gets us the info of # of active routers to check if
> there's a router that has >1 active L3 agent but it's hard because
> hitting the /l3-agents endpoint on _every_ single router hurts a lot
> on performance.
> 
> Is there something else that we can watch which might be more
> productive?  FYI -- this all goes in the open and will end up inside
> the openstack-exporter:
> https://github.com/openstack-exporter/openstack-exporter and the Helm
> charts will end up with the alerts:
> https://github.com/openstack-exporter/helm-charts

While I don’t think it fits in your openstack-exporter design, we are 
currently using the attached script (which we also hereby publish under the 
terms of the Apache 2.0 license [1]). (Sorry, I lack the time to cleanly 
publish it somewhere right now.)

It reads the state files maintained by the L3 agent and its helper processes 
and exports the master-ness of each router as Prometheus metrics.
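
For readers without access to the attachment, here is a minimal sketch of the 
idea. It is not the attached script. It assumes that each HA router has a 
directory under /var/lib/neutron/ha_confs/ containing a file named "state" 
with a single word such as "master" or "backup" (newer releases may write 
"primary" instead); the path, the metric name l3_router_is_master and the 
function names are illustrative assumptions, so adjust them to your deployment.

```python
from pathlib import Path

# Assumption: default neutron state_path; adjust to your deployment.
HA_CONFS_DIR = Path("/var/lib/neutron/ha_confs")


def collect_states(ha_confs_dir=HA_CONFS_DIR):
    """Map router ID -> content of its state file (e.g. "master", "backup")."""
    states = {}
    base = Path(ha_confs_dir)
    if not base.is_dir():
        return states
    for entry in sorted(base.iterdir()):
        state_file = entry / "state"
        if not entry.is_dir() or not state_file.is_file():
            continue
        try:
            states[entry.name] = state_file.read_text().strip()
        except OSError:
            # The router may be deleted while we scan; skip it.
            continue
    return states


def render_metrics(states):
    """Render the states in the Prometheus text exposition format."""
    lines = [
        "# HELP l3_router_is_master 1 if this node holds the master state.",
        "# TYPE l3_router_is_master gauge",
    ]
    for router_id, state in sorted(states.items()):
        # Assumption: "primary" is treated as a synonym for "master".
        value = 1 if state in ("master", "primary") else 0
        lines.append(f'l3_router_is_master{{router_id="{router_id}"}} {value}')
    return "\n".join(lines) + "\n"
```

Writing the output of render_metrics(collect_states()) to a file picked up by 
the node_exporter textfile collector would be one way to get this into 
Prometheus; serving it over HTTP would be another.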

Note that this is slightly dangerous since the router IDs are high-cardinality 
and using them as label values in Prometheus is discouraged; you may not want 
to do this in a public cloud setting.

Either way: This allows us to alert on routers where there is not exactly one 
master state. The downside is that the script has to run locally on the 
l3 agent nodes. The upside is that it is very efficient, and it also shows the 
master state in some cases where the router was not cleaned up properly (e.g. 
because the l3 agent and its keepaliveds were killed).
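
The "not exactly one master" check can be sketched as follows. In practice 
this aggregation would more likely live in a Prometheus alerting rule; the 
Python below only illustrates the logic. The function name and the input 
shape (node name mapped to router ID mapped to state) are assumptions made 
for the example.

```python
from collections import Counter


def routers_without_single_master(states_by_node):
    """Return {router_id: master_count} for routers whose count is not 1.

    states_by_node is {node_name: {router_id: state}}, where state is the
    word reported by that node, e.g. "master" or "backup".
    """
    masters = Counter()
    for states in states_by_node.values():
        for router_id, state in states.items():
            # Adding 0 still registers the router, so routers with no
            # master at all are reported too.
            masters[router_id] += 1 if state == "master" else 0
    return {rid: count for rid, count in masters.items() if count != 1}
```

A router reported with a count of 2 or more is the split-brain case from this 
thread; a count of 0 means no node currently claims the master state.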

kind regards,
Jonas

   [1]: http://www.apache.org/licenses/LICENSE-2.0
-- 
Jonas Schäfer
DevOps Engineer

Cloud&Heat Technologies GmbH
Königsbrücker Straße 96 | 01099 Dresden
+49 351 479 367 37
jonas.schaefer at cloudandheat.com | www.cloudandheat.com

-------------- next part --------------
A non-text attachment was scrubbed...
Name: os_l3_router_exporter.py
Type: text/x-python3
Size: 1780 bytes
Desc: not available
URL: <http://lists.openstack.org/pipermail/openstack-discuss/attachments/20200818/97f65ca1/attachment.bin>