[neutron][ops] API for viewing HA router states
Mohammed Naser
mnaser at vexxhost.com
Wed Aug 19 02:31:17 UTC 2020
On Tue, Aug 18, 2020 at 8:12 AM Jonas Schäfer
<jonas.schaefer at cloudandheat.com> wrote:
>
> Hi Mohammed and all,
>
> On Montag, 17. August 2020 14:01:55 CEST Mohammed Naser wrote:
> > Over the past few days, we were troubleshooting an issue that ended up
> > having a root cause where keepalived has somehow ended up active in
> > two different L3 agents. We've yet to find the root cause of how this
> > happened but removing it and adding it resolved the issue for us.
>
> We’ve also seen that behaviour occasionally. The root cause is also unclear
> for us (so we would’ve love to hear about that). We have anecdotal evidence
> that a rabbitmq failure was involved, although that makes no sense to me
> personally. Other causes may be incorrectly cleaned-up namespaces (for
> example, when you kill or hard-restart the l3 agent, the namespaces will stay
> around, possibly with the IP address assigned; the keepalived on the other l3
> agents will not see the VRRP advertisments anymore and will ALSO assign the IP
> address. This will also be rectified by a restart always and may require
> manual namespace cleanup with a tool, a node reboot or an agent disable/enable
> cycle.).
>
> > As we work on improving our monitoring, we wanted to implement
> > something that gets us the info of # of active routers to check if
> > there's a router that has >1 active L3 agent but it's hard because
> > hitting the /l3-agents endpoint on _every_ single router hurts a lot
> > on performance.
> >
> > Is there something else that we can watch which might be more
> > productive? FYI -- this all goes in the open and will end up inside
> > the openstack-exporter:
> > https://github.com/openstack-exporter/openstack-exporter and the Helm
> > charts will end up with the alerts:
> > https://github.com/openstack-exporter/helm-charts
>
> While I don’t think it fits in your openstack-exporter design, we are
> currently using the attached script (which we also hereby publish under the
> terms of the Apache 2.0 license [1]). (Sorry, I lack the time to cleanly
> publish it somewhere right now.)
>
> It checks the state files maintained by the L3 agent conglomerate and exports
> metrics about the master-ness of the routers as prometheus metrics.
>
> Note that this is slightly dangerous since the router IDs are high-cardinality
> and using that as a label value in Prometheus is discouraged; you may not want
> to do this in a public cloud setting.
>
> Either way: This allows us to alert on routers where there is not exactly one
> master state. Downside is that this requires the thing to run locally on the
> l3 agent nodes. Upside is that it is very efficient, and will also show the
> master state in some cases where the router was not cleaned up properly (e.g.
> because the l3 agent and its keepaliveds were killed).
This seems sweet. Let me go over the code. I might package this up
into something
consumable and host it inside OpenDev, if that's okay with you?
> kind regards,
> Jonas
>
> [1]: http://www.apache.org/licenses/LICENSE-2.0
> --
> Jonas Schäfer
> DevOps Engineer
>
> Cloud&Heat Technologies GmbH
> Königsbrücker Straße 96 | 01099 Dresden
> +49 351 479 367 37
> jonas.schaefer at cloudandheat.com | www.cloudandheat.com
>
> New Service:
> Managed Kubernetes designed for AI & ML
> https://managed-kubernetes.cloudandheat.com/
>
> Commercial Register: District Court Dresden
> Register Number: HRB 30549
> VAT ID No.: DE281093504
> Managing Director: Nicolas Röhrs
> Authorized signatory: Dr. Marius Feldmann
> Authorized signatory: Kristina Rübenkamp
--
Mohammed Naser
VEXXHOST, Inc.
More information about the openstack-discuss
mailing list