[neutron][ops] API for viewing HA router states
Assaf Muller
amuller at redhat.com
Tue Aug 18 14:48:23 UTC 2020
On Tue, Aug 18, 2020 at 8:12 AM Jonas Schäfer
<jonas.schaefer at cloudandheat.com> wrote:
>
> Hi Mohammed and all,
>
> On Montag, 17. August 2020 14:01:55 CEST Mohammed Naser wrote:
> > Over the past few days, we were troubleshooting an issue whose root
> > cause turned out to be keepalived somehow ending up active in two
> > different L3 agents. We've yet to find out how this happened, but
> > removing it and re-adding it resolved the issue for us.
>
> We’ve also seen that behaviour occasionally. The root cause is also unclear
> for us (so we would love to hear about that).
Insert shameless plug for the Neutron OVN backend. One of its
advantages is that its L3 HA architecture is cleaner and more
scalable (this is coming from the dude that wrote the L3 HA code we're
all suffering from =D). The ML2/OVS L3 HA architecture has its issues
- I've seen it work at hundreds of customer sites at scale, so I don't
want to knock it too much, but just a day ago I got an internal
customer ticket about keepalived falling over on a particular router
that has 200 floating IPs. It works but it's not perfect. I'm sure the
OVN implementation isn't either but it's simply cleaner and has fewer
moving parts. It uses BFD to monitor the tunnel endpoints, so failover
is faster too. Plus, it doesn't use keepalived.
> We have anecdotal evidence
> that a rabbitmq failure was involved, although that makes no sense to me
> personally. Other causes may be incorrectly cleaned-up namespaces (for
> example, when you kill or hard-restart the l3 agent, the namespaces will stay
> around, possibly with the IP address assigned; the keepalived on the other l3
> agents will not see the VRRP advertisements anymore and will ALSO assign the IP
> address. A restart will not always rectify this either, and it may require
> manual namespace cleanup with a tool, a node reboot or an agent disable/enable
> cycle.).
>
> > As we work on improving our monitoring, we wanted to implement
> > something that gets us the info of # of active routers to check if
> > there's a router that has >1 active L3 agent but it's hard because
> > hitting the /l3-agents endpoint on _every_ single router hurts a lot
> > on performance.
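
For illustration, the check described above (flagging any router that does not have exactly one active L3 agent) could look roughly like the sketch below. It is only the counting step: it assumes the per-router agent lists have already been fetched, and that each agent entry carries an "ha_state" field with values such as "active" and "standby". The function names and the sample data are hypothetical, and it does nothing about the cost of querying every router.

```python
def active_agent_counts(agents_by_router):
    """Map each router ID to the number of agents reporting ha_state 'active'."""
    return {
        router_id: sum(1 for agent in agents if agent.get("ha_state") == "active")
        for router_id, agents in agents_by_router.items()
    }


def routers_to_alert_on(agents_by_router):
    """Return {router_id: active_count} for routers without exactly one active agent."""
    counts = active_agent_counts(agents_by_router)
    return {router_id: n for router_id, n in counts.items() if n != 1}


# Hypothetical sample data, shaped like one agent list per router.
sample = {
    "router-a": [{"host": "net1", "ha_state": "active"},
                 {"host": "net2", "ha_state": "standby"}],
    "router-b": [{"host": "net1", "ha_state": "active"},
                 {"host": "net2", "ha_state": "active"}],
    "router-c": [{"host": "net1", "ha_state": "standby"},
                 {"host": "net2", "ha_state": "standby"}],
}

print(routers_to_alert_on(sample))  # {'router-b': 2, 'router-c': 0}
```

Both failure modes show up in the result: router-b has two active agents, router-c has none.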
> >
> > Is there something else that we can watch which might be more
> > productive? FYI -- this all goes in the open and will end up inside
> > the openstack-exporter:
> > https://github.com/openstack-exporter/openstack-exporter and the Helm
> > charts will end up with the alerts:
> > https://github.com/openstack-exporter/helm-charts
>
> While I don’t think it fits in your openstack-exporter design, we are
> currently using the attached script (which we also hereby publish under the
> terms of the Apache 2.0 license [1]). (Sorry, I lack the time to cleanly
> publish it somewhere right now.)
>
> It checks the state files maintained by the L3 agent conglomerate and exports
> metrics about the master-ness of the routers as prometheus metrics.
>
> Note that this is slightly dangerous since the router IDs are high-cardinality
> and using that as a label value in Prometheus is discouraged; you may not want
> to do this in a public cloud setting.
>
> Either way: This allows us to alert on routers where there is not exactly one
> master state. Downside is that this requires the thing to run locally on the
> l3 agent nodes. Upside is that it is very efficient, and will also show the
> master state in some cases where the router was not cleaned up properly (e.g.
> because the l3 agent and its keepaliveds were killed).
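
To make the state-file approach above concrete, here is a minimal sketch. It is not the attached script. It assumes the default layout where each HA router has a file at /var/lib/neutron/ha_confs/<router_id>/state containing a word such as "master" or "backup" (newer releases may write "primary"). The metric name and function names are made up for the example.

```python
from pathlib import Path

# Assumption: default Neutron state_path; adjust if state_path is configured differently.
HA_CONFS = Path("/var/lib/neutron/ha_confs")


def read_router_states(ha_confs=HA_CONFS):
    """Return {router_id: state} for every router directory that has a state file."""
    states = {}
    for state_file in sorted(ha_confs.glob("*/state")):
        try:
            states[state_file.parent.name] = state_file.read_text().strip()
        except OSError:
            continue  # router was removed between listing and reading
    return states


def render_metrics(states):
    """Render the states in the Prometheus text exposition format."""
    lines = [
        # Hypothetical metric name, not taken from the original script.
        "# HELP neutron_l3_ha_router_master 1 if this node holds the master state.",
        "# TYPE neutron_l3_ha_router_master gauge",
    ]
    for router_id, state in sorted(states.items()):
        value = 1 if state in ("master", "primary") else 0
        lines.append(
            f'neutron_l3_ha_router_master{{router_id="{router_id}"}} {value}'
        )
    return "\n".join(lines) + "\n"
```

Run on each l3 agent node, the output of render_metrics(read_router_states()) can be scraped, and an alert can fire when the per-router sum across nodes is not exactly 1. The router_id label is the high-cardinality label mentioned above.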
>
> kind regards,
> Jonas
>
> [1]: http://www.apache.org/licenses/LICENSE-2.0
> --
> Jonas Schäfer
> DevOps Engineer