[neutron][ops] API for viewing HA router states
Mohammed Naser
mnaser at vexxhost.com
Wed Aug 19 02:30:08 UTC 2020
On Tue, Aug 18, 2020 at 10:53 AM Assaf Muller <amuller at redhat.com> wrote:
>
> On Tue, Aug 18, 2020 at 8:12 AM Jonas Schäfer
> <jonas.schaefer at cloudandheat.com> wrote:
> >
> > Hi Mohammed and all,
> >
> > On Montag, 17. August 2020 14:01:55 CEST Mohammed Naser wrote:
> > > Over the past few days, we were troubleshooting an issue whose root
> > > cause turned out to be keepalived somehow ending up active in
> > > two different L3 agents. We've yet to find out how this
> > > happened, but removing it and re-adding it resolved the issue for us.
> >
> > We’ve also seen that behaviour occasionally. The root cause is also unclear
> > for us (so we would love to hear about that).
>
> Insert shameless plug for the Neutron OVN backend. One of its
> advantages is that its L3 HA architecture is cleaner and more
> scalable (this is coming from the dude that wrote the L3 HA code we're
> all suffering from =D). The ML2/OVS L3 HA architecture has its issues
> - I've seen it work at hundreds of customer sites at scale, so I don't
> want to knock it too much, but just a day ago I got an internal
> customer ticket about keepalived falling over on a particular router
> that has 200 floating IPs. It works but it's not perfect. I'm sure the
> OVN implementation isn't either but it's simply cleaner and has fewer
> moving parts. It uses BFD to monitor the tunnel endpoints, so failover
> is faster too. Plus, it doesn't use keepalived.
>
OVN is something we're looking at and are very excited about.
Unfortunately, there seem to be a number of gaps in the documentation
right now, and a lot of the migration scripts to OVN are TripleO-y.
So it'll take time to get us there, but yes, OVN simplifies this greatly.

> > We have anecdotal evidence
> > that a rabbitmq failure was involved, although that makes no sense to me
> > personally. Other causes may be incorrectly cleaned-up namespaces (for
> > example, when you kill or hard-restart the l3 agent, the namespaces will stay
> > around, possibly with the IP address assigned; the keepalived on the other l3
> > agents will not see the VRRP advertisements anymore and will ALSO assign the IP
> > address. This will also be rectified by a restart always and may require
> > manual namespace cleanup with a tool, a node reboot or an agent disable/enable
> > cycle.).
> >
> > > As we work on improving our monitoring, we wanted to implement
> > > something that gets us the info of # of active routers to check if
> > > there's a router that has >1 active L3 agent but it's hard because
> > > hitting the /l3-agents endpoint on _every_ single router hurts a lot
> > > on performance.
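
For reference, the check described above boils down to counting "active"
entries per router. Below is a minimal sketch in Python, assuming the
per-router l3-agents listing has already been fetched into a dict and that
each agent entry carries an `ha_state` field ("active" or "standby"). The
function names and the input shape are made up for illustration, and the
sketch does nothing about the cost of the per-router API calls themselves.

```python
def active_agent_counts(agents_by_router):
    """Count agents reporting ha_state == "active" for each router.

    agents_by_router: {router_id: [agent_dict, ...]} (hypothetical shape,
    one list per router as returned by the l3-agents listing).
    """
    counts = {}
    for router_id, agents in agents_by_router.items():
        counts[router_id] = sum(
            1 for agent in agents if agent.get("ha_state") == "active"
        )
    return counts


def routers_needing_attention(agents_by_router):
    """Return {router_id: active_count} where the count is not exactly 1."""
    counts = active_agent_counts(agents_by_router)
    return {rid: n for rid, n in counts.items() if n != 1}
```

Anything returned by `routers_needing_attention` has either no active agent
or more than one, which is the condition we would want to alert on.
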
> > >
> > > Is there something else that we can watch which might be more
> > > productive? FYI -- this all goes in the open and will end up inside
> > > the openstack-exporter:
> > > https://github.com/openstack-exporter/openstack-exporter and the Helm
> > > charts will end up with the alerts:
> > > https://github.com/openstack-exporter/helm-charts
> >
> > While I don’t think it fits in your openstack-exporter design, we are
> > currently using the attached script (which we also hereby publish under the
> > terms of the Apache 2.0 license [1]). (Sorry, I lack the time to cleanly
> > publish it somewhere right now.)
> >
> > It checks the state files maintained by the L3 agent conglomerate and exports
> > metrics about the master-ness of the routers as prometheus metrics.
> >
> > Note that this is slightly dangerous since the router IDs are high-cardinality
> > and using that as a label value in Prometheus is discouraged; you may not want
> > to do this in a public cloud setting.
> >
> > Either way: This allows us to alert on routers where there is not exactly one
> > master state. Downside is that this requires the thing to run locally on the
> > l3 agent nodes. Upside is that it is very efficient, and will also show the
> > master state in some cases where the router was not cleaned up properly (e.g.
> > because the l3 agent and its keepaliveds were killed).
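
The attachment is not reproduced in the archive, so here is a rough sketch
of the idea as described, not the actual script. It assumes each HA router
has a directory named after its ID containing a `state` file (for example
under /var/lib/neutron/ha_confs/<router_id>/state), and that the file holds
"master" or "backup" (possibly "primary" on newer releases). The default
path, the metric name and the function names are assumptions for
illustration.

```python
import os

# Assumption: default location of the per-router keepalived state files.
DEFAULT_HA_CONFS = "/var/lib/neutron/ha_confs"

# Assumption: older releases write "master", newer ones may write "primary".
ACTIVE_STATES = {"master", "primary"}


def read_router_states(ha_confs_dir):
    """Return {router_id: state_string} for every readable state file."""
    states = {}
    for router_id in sorted(os.listdir(ha_confs_dir)):
        state_path = os.path.join(ha_confs_dir, router_id, "state")
        try:
            with open(state_path) as f:
                states[router_id] = f.read().strip().lower()
        except OSError:
            # Not a router directory, or no state file has been written.
            continue
    return states


def render_metrics(states, metric="neutron_l3_ha_router_master"):
    """Render Prometheus text exposition lines: 1 if active, else 0."""
    lines = ["# TYPE %s gauge" % metric]
    for router_id, state in sorted(states.items()):
        value = 1 if state in ACTIVE_STATES else 0
        lines.append('%s{router_id="%s"} %d' % (metric, router_id, value))
    return "\n".join(lines) + "\n"
```

Running `render_metrics(read_router_states(DEFAULT_HA_CONFS))` on each L3
agent node and summing the metric per `router_id` across nodes would give
the "exactly one master" check. The cardinality caveat above applies to the
`router_id` label.
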
> >
> > kind regards,
> > Jonas
> >
> > [1]: http://www.apache.org/licenses/LICENSE-2.0
> > --
> > Jonas Schäfer
> > DevOps Engineer
> >
> > Cloud&Heat Technologies GmbH
> > Königsbrücker Straße 96 | 01099 Dresden
> > +49 351 479 367 37
> > jonas.schaefer at cloudandheat.com | www.cloudandheat.com
> >
> > Commercial Register: District Court Dresden
> > Register Number: HRB 30549
> > VAT ID No.: DE281093504
> > Managing Director: Nicolas Röhrs
> > Authorized signatory: Dr. Marius Feldmann
> > Authorized signatory: Kristina Rübenkamp
>
>
--
Mohammed Naser
VEXXHOST, Inc.