[neutron][ops] API for viewing HA router states

Assaf Muller amuller at redhat.com
Wed Aug 19 16:50:04 UTC 2020


On Tue, Aug 18, 2020 at 10:30 PM Mohammed Naser <mnaser at vexxhost.com> wrote:
>
> On Tue, Aug 18, 2020 at 10:53 AM Assaf Muller <amuller at redhat.com> wrote:
> >
> > On Tue, Aug 18, 2020 at 8:12 AM Jonas Schäfer
> > <jonas.schaefer at cloudandheat.com> wrote:
> > >
> > > Hi Mohammed and all,
> > >
> > > On Montag, 17. August 2020 14:01:55 CEST Mohammed Naser wrote:
> > > > Over the past few days, we were troubleshooting an issue that ended up
> > > > having a root cause where keepalived has somehow ended up active in
> > > > two different L3 agents.  We've yet to find the root cause of how this
> > > > happened but removing it and adding it resolved the issue for us.
> > >
> > > We’ve also seen that behaviour occasionally. The root cause is also unclear
> > > for us (so we would’ve loved to hear about that).
> >
> > Insert shameless plug for the Neutron OVN backend. One of its
> > advantages is that its L3 HA architecture is cleaner and more
> > scalable (this is coming from the dude that wrote the L3 HA code we're
> > all suffering from =D). The ML2/OVS L3 HA architecture has its issues
> > - I've seen it work at hundreds of customer sites at scale, so I don't
> > want to knock it too much, but just a day ago I got an internal
> > customer ticket about keepalived falling over on a particular router
> > that has 200 floating IPs. It works, but it's not perfect. I'm sure the
> > OVN implementation isn't either, but it's simply cleaner and has fewer
> > moving parts. It uses BFD to monitor the tunnel endpoints, so failover
> > is faster too. Plus, it doesn't use keepalived.
> >
>
> OVN is something we're looking at and we're very excited about;
> unfortunately, there seem to be a number of gaps in the documentation

Can you elaborate?  If you can write down a list of gaps, we can address them.

> right now as well as a lot of the migration scripts to OVN are
> TripleO-y.
>
> So it'll take time to get us there, but yes, OVN simplifies this greatly
>
> > > We have anecdotal evidence
> > > that a rabbitmq failure was involved, although that makes no sense to me
> > > personally. Other causes may be incorrectly cleaned-up namespaces (for
> > > example, when you kill or hard-restart the l3 agent, the namespaces will stay
> > > around, possibly with the IP address still assigned; the keepaliveds on the
> > > other l3 agents will no longer see the VRRP advertisements and will ALSO
> > > assign the IP address. This is not always rectified by a restart and may
> > > require manual namespace cleanup with a tool, a node reboot, or an agent
> > > disable/enable cycle.).
> > >
> > > > As we work on improving our monitoring, we wanted to implement
> > > > something that gives us the number of active L3 agents per router, so
> > > > we can check whether any router has >1 active L3 agent. That's hard
> > > > because hitting the /l3-agents endpoint on _every_ single router hurts
> > > > performance a lot.
> > > >
> > > > Is there something else that we can watch which might be more
> > > > productive?  FYI -- this all goes in the open and will end up inside
> > > > the openstack-exporter:
> > > > https://github.com/openstack-exporter/openstack-exporter and the Helm
> > > > charts will end up with the alerts:
> > > > https://github.com/openstack-exporter/helm-charts
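For what it's worth, if you do end up polling the API, the check itself boils down to something small. Sketch below; the agent dicts mimic the `ha_state` field returned by GET /v2.0/routers/<router_id>/l3-agents, and the fetch side (openstacksdk, neutronclient, raw HTTP) is left to whatever client you're already using:

```python
def routers_with_split_brain(router_agents):
    """Flag routers reporting more than one active L3 agent.

    router_agents: dict mapping router_id -> list of agent dicts,
    where each agent dict carries an "ha_state" key ("active" /
    "standby"), as per router's l3-agents listing returns.

    Returns the router IDs in a potential dual-master situation.
    """
    bad = []
    for router_id, agents in router_agents.items():
        active = [a for a in agents if a.get("ha_state") == "active"]
        if len(active) > 1:
            bad.append(router_id)
    return bad
```

The expensive part remains collecting `router_agents` (one API call per router), which is exactly the cost you're trying to avoid, so this only helps if you batch or sample.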
> > >
> > > While I don’t think it fits in your openstack-exporter design, we are
> > > currently using the attached script (which we also hereby publish under the
> > > terms of the Apache 2.0 license [1]). (Sorry, I lack the time to cleanly
> > > publish it somewhere right now.)
> > >
> > > It checks the state files maintained by the L3 agent conglomerate and
> > > exports the master state of each router as Prometheus metrics.
> > >
> > > Note that this is slightly dangerous since the router IDs are high-cardinality
> > > and using that as a label value in Prometheus is discouraged; you may not want
> > > to do this in a public cloud setting.
> > >
> > > Either way: This allows us to alert on routers where there is not exactly one
> > > master state. Downside is that this requires the thing to run locally on the
> > > l3 agent nodes. Upside is that it is very efficient, and will also show the
> > > master state in some cases where the router was not cleaned up properly (e.g.
> > > because the l3 agent and its keepaliveds were killed).
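That matches my understanding: the L3 agent writes each HA router's keepalived state to a per-router `state` file, by default under /var/lib/neutron/ha_confs/<router_id>/ on ML2/OVS deployments (the exact path depends on `state_path` and your distro packaging). A minimal sketch of the local collection step, with the directory passed in rather than hardcoded:

```python
import os


def collect_ha_states(ha_confs_dir):
    """Collect per-router HA states from the L3 agent's state files.

    Scans <ha_confs_dir>/<router_id>/state and returns a mapping of
    router_id -> state string ("master" / "backup" / "fault").
    Missing or unreadable routers are simply skipped.
    """
    states = {}
    for router_id in os.listdir(ha_confs_dir):
        state_file = os.path.join(ha_confs_dir, router_id, "state")
        if os.path.isfile(state_file):
            with open(state_file) as f:
                states[router_id] = f.read().strip()
    return states
```

From there, exporting a gauge per (router_id, state) pair is straightforward, with the cardinality caveat you already mentioned.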
> > > kind regards,
> > > Jonas
> > >
> > >    [1]: http://www.apache.org/licenses/LICENSE-2.0
> > > --
> > > Jonas Schäfer
> > > DevOps Engineer
> > >
> > > Cloud&Heat Technologies GmbH
> > > Königsbrücker Straße 96 | 01099 Dresden
> > > +49 351 479 367 37
> > > jonas.schaefer at cloudandheat.com | www.cloudandheat.com
> > >
> > > New Service:
> > > Managed Kubernetes designed for AI & ML
> > > https://managed-kubernetes.cloudandheat.com/
> > >
> > > Commercial Register: District Court Dresden
> > > Register Number: HRB 30549
> > > VAT ID No.: DE281093504
> > > Managing Director: Nicolas Röhrs
> > > Authorized signatory: Dr. Marius Feldmann
> > > Authorized signatory: Kristina Rübenkamp
> >
> >
>
>
> --
> Mohammed Naser
> VEXXHOST, Inc.
>
