[neutron][ops] API for viewing HA router states
Hi all,

Over the past few days, we were troubleshooting an issue whose root cause turned out to be keepalived somehow ending up active on two different L3 agents. We've yet to find out how this happened, but removing and re-adding the router resolved the issue for us.

As we work on improving our monitoring, we wanted to implement something that gives us the number of active L3 agents per router, so we can check whether any router has more than one active agent. That's hard, though, because hitting the /l3-agents endpoint on _every_ single router hurts performance a lot.

Is there something else that we can watch which might be more productive? FYI -- this is all done in the open and will end up inside the openstack-exporter: https://github.com/openstack-exporter/openstack-exporter and the Helm charts will end up with the alerts: https://github.com/openstack-exporter/helm-charts

Thanks!
Mohammed

--
Mohammed Naser
VEXXHOST, Inc.
Hi,

I can just tell you that we are doing a similar check for the dhcp-agent, but there we just execute a suitable SQL statement to detect more than one agent per AZ. Doing the same for L3 shouldn't be that hard, but I don't know if this is what you are looking for?

Fabian
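For readers who want to see what the L3 variant of such a check could look like, here is a minimal sketch in Python. It is an illustration only: the database DSN is a placeholder, and the ha_router_agent_port_bindings table and its state column are assumptions based on the usual Neutron schema, so verify both against your own deployment before relying on it.

  #!/usr/bin/env python3
  """Sketch of a DB-level check for routers with more than one active L3 agent.

  Assumptions: the Neutron database is reachable via the placeholder DSN below,
  and HA states live in ha_router_agent_port_bindings with state = 'active' for
  the master instance. Verify both against your deployment.
  """
  import sqlalchemy as sa

  # Placeholder DSN -- replace with real Neutron DB credentials.
  engine = sa.create_engine("mysql+pymysql://neutron:secret@db-host/neutron")

  QUERY = sa.text("""
      SELECT router_id, COUNT(*) AS active_count
      FROM ha_router_agent_port_bindings
      WHERE state = 'active'
      GROUP BY router_id
      HAVING COUNT(*) > 1
  """)
  # Note: a router with *no* active instance has no matching rows and will not
  # show up here; catching that case needs an outer join against the routers table.

  def routers_with_multiple_active_agents():
      """Return {router_id: active_count} for routers with more than one active agent."""
      with engine.connect() as conn:
          return {row.router_id: row.active_count for row in conn.execute(QUERY)}

  if __name__ == "__main__":
      for router_id, count in routers_with_multiple_active_agents().items():
          print(f"router {router_id} has {count} active L3 agents")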
There's already an API for this:

  neutron l3-agent-list-hosting-router <router_id>

It will show you the HA state per L3 agent for the given router.
Hi,

Yes, for one router - but doing this in a loop for hundreds is not so performant ;)

Fabian
Hi all:

What Fabian is describing is exactly the problem we're having: there are _many_ routers in these environments, so we'd be looking at N requests, which can get out of control quickly.

Thanks,
Mohammed
--
Mohammed Naser
VEXXHOST, Inc.
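To make the cost being described here concrete, the naive API-side approach looks roughly like the sketch below: one routers list call plus one GET to the /l3-agents sub-resource per router on every scrape. The token handling and the ha_state field name are simplifying assumptions; a real exporter would authenticate through keystoneauth or openstacksdk and handle pagination.

  #!/usr/bin/env python3
  """Sketch of the per-router polling loop under discussion (N+1 API calls)."""
  import os
  import requests

  NEUTRON_URL = os.environ["NEUTRON_URL"]  # e.g. https://neutron.example.com:9696
  TOKEN = os.environ["OS_TOKEN"]           # pre-fetched keystone token (assumption)
  HEADERS = {"X-Auth-Token": TOKEN}

  def routers_with_multiple_active_agents():
      """Return {router_id: active_count} for routers with more than one active L3 agent."""
      routers = requests.get(f"{NEUTRON_URL}/v2.0/routers",
                             headers=HEADERS).json()["routers"]
      bad = {}
      for router in routers:  # one extra API call per router -- this is the painful part
          agents = requests.get(
              f"{NEUTRON_URL}/v2.0/routers/{router['id']}/l3-agents",
              headers=HEADERS,
          ).json()["agents"]
          # ha_state is assumed to be 'active'/'standby', as in the CLI output.
          active = sum(1 for agent in agents if agent.get("ha_state") == "active")
          if active > 1:
              bad[router["id"]] = active
      return bad

  if __name__ == "__main__":
      for router_id, count in routers_with_multiple_active_agents().items():
          print(f"router {router_id} has {count} active L3 agents")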
I think it's a clear use case to implement a new API endpoint that returns HA state per agent for *all* routers in a single call. Should be easy to implement.
Hi,

On Mon, Aug 17, 2020 at 11:39:44AM -0400, Assaf Muller wrote:
I think it's a clear use case to implement a new API endpoint that returns HA state per agent for *all* routers in a single call. Should be easy to implement.
I agree with that. Can you maybe propose an official RFE for it and describe your use case there? See [1] for details.
[1] https://docs.openstack.org/neutron/latest/contributor/policies/blueprints.ht...

--
Slawek Kaplonski
Principal software engineer
Red Hat
Hi Mohammed and all,

On Monday, 17 August 2020 14:01:55 CEST, Mohammed Naser wrote:
Over the past few days, we were troubleshooting an issue whose root cause turned out to be keepalived somehow ending up active on two different L3 agents. We've yet to find out how this happened, but removing and re-adding the router resolved the issue for us.
We've also seen that behaviour occasionally. The root cause is also unclear to us (so we would love to hear about it). We have anecdotal evidence that a rabbitmq failure was involved, although that makes no sense to me personally. Other causes may be incorrectly cleaned-up namespaces: for example, when you kill or hard-restart the l3 agent, the namespaces stay around, possibly with the IP address still assigned; the keepaliveds on the other l3 agents no longer see the VRRP advertisements and will ALSO assign the IP address. This is also not always rectified by a restart and may require manual namespace cleanup with a tool, a node reboot, or an agent disable/enable cycle.
As we work on improving our monitoring, we wanted to implement something that gives us the number of active L3 agents per router, so we can check whether any router has more than one active agent. That's hard, though, because hitting the /l3-agents endpoint on _every_ single router hurts performance a lot.
Is there something else that we can watch which might be more productive? FYI -- this is all done in the open and will end up inside the openstack-exporter: https://github.com/openstack-exporter/openstack-exporter and the Helm charts will end up with the alerts: https://github.com/openstack-exporter/helm-charts
While I don't think it fits in your openstack-exporter design, we are currently using the attached script (which we also hereby publish under the terms of the Apache 2.0 license [1]). (Sorry, I lack the time to cleanly publish it somewhere right now.)

It checks the state files maintained by the L3 agent conglomerate and exports metrics about the master-ness of the routers as Prometheus metrics. Note that this is slightly dangerous, since router IDs are high-cardinality and using them as a label value in Prometheus is discouraged; you may not want to do this in a public cloud setting.

Either way: this allows us to alert on routers where there is not exactly one master state. The downside is that it has to run locally on the l3 agent nodes. The upside is that it is very efficient, and it will also show the master state in some cases where the router was not cleaned up properly (e.g. because the l3 agent and its keepaliveds were killed).

kind regards,
Jonas

[1]: http://www.apache.org/licenses/LICENSE-2.0

--
Jonas Schäfer
DevOps Engineer

Cloud&Heat Technologies GmbH
Königsbrücker Straße 96 | 01099 Dresden
+49 351 479 367 37
jonas.schaefer@cloudandheat.com | www.cloudandheat.com

New Service: Managed Kubernetes designed for AI & ML
https://managed-kubernetes.cloudandheat.com/

Commercial Register: District Court Dresden
Register Number: HRB 30549
VAT ID No.: DE281093504
Managing Director: Nicolas Röhrs
Authorized signatory: Dr. Marius Feldmann
Authorized signatory: Kristina Rübenkamp
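The attached script itself is not reproduced in the thread, but a minimal sketch of the same idea looks like the following: read the keepalived state files the L3 agent maintains on each node and expose a per-router gauge. The /var/lib/neutron/ha_confs/<router_id>/state path and the 'master'/'backup' file contents are assumptions that depend on your state_path setting and Neutron version, and the high-cardinality caveat above applies equally here.

  #!/usr/bin/env python3
  """Sketch of a node-local exporter for keepalived HA router states.

  Assumptions: state files live at /var/lib/neutron/ha_confs/<router_id>/state
  (depends on state_path) and contain 'master' or 'backup'; prometheus_client
  is installed.
  """
  import glob
  import os
  import time

  from prometheus_client import Gauge, start_http_server

  STATE_GLOB = "/var/lib/neutron/ha_confs/*/state"  # adjust to your state_path

  ROUTER_MASTER = Gauge(
      "neutron_l3_ha_router_master",
      "1 if keepalived on this node reports master for the router, else 0",
      ["router_id"],
  )

  def scrape_once():
      for state_file in glob.glob(STATE_GLOB):
          router_id = os.path.basename(os.path.dirname(state_file))
          with open(state_file) as fh:
              state = fh.read().strip()
          ROUTER_MASTER.labels(router_id=router_id).set(1 if state == "master" else 0)

  if __name__ == "__main__":
      start_http_server(9170)  # arbitrary port for node-local scraping
      while True:
          scrape_once()
          time.sleep(30)

With something like this running on every L3 agent node, an alert can fire whenever the sum of the gauge across nodes for a given router is not exactly 1.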
Insert shameless plug for the Neutron OVN backend. One of its advantages is that its L3 HA architecture is cleaner and more scalable (this is coming from the dude that wrote the L3 HA code we're all suffering from =D). The ML2/OVS L3 HA architecture has its issues - I've seen it work at hundreds of customer sites at scale, so I don't want to knock it too much, but just a day ago I got an internal customer ticket about keepalived falling over on a particular router that has 200 floating IPs. It works, but it's not perfect. I'm sure the OVN implementation isn't either, but it's simply cleaner and has fewer moving parts. It uses BFD to monitor the tunnel endpoints, so failover is faster too. Plus, it doesn't use keepalived.
OVN is something we're looking at and we're very excited about; unfortunately, there seem to be a bunch of gaps in the documentation right now, and a lot of the migration scripts to OVN are TripleO-y. So it'll take time to get us there, but yes, OVN simplifies this greatly.
--
Mohammed Naser
VEXXHOST, Inc.
Can you elaborate? If you can write down a list of gaps we can address that.
On Tue, Aug 18, 2020 at 8:12 AM Jonas Schäfer <jonas.schaefer@cloudandheat.com> wrote:
While I don’t think it fits in your openstack-exporter design, we are currently using the attached script (which we also hereby publish under the terms of the Apache 2.0 license [1]). (Sorry, I lack the time to cleanly publish it somewhere right now.)
This seems sweet. Let me go over the code. I might package this up into something consumable and host it inside OpenDev, if that's okay with you?
--
Mohammed Naser
VEXXHOST, Inc.
Yes, sure. I would have proposed it for x/osops-tools-contrib myself, but unfortunately I'm very short on time to work on this right now. So thanks for taking this on.

kind regards,

--
Jonas Schäfer
DevOps Engineer
Cloud&Heat Technologies GmbH
jonas.schaefer@cloudandheat.com | www.cloudandheat.com
participants (5)
- Assaf Muller
- Fabian Zimmermann
- Jonas Schäfer
- Mohammed Naser
- Slawek Kaplonski