[neutron] detecting l3-agent readiness
Hi folks,
I'm working on improving the stability of rollouts when using Kubernetes as a control plane, specifically around the L3 agent. I have not found a clear way to detect, in the code path, when the L3 agent has finished its initial sync.
Am I missing it somewhere, or is the architecture built in a way that doesn't really answer that question?
Thanks
Mohammed
-- Mohammed Naser VEXXHOST, Inc.
Hello Mohammed:
So far we don't have any mechanism to report the sync status of an agent. I know that, for example, the DHCP agent logs an INFO message with the statement 'Synchronizing state complete'. Other agents either don't provide this information or require you to watch the logs manually to detect it.
Because this could be useful information, I'll open an RFE bug to try to bring it to the existing agents.
Regards.
On Sun, Mar 12, 2023 at 11:11 AM Mohammed Naser <mnaser@vexxhost.com> wrote:
> I have not found a clear way to detect, in the code path, when the L3 agent has finished its initial sync.
It looks like this has sparked a cool ops discussion.
I've made an attempt here, though I am not sure how I feel about it yet.
https://github.com/vexxhost/atmosphere/pull/359/files
I have not tested it extensively, but it would be good to hear from the Neutron team on this approach vs. the approach from Felix.
On Mon, Mar 13, 2023 at 12:07 PM Rodolfo Alonso Hernandez <ralonsoh@redhat.com> wrote:
> So far we don't have any mechanism to report the sync status of an agent.
Technically it is correct, but you can imagine what my answer is about enabling the green thread backdoors. This functionality is for troubleshooting only and should not be enabled in a production environment.
Just as a temporary workaround, we can add INFO messages in the "periodic_sync_routers_task" method that you can easily parse by reading the logs. This patch could also be backported to stable versions.
Bug for reporting full sync state in Neutron agents: https://bugs.launchpad.net/neutron/+bug/2011422
On Mon, Mar 13, 2023 at 12:24 PM Mohammed Naser <mnaser@vexxhost.com> wrote:
> I've made an attempt here, though I am not sure how I feel about it yet.
> https://github.com/vexxhost/atmosphere/pull/359/files
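The log-parsing workaround could be sketched roughly as follows. The DHCP agent message 'Synchronizing state complete' is quoted from this thread; the function name, the check for an ` INFO ` level token, and the idea of passing the marker as a parameter are assumptions for illustration, since the wording of any L3 agent message would depend on the patch that eventually lands.

```python
from typing import Iterable

# Quoted in this thread as the DHCP agent's INFO message.
DHCP_SYNC_MARKER = "Synchronizing state complete"


def sync_completed(log_lines: Iterable[str], marker: str) -> bool:
    """Return True if any INFO-level line contains the marker text.

    Assumes each log line carries its level as a separate token, for
    example "... 12345 INFO neutron.agent.dhcp.agent [-] <message>".
    """
    return any(" INFO " in line and marker in line for line in log_lines)
```

A readiness probe could open the agent's log file, pass the file object to `sync_completed` together with whatever marker the deployed agent version emits, and exit non-zero when it returns False.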
Hi Mohammed,
> I have not found a clear way to detect, in the code path, when the L3 agent has finished its initial sync.
We built such a solution here: https://gitlab.com/yaook/images/neutron-l3-agent/-/blob/devel/files/startup_... Basically, we check against the Neutron API which routers should be on the node and then validate that all keepalived processes are up and running.
> Am I missing it somewhere, or is the architecture built in a way that doesn't really answer that question?
Adding an option in the Neutron API would be a lot nicer. But I guess that also applies to the L2 and DHCP agents.
-- Felix Huettner
This e-mail may contain confidential content and is intended only for use by the intended recipient. If you are not the intended recipient, please notify the sender immediately and delete this e-mail. Information on data protection can be found here<https://www.datenschutz.schwarz>.
Hi,
On Monday, 13 March 2023 16:35:43 CET, Felix Hüttner wrote:
> Basically, we check against the Neutron API which routers should be on the node and then validate that all keepalived processes are up and running.
That would work only for HA routers. If you also have routers that aren't "ha", this method may fail.
-- Slawek Kaplonski Principal Software Engineer Red Hat
Hi,
>> We built such a solution here: https://gitlab.com/yaook/images/neutron-l3-agent/-/blob/devel/files/startup_wait_for_ns.py Basically, we check against the Neutron API which routers should be on the node and then validate that all keepalived processes are up and running.
> That would work only for HA routers. If you also have routers that aren't "ha", this method may fail.
Yep, since we only have HA routers, that works fine for us. But I guess it should also work for non-HA routers without too much adaptation (maybe just check for namespaces instead of keepalived).
-- Felix Huettner
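The idea of checking for namespaces instead of keepalived might look roughly like the sketch below. It relies on the `qrouter-<router id>` naming that the L3 agent uses for router namespaces; the function names are hypothetical, and the list of expected router IDs is assumed to have been fetched from the Neutron API beforehand. This is not the script linked above, only an illustration of the comparison step.

```python
import subprocess
from typing import Iterable, List, Set

ROUTER_NS_PREFIX = "qrouter-"


def missing_router_namespaces(expected_router_ids: Iterable[str],
                              namespaces: Iterable[str]) -> Set[str]:
    """Return the expected router IDs that have no qrouter namespace yet."""
    present = {
        name[len(ROUTER_NS_PREFIX):]
        for name in namespaces
        if name.startswith(ROUTER_NS_PREFIX)
    }
    return set(expected_router_ids) - present


def list_namespaces() -> List[str]:
    """List the network namespaces on this host via `ip netns list`."""
    output = subprocess.run(
        ["ip", "netns", "list"], check=True, capture_output=True, text=True
    ).stdout
    # Lines may look like "qrouter-<uuid> (id: 3)"; keep only the name.
    return [line.split()[0] for line in output.splitlines() if line.strip()]
```

A startup check would call `missing_router_namespaces(expected_ids, list_namespaces())` and treat the agent as not ready while the result is non-empty.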
On Wed, 2023-03-15 at 16:10 +0000, Felix Hüttner wrote:
> Yep, since we only have HA routers, that works fine for us. But I guess it should also work for non-HA routers without too much adaptation (maybe just check for namespaces instead of keepalived).
Instead of counting processes, I have been using the L3 agent's `configurations.routers` field to determine its readiness. From my understanding, comparing this number with the number of active routers hosted by the agent should be a good indicator of its sync status. Using two API calls for this is inherently racy, but it could be a sufficient workaround for environments with a moderate number of router events. So a simple test snippet for the sync status of all agents could be:
```
import sys

import openstack

client = openstack.connection.Connection(
    ...
)

# One boolean per L3 agent: True when the router count the agent reports
# covers every admin-up router scheduled to it.
l3_agent_synced = [
    len([
        router
        for router in client.network.agent_hosted_routers(agent)
        if router.is_admin_state_up
    ]) <= client.network.get_agent(agent).configuration["routers"]
    for agent in client.network.agents()
    # Only consider L3 agents running in dvr_snat or legacy mode.
    if agent.agent_type == "L3 agent"
    and agent.configuration["agent_mode"] in ("dvr_snat", "legacy")
]

# Exit non-zero if any agent has not caught up yet.
if not all(l3_agent_synced):
    sys.exit(1)
```
Please let me know if I am way off with this approach :)
-- Jan Horstmann
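Because the comparison relies on two separate API calls, one way to soften the race might be to require several consecutive passing checks before treating an agent as synced. The helper below is a generic sketch under that assumption; its name, parameters and defaults are hypothetical and not part of Neutron or openstacksdk. `check` is any zero-argument callable that returns True when the comparison passes.

```python
import time
from typing import Callable


def wait_until_stable(check: Callable[[], bool],
                      required_successes: int = 3,
                      interval: float = 5.0,
                      timeout: float = 300.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll check() until it passes required_successes times in a row.

    Returns True once the streak is reached, or False when the timeout
    expires first. A single failing check resets the streak.
    """
    deadline = time.monotonic() + timeout
    streak = 0
    while True:
        streak = streak + 1 if check() else 0
        if streak >= required_successes:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Requiring a streak does not remove the race; it only makes a false positive caused by a single unlucky pair of API responses less likely.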
Hello:
I think I'm repeating myself here, but we have two approaches to solve this problem:
* The easiest one, at least for the L3 agent, is to log an INFO level message before and after the full sync. Any tool could parse that to detect the sync. You can propose a patch to the Neutron repository.
* https://bugs.launchpad.net/neutron/+bug/2011422: a more elaborate way to report the agent status. It could provide the start flag, the revived flag, the sync processing flag and many other flags that could be defined only for this specific agent.
Regards.
On Mon, Mar 20, 2023 at 4:33 PM Jan Horstmann <J.Horstmann@mittwald.de> wrote:
> Instead of counting processes, I have been using the L3 agent's `configurations.routers` field to determine its readiness.
On Monday, March 20, 2023 at 12:09 PM, Rodolfo Alonso Hernandez <ralonsoh@redhat.com> wrote:
> * The easiest one, at least for the L3 agent, is to log an INFO level message before and after the full sync. Any tool could parse that to detect the sync. You can propose a patch to the Neutron repository.
I've kicked this off with this:
https://review.opendev.org/c/openstack/neutron/+/878248 fix: add log message for periodic_sync_routers_task fullsync [NEW]
Hello,
Interesting thread!
We are also interested in this for use when upgrading services. We are currently doing our best to parse the logs, but that's only for the OVS agent, and I was going to look into this. I can imagine that having something like this for containers would be crucial as well.
Best regards
Tobias
On 12 Mar 2023, at 11:09, Mohammed Naser <mnaser@vexxhost.com> wrote:
> I have not found a clear way to detect, in the code path, when the L3 agent has finished its initial sync.
On 13/03/2023 19:46, Tobias Urdin wrote:
> Interesting thread!
+1
Most installations run into this issue of wondering when a network node is really ready / fully synced.
While the tooling that Mohammed or Felix built does work in "observing" or "determining" the sync state independently, I strongly believe a network agent should report its sync state back to the control plane. Orchestration of e.g. rolling upgrades of agents should be possible with state information provided by Neutron itself and should not require external tooling.
By implementing the state data structure and then having the drivers (OVN, OVS, linuxbridge) report this back, this becomes independent of the particular implementation details (network NS, certain processes running, ...).
Looking at this problem, the taints and tolerations model used for node "readiness" in Kubernetes comes to mind (https://kubernetes.io/docs/reference/labels-annotations-taints/#node-kuberne...).
Regards
Christian
participants (7)
- Christian Rohmann
- Felix Hüttner
- Jan Horstmann
- Mohammed Naser
- Rodolfo Alonso Hernandez
- Slawek Kaplonski
- Tobias Urdin