[neutron][largescale-sig] Debugging and tracking missing flows with l2pop
On 11.03.20 - 14:29, Krzysztof Klimonda wrote:

Hi,

(This is a Stein deployment with the 14.0.2 neutron release.)

I’ve just spent some time debugging a missing connection between two VMs running on OpenStack Stein with ovs+l2pop enabled. The direct cause was missing flows in table 20 and a very incomplete flood flow in table 22. Restarting neutron-openvswitch-agent on that host fixed the issue.

The last time we encountered missing flood flows (in another, Pike-based deployment), we tracked it down to https://review.opendev.org/#/c/600151/ and it has been stable since then.

My initial thought was that we were hitting the same bug: a couple of VMs are scheduled on the same compute, 3 ports are activated at the same time, and the flood entry is not broadcast to the other computes. However, this issue was only affecting one of the computes, and it was the only one missing both MAC entries in table 20 and VXLAN tunnels in table 22.

The only other idea I have is that the compute with the missing flows did not receive them from rabbitmq, but I see nothing in the logs that would suggest that the agent was disconnected from rabbitmq.

So at this point I have three questions:

- What would be a good place to look next to track down those missing flows?
- For other operators, how stable do you find l2pop in general?
- If you have problems with missing flows in your environment, do you try to monitor your deployment for that?

-Chris
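As a rough illustration of the symptom described above, the following sketch checks a flow dump for the unicast MAC entries (table 20) and the flood outputs (table 22). It is only a sketch under stated assumptions: the bridge name `br-tun`, the helper names, and the sample dump are illustrative and not taken from the thread, and the text format of `ovs-ofctl dump-flows` output can vary between Open vSwitch versions.

```python
import re
import subprocess

# Illustrative sample only, shaped like `ovs-ofctl dump-flows` output.
# It is NOT taken from a real host or from the thread.
SAMPLE_DUMP = """\
 cookie=0x1, duration=10.0s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=1,dl_dst=fa:16:3e:00:00:01 actions=strip_vlan,load:0x3e9->NXM_NX_TUN_ID[],output:3
 cookie=0x1, duration=10.0s, table=20, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,22)
 cookie=0x1, duration=10.0s, table=22, n_packets=0, n_bytes=0, priority=1,dl_vlan=1 actions=strip_vlan,load:0x3e9->NXM_NX_TUN_ID[],output:3,output:4
"""

TABLE_RE = re.compile(r"\btable=(\d+)")
MAC_RE = re.compile(r"\bdl_dst=([0-9A-Fa-f]{2}(?::[0-9A-Fa-f]{2}){5})")
# Some OVS versions print port names in quotes, so the quotes are optional.
OUTPUT_RE = re.compile(r'\boutput:"?([\w.-]+)"?')


def dump_flows(bridge="br-tun"):
    """Return the raw flow dump of a bridge (assumed name; usually needs root)."""
    result = subprocess.run(
        ["ovs-ofctl", "dump-flows", bridge],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def lines_in_table(dump, table):
    """Yield only the flow lines that belong to the given table number."""
    for line in dump.splitlines():
        match = TABLE_RE.search(line)
        if match and int(match.group(1)) == table:
            yield line


def unicast_macs(dump, table=20):
    """Collect the destination MACs that have a unicast flow in the table."""
    macs = set()
    for line in lines_in_table(dump, table):
        match = MAC_RE.search(line)
        if match:
            macs.add(match.group(1).lower())
    return macs


def flood_ports(dump, table=22):
    """Collect every output port referenced by flood flows in the table."""
    ports = set()
    for line in lines_in_table(dump, table):
        ports.update(OUTPUT_RE.findall(line))
    return ports


def missing_macs(dump, expected):
    """Return the expected MACs that have no unicast flow in the dump."""
    return {mac.lower() for mac in expected} - unicast_macs(dump)
```

On a live compute, `missing_macs(dump_flows(), expected)` would take the place of the sample dump. Where the `expected` MAC list comes from (the neutron API, a database replica, or something else) is left open here.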
On 12 Mar 2020, at 07:46, Arnaud Morin <arnaud.morin@gmail.com> wrote:

Hey Krzysztof,

In my company we don't use l2pop. I remember that it has some downsides when scaling a lot (more than 1k computes in a region), but I don't remember the details.

Anyway, our agent is based on the OVS agent, which also uses OpenFlow rules. We monitor the OpenFlow rules outside of neutron with custom tools. We do that mainly for 2 reasons:

- we want to make sure that neutron won't leak any rule, which could be very harmful
- we want to make sure that neutron did not miss any rule when configuring a specific port, which could lead to a broken network connection for our clients.

We track the missing OpenFlow rules on the compute itself, because we don't want to rely on a centralized system for that. To do that, we found a way to pull information about ports on the compute itself, from the neutron server and database.

Cheers,

-- Arnaud Morin
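The two checks described above (no leaked rule, no missed rule) amount to comparing the set of rules that should exist against the set that is installed. The sketch below shows one possible shape for that comparison. It is not the tooling mentioned in the thread: the `FlowKey` fields, the port record layout, and the sample values are all hypothetical.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class FlowKey(NamedTuple):
    """Hypothetical identity of one per-port rule: table, local VLAN, MAC."""
    table: int
    vlan: int
    mac: str


def expected_keys(ports: Iterable[dict], table: int = 20) -> Set[FlowKey]:
    """Build the rules that should exist from port records.

    Each record is assumed to look like {"mac": "...", "vlan": 1}; this
    layout is made up for the example.
    """
    return {FlowKey(table, port["vlan"], port["mac"].lower()) for port in ports}


def audit(expected: Set[FlowKey],
          installed: Set[FlowKey]) -> Tuple[Set[FlowKey], Set[FlowKey]]:
    """Return (missing, leaked).

    missing: rules that should be installed but are not
    leaked:  rules that are installed but belong to no known port
    """
    return expected - installed, installed - expected


# Hypothetical data: one port is fine, one lost its rule, one rule is stale.
ports = [
    {"mac": "FA:16:3E:00:00:01", "vlan": 1},
    {"mac": "fa:16:3e:00:00:02", "vlan": 1},
]
installed = {
    FlowKey(20, 1, "fa:16:3e:00:00:01"),
    FlowKey(20, 2, "fa:16:3e:00:00:99"),
}
missing, leaked = audit(expected_keys(ports), installed)
```

Keeping the comparison as plain set arithmetic means the same function can run locally on each compute, in line with the point about not relying on a centralized system.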
On 12.03.20 - 14:12, Krzysztof Klimonda wrote:

Thanks.

Do your tools query neutron for ports, or do you query the database directly? I’m a bit concerned about having ~100 nodes query neutron for a list of ports and flows every minute or so, and how much extra load that will add on our neutron-server.

What do you mean by neutron leaking rules? Is it security group rules that you are concerned about?

-Chris
Arnaud Morin wrote:

Hello,

We had the same concerns as you, which is why we have another API in front of our databases (on a read node). This API is designed to do read-only access to the DB and to bypass the OpenStack API. We do that because OpenStack API calls are sometimes less performant or do not give the info in the way we would like. So, to avoid making multiple calls to retrieve one piece of info, we built this custom internal API. Our tool that checks the OpenFlow rules on the compute calls this API to get the neutron port info.

By neutron leaking rules, I mean OpenFlow rules.

Cheers,

-- Arnaud Morin
participants (2)
- Arnaud Morin
- Krzysztof Klimonda