[neutron][largescale-sig] Debugging and tracking missing flows with l2pop
kklimonda at syntaxhighlighted.com
Thu Mar 12 13:12:30 UTC 2020
Do your tools query neutron for ports, or do you query the database directly? I’m a bit concerned about having ~100 nodes query neutron for a list of ports and flows every minute or so, and about how much extra load that would add on our neutron-server.
What do you mean by neutron leaking rules? Is it security group rules that you are concerned about?
> On 12 Mar 2020, at 07:46, Arnaud Morin <arnaud.morin at gmail.com> wrote:
> Hey Krzysztof,
> In my company we don't use l2pop. I remember that it has some downsides
> when scaling a lot (more than 1k computes in a region), but I don't
> remember the details.
> Anyway, our agent is based on an OVS agent, which also uses OpenFlow.
> We monitor the OpenFlow rules outside of neutron with custom tools.
> We do that mainly for 2 reasons:
> - we want to make sure that neutron won't leak any rules, as this could be
> very harmful
> - we want to make sure that neutron did not miss any rule when
> configuring a specific port, which could lead to a broken network
> connection for our clients.
> We track the missing OpenFlow rules on the compute itself, because we
> don't want to rely on a centralized system for that. So, to do that, we
> found a way to pull information about ports on the compute itself, from
> neutron server and database.
> Arnaud Morin
> On 11.03.20 - 14:29, Krzysztof Klimonda wrote:
>> (This is stein deployment with 14.0.2 neutron release)
>> I’ve just spent some time debugging a missing connection between two VMs running on OS stein with ovs+l2pop enabled. The direct cause was missing flows in table 20 and a very incomplete flood flow in table 22. Restarting neutron-openvswitch-agent on that host fixed the issue.
>> The last time we encountered missing flood flows (in another pike-based deployment), we tracked it down to https://review.opendev.org/#/c/600151/ and it has been stable since then.
>> My initial thought was that we were hitting the same bug: a couple of VMs are scheduled on the same compute, 3 ports are activated at the same time, and the flood entry is not broadcast to other computes. However, this time the issue affected only one of the computes, and it was the only one missing both MAC entries in table 20 and VXLAN tunnels in table 22.
>> The only other idea I have is that the compute with missing flows has not received them from rabbitmq, but I see nothing in the logs that would suggest that the agent was disconnected from rabbitmq.
>> So at this point I have three questions:
>> - what would be a good place to look next to track down those missing flows?
>> - for other operators, how stable do you find l2pop in general? And if you have problems with missing flows in your environment, do you try to monitor your deployment for that?
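
As an illustration of the kind of per-compute check discussed in this thread, here is a minimal Python sketch that compares a flow dump against a set of expected entries. It is not the tooling Arnaud describes. It assumes the tunnel-bridge layout mentioned above (unicast MAC entries matched by dl_dst in table 20, flood outputs in table 22), it ignores VLANs, and the function names and the expected MAC/port sets are hypothetical; how the expected sets are obtained (neutron API or database) is left open.

```python
import re


def parse_flows(dump_text):
    """Extract table 20 destination MACs and table 22 output ports.

    dump_text is assumed to be the text printed by a flow dump of the
    tunnel bridge, one flow per line.
    """
    unicast_macs = set()
    flood_ports = set()
    for line in dump_text.splitlines():
        table = re.search(r"\btable=(\d+)", line)
        if not table:
            continue
        if table.group(1) == "20":
            # Unicast entries match on the destination MAC.
            mac = re.search(r"dl_dst=([0-9a-fA-F:]{17})", line)
            if mac:
                unicast_macs.add(mac.group(1).lower())
        elif table.group(1) == "22":
            # Flood entries list one output action per tunnel port.
            _, _, actions = line.partition("actions=")
            flood_ports.update(re.findall(r'output:"?([^",\s]+)"?', actions))
    return unicast_macs, flood_ports


def find_missing(dump_text, expected_macs, expected_ports):
    """Return (missing_macs, missing_ports) relative to the expected sets."""
    macs, ports = parse_flows(dump_text)
    missing_macs = {m.lower() for m in expected_macs} - macs
    missing_ports = set(expected_ports) - ports
    return missing_macs, missing_ports
```

On a compute, the dump text could come from running `ovs-ofctl dump-flows br-tun` (assuming the tunnel bridge is named br-tun) and passing its output to find_missing(); any non-empty result would point at flows that the agent did not install.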