[neutron][largescale-sig] Debugging and tracking missing flows with l2pop
Arnaud Morin
arnaud.morin at gmail.com
Thu Mar 12 06:46:49 UTC 2020
Hey Krzysztof,
In my company we don't use l2pop. I remember that it has some downsides
when scaling a lot (more than 1k computes in a region), but I don't
remember the details.
Anyway, our agent is based on the OVS agent, which also uses OpenFlow
rules.
We monitor the OpenFlow rules outside of neutron with custom tools.
We do that mainly for 2 reasons:
- we want to make sure that neutron won't leak any rule, as this could be
very harmful
- we want to make sure that neutron did not miss any rule when
configuring a specific port, which could lead to a broken network
connection for our clients.
We track the missing OpenFlow rules on the compute itself, because we
don't want to rely on a centralized system for that. So, to do that, we
found a way to pull information about ports on the compute itself, from
neutron server and database.
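
A minimal sketch of this kind of check, not the actual tooling described
above: it compares the MAC addresses a compute is expected to know about
with the dl_dst matches found in `ovs-ofctl dump-flows` output, and reports
both missing and leaked entries. The function names, the sample dump and
the choice of table 20 on br-tun (the table mentioned in the quoted message)
are assumptions made for illustration. How the expected MAC list is obtained
from neutron is left out.

```python
import re
import subprocess

TABLE_RE = re.compile(r"\btable=(\d+)\b")
DL_DST_RE = re.compile(r"\bdl_dst=([0-9a-fA-F:]{17})\b")


def unicast_macs(dump_text, table=20):
    """Collect the dl_dst MACs matched by flows in the given table."""
    macs = set()
    for line in dump_text.splitlines():
        t = TABLE_RE.search(line)
        d = DL_DST_RE.search(line)
        if t and d and int(t.group(1)) == table:
            macs.add(d.group(1).lower())
    return macs


def diff_flows(expected_macs, dump_text, table=20):
    """Return MACs that are expected but absent, and present but unexpected."""
    present = unicast_macs(dump_text, table)
    expected = {m.lower() for m in expected_macs}
    return {
        "missing": sorted(expected - present),
        "leaked": sorted(present - expected),
    }


def dump_flows(bridge="br-tun"):
    """Run ovs-ofctl on the local host (bridge name is an assumption)."""
    result = subprocess.run(
        ["ovs-ofctl", "dump-flows", bridge],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# Illustrative dump only; real output carries more fields per flow.
SAMPLE_DUMP = """\
 cookie=0x0, duration=10.0s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=1,dl_dst=fa:16:3e:aa:bb:cc actions=strip_vlan,output:3
 cookie=0x0, duration=10.0s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=1,dl_dst=fa:16:3e:dd:ee:ff actions=strip_vlan,output:4
 cookie=0x0, duration=10.0s, table=22, n_packets=0, n_bytes=0, priority=1,dl_vlan=1 actions=strip_vlan,output:3,output:4
"""

if __name__ == "__main__":
    # Hypothetical expected list; in practice it would come from neutron.
    expected = ["fa:16:3e:aa:bb:cc", "fa:16:3e:11:22:33"]
    print(diff_flows(expected, SAMPLE_DUMP))
```

On a real compute, `dump_flows()` would replace `SAMPLE_DUMP`, and a
non-empty "missing" or "leaked" list would be what raises an alert.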
Cheers,
--
Arnaud Morin
On 11.03.20 - 14:29, Krzysztof Klimonda wrote:
> Hi,
>
> (This is stein deployment with 14.0.2 neutron release)
>
> I’ve just spent some time debugging a missing connection between two VMs running on OS stein with ovs+l2pop enabled and the direct cause was missing flows in table 20 and a very incomplete flood flow in table 22. Restarting neutron-openvswitch-agent on that host has fixed the issue.
>
> Last time we’ve encountered missing flood flows (in another pike-based deployment), we tracked it down to https://review.opendev.org/#/c/600151/ and since then it was stable.
>
> My initial thought was that we were hitting the same bug - a couple of VMs are scheduled on the same compute, 3 ports are activated at the same time, and the flood entry is not broadcasted to other computes. However that issue was only affecting one of the computes, and it was the only one missing both MAC entries in table 20 and VXLAN tunnels in table 22.
>
> The only other idea I have is that the compute with missing flows has not received them from rabbitmq, but I see nothing in the logs that would suggest that the agent was disconnected from rabbitmq.
>
> So at this point I have three questions:
>
> - what would be a good place to look next to track down those missing flows
> - for other operators, how stable do you find l2pop in general? and if you have problems with missing flows in your environment, do you try to monitor your deployment for that?
>
> -Chris