[neutron][largescale-sig] Debugging and tracking missing flows with l2pop

Arnaud Morin arnaud.morin at gmail.com
Thu Mar 12 06:46:49 UTC 2020


Hey Krzysztof,

In my company we don't use l2pop. I remember that it has some downsides
when scaling a lot (more than 1k computes in a region), but I don't
remember the details.

Anyway, our agent is based on the OVS agent, which also uses OpenFlow
rules.
We monitor the OpenFlow rules outside of neutron with custom tools.
We do that mainly for 2 reasons:
- we want to make sure that neutron won't leak any rule, which could be
  very harmful
- we want to make sure that neutron did not miss any rule when
  configuring a specific port, which could lead to a broken network
  connection for our clients.

We track the missing OpenFlow rules on the compute itself, because we
don't want to rely on a centralized system for that. To do that, we
found a way to pull information about ports on the compute itself, from
neutron server and database.
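
As a rough illustration of that kind of check (a sketch only, not the
actual tooling mentioned above), the snippet below compares the
destination MACs found in one table of `ovs-ofctl dump-flows` output
against a set of MACs that are expected to be there. The bridge name
`br-tun`, the helper names, and the way the expected MACs are obtained
are all assumptions; in practice the expected set would come from
whatever port information is pulled from neutron.

```python
import re
import subprocess

# Patterns for the table number and destination MAC on one line of
# `ovs-ofctl dump-flows` output.
TABLE_RE = re.compile(r"\btable=(\d+)")
DL_DST_RE = re.compile(r"\bdl_dst=([0-9a-fA-F]{2}(?::[0-9a-fA-F]{2}){5})")


def dump_flows(bridge="br-tun", table=20):
    """Return raw flow dump text for one table (bridge name is an assumption)."""
    result = subprocess.run(
        ["ovs-ofctl", "dump-flows", bridge, f"table={table}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def macs_in_table(dump_text, table=20):
    """Collect the dl_dst MACs of all flows that belong to the given table."""
    macs = set()
    for line in dump_text.splitlines():
        table_match = TABLE_RE.search(line)
        if table_match is None or int(table_match.group(1)) != table:
            continue
        mac_match = DL_DST_RE.search(line)
        if mac_match:
            macs.add(mac_match.group(1).lower())
    return macs


def missing_macs(dump_text, expected_macs, table=20):
    """Return expected MACs that have no unicast flow in the given table."""
    expected = {mac.lower() for mac in expected_macs}
    return expected - macs_in_table(dump_text, table)
```

On a compute this could be run as
`missing_macs(dump_flows(), expected_macs)`, where `expected_macs` is a
hypothetical set of remote port MACs for the networks present on that
host; a non-empty result would point at flows that were never
installed.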

Cheers,

-- 
Arnaud Morin

On 11.03.20 - 14:29, Krzysztof Klimonda wrote:
> Hi,
> 
> (This is stein deployment with 14.0.2 neutron release)
> 
> I’ve just spent some time debugging a missing connection between two VMs running on OS stein with ovs+l2pop enabled and the direct cause was missing flows in table 20 and a very incomplete flood flow in table 22. Restarting neutron-openvswitch-agent on that host has fixed the issue.
> 
> Last time we’ve encountered missing flood flows (in another pike-based deployment), we tracked it down to https://review.opendev.org/#/c/600151/ and since then it was stable. 
> 
> My initial thought was that we were hitting the same bug - a couple of VMs are scheduled on the same compute, 3 ports are activated at the same time, and the flood entry is not broadcast to other computes. However, that issue was only affecting one of the computes, and it was the only one missing both MAC entries in table 20 and VXLAN tunnels in table 22.
> 
> The only other idea I have is that the compute with missing flows has not received them from rabbitmq, but I see nothing in the logs that would suggest that the agent was disconnected from rabbitmq.
> 
> So at this point I have three questions:
> 
> - what would be a good place to look next to track down those missing flows
> - for other operators, how stable do you find l2pop in general? and if you have problems with missing flows in your environment, do you try to monitor your deployment for that?
> 
> -Chris
