Not getting full bandwidth VXLAN + DVR

Sean Mooney smooney at
Fri Jan 25 11:11:12 UTC 2019

On Fri, 2019-01-25 at 10:14 +0100, Yedhu Sastri wrote:
> Hello,
> In our OpenStack environment(Newton) we are using 10G network in all our nodes. We are using OVS bridging with VXLAN
> tunneling and DVR. We also enabled Jumbo frames in NIC and also in physical switches. We also enabled VXLAN offloading
> in our NIC. irqbalance is running which suppose to distribute the network irqs to all cores of the CPU. But
> unfortunately we are only getting below 1G bandwidth when communicate with our VM's with floating IP's from compute
> hosts. We tested it using iperf and results are like
> Host to VM using floating IP - less than 1Gbits/sec
Sorry for the complexity of the diagram, but with DVR your networking will look something like this

The diagram is actually incorrect in that there should not be a line between port interface 3 and interface 3,
as that implies you add a physical NIC to br-tun, which is not correct.

When iperf connects via the internet, the ingress flow is described here
but I will summarise it below.

Looking at the simplified diagram

A packet arrives on the datacenter WAN uplink and is switched to your WAN router.
The WAN router, having connectivity to the subnet of your floating IP,
generates an ARP request to discover the MAC of the floating IP.
The ARP request ingresses the compute node on interface 2, enters the br-provider bridge
(usually called br-ex) and crosses the patch port to br-int, where it exits via the fg port
labelled 6 and enters the fip namespace via the tap device created by OVS, labelled 7 in the diagram.

This tap has the floating IP assigned, so it responds to the ARP request. When your WAN router
receives the reply, it learns the destination MAC and routes the TCP stream from your iperf client to that MAC.
The iperf traffic takes the same path to the fip namespace, where it is intercepted by an iptables DNAT
rule which updates the destination IP to the private IP. It is then sent to the DVR namespace over a veth pair
labelled 8 and 9 in the diagram. Once the packet is received in the DVR namespace it is routed to OVS via
interface 10, after a similar ARP request to learn the destination MAC for the private IP.
The packet enters br-int and, if you are using the iptables firewall driver, exits via another veth pair and enters
the Linux bridge which the VM's tap device is connected to, and finally gets to the VM.
Note: if you are using the conntrack or noop firewall driver, the VM tap is added to br-int directly,
so the qbr Linux bridge and the veth pair shown as 12 and 13 will not exist.

Finally the VM receives the iperf connection and the response packets are sent back through the same path.

I went through that flow for two reasons. First, in the north-south path the network encapsulation used for the tenant
network, e.g. VXLAN, is irrelevant, as the packet is never VXLAN encapsulated. Second, there are several places where there
could be bottlenecks.

First, are you using 10G NICs for the br-ex/br-provider bridge?
Second, is the local tunnel endpoint IP assigned to this bridge?

The answer should be yes to both, and I will proceed as if it is.
If you are not using a 10G NIC for br-ex, then that is why you are seeing sub-1G speeds.

Next, you mention you are using jumbo frames. Assuming you are using a 9000 byte
MTU, I would expect the MTU of the Neutron VXLAN network to be 8950.
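The 8950 figure is the physical MTU minus the VXLAN encapsulation overhead. A minimal sketch of that arithmetic, assuming an IPv4 underlay with no IP options (the constant and function names are only illustrative):

```python
# Header sizes in bytes for VXLAN over an IPv4 underlay (assumption:
# no IP options; an IPv6 underlay would need 20 more bytes).
INNER_ETHERNET = 14
VXLAN_HEADER = 8
OUTER_UDP = 8
OUTER_IPV4 = 20


def vxlan_tenant_mtu(physical_mtu: int) -> int:
    """Return the largest tenant-network MTU that fits in one underlay packet."""
    overhead = INNER_ETHERNET + VXLAN_HEADER + OUTER_UDP + OUTER_IPV4
    return physical_mtu - overhead


print(vxlan_tenant_mtu(9000))  # 8950
print(vxlan_tenant_mtu(1500))  # 1450
```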

In this case, looking at
again, you should check that the MTUs are set correctly at the following locations.

Interface 2 should be set to your physical network MTU, which I'm assuming is 9000 in this example.
Interfaces 15 (VM tap), 14 (qbr bridge), 13 (qvb veth interface) and 10 (qr port in the DVR namespace)
should all have their MTU set to 8950.
Interfaces 9 (rfp), 8 (fpr) and 7 (fg) should be set to 9000.
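A rough sketch of how that check could be scripted. It assumes the interface names start with the prefixes used above (tap, qbr, qvb, qr-, rfp-, fpr-, fg-), that the MTUs are 9000/8950 as in this example, and that your iproute2 is recent enough to support JSON output via `ip -j`. The helper names are hypothetical. The physical NIC (interface 2) is left out because its name is site specific.

```python
import json
import subprocess

# Expected MTU per interface-name prefix. Both the prefixes and the
# 9000/8950 values are assumptions taken from this example.
EXPECTED_MTU = {
    "tap": 8950, "qbr": 8950, "qvb": 8950, "qr-": 8950,
    "rfp-": 9000, "fpr-": 9000, "fg-": 9000,
}


def find_mtu_mismatches(links, expected=EXPECTED_MTU):
    """Return (ifname, actual_mtu, wanted_mtu) for every link whose MTU differs."""
    mismatches = []
    for link in links:
        name = link["ifname"]
        for prefix, wanted in expected.items():
            if name.startswith(prefix) and link["mtu"] != wanted:
                mismatches.append((name, link["mtu"], wanted))
    return mismatches


def links_in_current_namespace():
    """Parse the JSON output of `ip -j link show` for the current namespace."""
    result = subprocess.run(["ip", "-j", "link", "show"],
                            check=True, capture_output=True, text=True)
    return json.loads(result.stdout)


# Made-up sample data in the shape `ip -j link show` produces.
sample = [
    {"ifname": "tap1234", "mtu": 1500},
    {"ifname": "qr-abcd", "mtu": 8950},
]
print(find_mtu_mismatches(sample))
```

On a real compute node you would pass `links_in_current_namespace()` instead of the sample. The fg, rfp/fpr and qr interfaces live inside the fip and DVR namespaces, so the script would have to be run inside each of those namespaces as well (for example with `ip netns exec`).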

When you do your testing with iperf you should set your MTU or packet size to 8950.
If you use 9000, the packets will have to be fragmented when they are routed from the rfp
interface to the qr interface in the DVR namespace, which will require the VM to reassemble them later.
This is the first bottleneck you will need to ensure is not present. When you are doing the VM to VM
testing it will use an MTU of 8950, as that is the MTU of the Neutron network and is
included in the DHCP reply.
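iperf is usually tuned through the TCP maximum segment size rather than the MTU itself (its `-M` option, if your build has it). A small sketch of the conversion, assuming IPv4 and TCP headers of 20 bytes each with no options:

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def tcp_mss(mtu: int) -> int:
    """Return the largest TCP payload that fits in one packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


print(tcp_mss(8950))  # 8910, fits the tenant network MTU
print(tcp_mss(9000))  # 8960, gives 9000 byte packets that exceed an 8950 MTU
```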

If you have validated that the MTUs are set correctly, the next step is to determine if packets are being dropped.

To do this you need to check interface 16 (the VM interface in the VM),
15 (the VM tap on the host), 13/12 (the veth between OVS and the Linux bridge),
10 (the DVR interface), 9/8 (the veth between the fip and DVR namespaces), 7 (the floating IP gateway port on OVS)
and finally 2, the uplink to the physical network.
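One way to spot where drops happen is to snapshot the kernel's per-interface counters before and after an iperf run and compare them. A minimal sketch, assuming the standard Linux sysfs layout under /sys/class/net/<iface>/statistics (the function names are only illustrative):

```python
from pathlib import Path

COUNTERS = ("rx_dropped", "tx_dropped", "rx_errors", "tx_errors")


def read_counters(iface, sysfs_root="/sys/class/net"):
    """Read the drop and error counters for one interface from sysfs."""
    stats_dir = Path(sysfs_root) / iface / "statistics"
    return {name: int((stats_dir / name).read_text()) for name in COUNTERS}


def counter_delta(before, after):
    """Return only the counters that changed between two snapshots."""
    return {name: after[name] - before[name]
            for name in before if after[name] != before[name]}


# Made-up snapshots taken before and after a test run.
before = {"rx_dropped": 10, "tx_dropped": 0, "rx_errors": 0, "tx_errors": 0}
after = {"rx_dropped": 250, "tx_dropped": 0, "rx_errors": 0, "tx_errors": 0}
print(counter_delta(before, after))  # {'rx_dropped': 240}
```

In practice you would call `read_counters()` for each interface in the list above, run iperf, call it again and pass both results to `counter_delta()`. Interfaces inside the fip and DVR namespaces are only visible from within those namespaces.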

If you see packet loss on the VM on either port 16 or 15, you can try to enable multiqueue for the virtio interface.
You do that by setting hw_vif_multiqueue_enabled=true in the image metadata and then
enabling multiqueue in the guest with ethtool -L <NIC> combined #num_of_queues.

If the packet loss is observed on the veth between the Linux bridge and OVS (13/12), then
you could change from the iptables firewall driver to the conntrack or noop firewall driver.

If the bottleneck is in the DVR router namespace between ports 10 and 9 and it is not caused by IP fragmentation,
then you are hitting a kernel limitation and you will need to tune the kernel to improve routing performance.

If the packet loss is between 8 and 7, you are hitting a Linux kernel DNAT bottleneck.
Again, some kernel options may be able to optimise this, but there is not much you can do.

If the packet loss is in RX on interface 2, you need to ensure that receive side scaling is enabled and the
NIC is configured to use multiple queues: ethtool -L <NIC> combined #num_of_queues. You should also ensure
that offloads such as LRO are enabled if available on your NIC.

If none of the above helps, then your only recourse is to evaluate other Neutron networking solutions, such
as OVN, which will implement DVR/FIP/NAT using OpenFlow rules in OVS.

I hope this helps.

> VM to VM using internal IP -  ~2.5Gbits/sec
> Any idea or solutions to solve this issue is much appreciated. 
