Good morning Sean,

Yes, I do have multi-queue enabled, and on guests I am seeing 8 queues by default. See the following output:

[root@c8-dpdk ~]# ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX:             0
TX:             0
Other:          0
Combined:       8
Current hardware settings:
RX:             0
TX:             0
Other:          0
Combined:       8

As you can see, I have 8 queues on the VM, but only 4 of them are active and the other 4 are doing nothing. I believe that is because I have only 4 PMD threads and my vswitch is configured with configured_rx_queues=4:

[root@compute-lxb-3 ~]# ovs-vsctl set interface dpdk-1 options:n_rxq=4

Here are the real-time VM queue stats:

[root@c8-dpdk ~]# ethtool -S eth0
NIC statistics:
     rx_queue_0_packets: 0
     rx_queue_0_bytes: 0
     rx_queue_1_packets: 369536869
     rx_queue_1_bytes: 262805118523
     rx_queue_2_packets: 106161912
     rx_queue_2_bytes: 18047524850
     rx_queue_3_packets: 106111697
     rx_queue_3_bytes: 18038988269
     rx_queue_4_packets: 368173238
     rx_queue_4_bytes: 262783235737
     rx_queue_5_packets: 0
     rx_queue_5_bytes: 0
     rx_queue_6_packets: 0
     rx_queue_6_bytes: 0
     rx_queue_7_packets: 0
     rx_queue_7_bytes: 0
     tx_queue_0_packets: 48
     tx_queue_0_bytes: 7225
     tx_queue_1_packets: 277191537
     tx_queue_1_bytes: 64323249610
     tx_queue_2_packets: 308945384
     tx_queue_2_bytes: 66342845029
     tx_queue_3_packets: 116130494
     tx_queue_3_bytes: 31497220898
     tx_queue_4_packets: 4599
     tx_queue_4_bytes: 280111
     tx_queue_5_packets: 1068
     tx_queue_5_bytes: 89631
     tx_queue_6_packets: 1106
     tx_queue_6_bytes: 83642
     tx_queue_7_packets: 108513710
     tx_queue_7_bytes: 31040692056

I am noticing that during the load test the CPU interrupt rate goes very high on the VM guest, so I believe my CPU is running out of gas. (On the VM I am not running a DPDK application, so every single packet gets processed by the kernel; do you think that could be the issue?)

[root@c8-dpdk ~]# vmstat -n 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy id wa st
 0  0      0 3244980   2668 400820    0    0     0     0    12     5  0  0 100  0  0
 0  0      0 3243828   2668 400820    0    0     0     0 95518  2990  0 24  76  0  0
 0  0      0 3243988   2668 400820    0    0     0     0 95417  2986  0 24  76  0  0
 0  0      0 3244884   2668 400836    0    0     0     0 90471  2901  0 23  77  0  0
 0  0      0 3244756   2668 400836    0    0     0     0 84658  2733  0 24  76  0  0
 0  0      0 3244468   2668 400836    0    0     0     0 95634  2974  0 24  76  0  0
 0  0      0 3243732   2668 400836    0    0     0     0 95033  3010  0 24  76  0  0
 0  0      0 3244276   2668 400836    0    0     0     0 94840  2986  0 24  76  0  0
 0  0      0 3244596   2668 400836    0    0     0     0 98760  3114  0 24  76  0  0
 0  0      0 3244284   2668 400836    0    0     0     0 88780  2795  0 24  76  0  0
 0  0      0 3244956   2668 400836    0    0     0     0 87878  2756  0 24  76  0  0

I am getting the same result on the SR-IOV guest VM (so it has nothing to do with DPDK). I believe it is my NIC or a kernel limitation. The Intel 82599 only supports 2 queues per VF; check out this link: https://community.intel.com/t5/Ethernet-Products/Intel-NIC-82599-EB-enable-S...

Currently my Trex is sending packets to port0 and receiving packets from port1 (my VM guest just forwards packets from eth0 to eth1). Do you think I should use testpmd on the guest VM to boost packet forwarding from interface a to b?

On Mon, Nov 30, 2020 at 7:07 AM Sean Mooney <smooney@redhat.com> wrote:
On Fri, 2020-11-27 at 18:19 -0500, Satish Patel wrote:
Sean,
Here is the full list of requested output : http://paste.openstack.org/show/800515/
In the above output I am noticing ovs_tx_failure_drops=38000, and at the same time Trex is showing me the same number of drops in its result. I will try to dig into the ARP flood; you are saying the ARP flood would be inside the OVS switch, right?
Looking at the packet stats for the other interfaces, the switch is not flooding the traffic.
The vhu port stats for vhu8710dd6b-e0 indicate that the VM is not receiving packets fast enough, so the vswitch (ovs-dpdk) needs to drop the packets:

{ovs_rx_qos_drops=0, ovs_tx_failure_drops=38000, ovs_tx_invalid_hwol_drops=0,
 ovs_tx_mtu_exceeded_drops=0, ovs_tx_qos_drops=0, ovs_tx_retries=91,
 rx_1024_to_1522_packets=2, rx_128_to_255_packets=20, rx_1523_to_max_packets=0,
 rx_1_to_64_packets=27, rx_256_to_511_packets=24, rx_512_to_1023_packets=1,
 rx_65_to_127_packets=2379, rx_bytes=174105, rx_dropped=0, rx_errors=0,
 rx_packets=2453, tx_bytes=1088003898, tx_dropped=38000, tx_packets=17991999}
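To put those counters in perspective, the drop rate can be computed directly from the values quoted above (tx_dropped and tx_packets); a quick sketch:

```shell
# Drop rate at the vhost-user port, from the stats above:
# tx_dropped=38000, tx_packets=17991999 (successfully delivered)
awk 'BEGIN {
    dropped = 38000
    delivered = 17991999
    printf "%.2f%% of tx packets dropped\n", 100 * dropped / (dropped + delivered)
}'
# prints: 0.21% of tx packets dropped
```

Even a fraction of a percent matters here, since every drop stalls the fast path.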
Based on num_of_vrings="16" I assume you have enabled virtio multi-queue. Have you also enabled that within the guest?
Older versions of DPDK did not negotiate which queues to enable, so if you did not enable all the queues in the guest it would cause really poor performance.
In this case you have 8 rx queues and 8 tx queues, so you will want to ensure that Trex in the guest is using all 8 queues.
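To check and enable all 8 queues inside the guest, something like the following should work (the interface name eth0 is an assumption; substitute the guest's NIC):

```shell
# Show pre-set maximum vs. currently enabled combined queues
ethtool -l eth0

# Enable all 8 combined queues for virtio-net multiqueue in the guest
ethtool -L eth0 combined 8

# Verify traffic is actually being spread across the rx queues
ethtool -S eth0 | grep -E '^ *rx_queue_[0-9]+_packets'
```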
Dropping packets in DPDK is very expensive, so even moderate drop rates hurt its performance more than you would expect. Each drop is basically a branch-predictor miss and causes the processor to roll back all of its speculative execution; it is also likely a cache miss, since the drop code is marked as cold. So ya, drops in DPDK are expensive as a result.
Any command or any specific thing I should look for?
On Fri, Nov 27, 2020 at 8:49 AM Sean Mooney <smooney@redhat.com> wrote:
Sean,
Let me say "Happy Thanksgiving to you and your family". Thank you for taking the time to reply; for the last 2 days I was trying to find you on IRC to discuss this issue. Let me explain to you what I have done so far.
On Thu, 2020-11-26 at 22:10 -0500, Satish Patel wrote:

* First I did load-testing on my bare-metal compute node to see how far my Trex could go, and I found it hit 2 million packets per second. (Not sure if this is a good result or not, but it proves that I can hit at least 1 million pps.)

For 64-byte packets on that NIC it should be hitting about 11 Mpps on one core. That said, I have not validated that in a year or two, but in the past it could easily saturate 10G line rate with 64-byte packets with 1 core.
* Then I created an SR-IOV VM on that compute node with (8 vCPU / 8 GB mem), re-ran Trex, and my max result was 323 kpps without dropping packets. (I found the Intel 82599 NIC VF only supports 2 rx/tx queues, and that could be the bottleneck.)
A VF can fully saturate the NIC and hit 14.4 Mpps if your CPU clock rate is fast enough,
i.e. >3.2-3.5GHz. On a 2.5GHz CPU you probably won't hit that with 1 core, but you should get >10 Mpps.
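For reference, the theoretical 10GbE line rate for 64-byte frames can be derived directly: each frame occupies 64B plus 20B of preamble and inter-frame gap on the wire.

```shell
# 10 Gbit/s divided by the on-wire size of a 64-byte Ethernet frame
awk 'BEGIN { printf "%.2f Mpps\n", 10e9 / ((64 + 20) * 8) / 1e6 }'
# prints: 14.88 Mpps
```

So 14.4 Mpps is essentially line rate after small framing overheads.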
* Finally I decided to build a DPDK VM on it and see how Trex behaved, and I found it hit a max of ~400 kpps with 4 PMD cores. (A little better than SR-IOV, because now I have 4 rx/tx queues thanks to the 4 PMD cores.)
Ya, so these numbers are far too low for a correctly compiled and functioning Trex binary.
For the Trex load test I statically assigned ARP entries, because using static ARP is part of the Trex process.
You are saying it should hit 11 million pps, but the question is what tools you guys are using to hit that number. I didn't see anyone using Trex for DPDK testing; most people are using testpmd.
That won't work properly. If you do that, the OVS bridge will not have its MAC learning table populated, so it will flood packets. For DPDK testing, Trex is a traffic generator originally created by Cisco, I think; it is often used in combination with testpmd. testpmd was designed to test the poll mode drivers, as the name implies, but it is also used as a replacement for the device/application under test, to measure low-level performance in l2/l3 forwarding modes or basic MAC-swap mode.
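A sketch of running testpmd inside the guest as the forwarding application (the PCI addresses, core list, and queue counts here are assumptions; adjust them to the guest's actual layout):

```shell
# Bind the guest's two virtio NICs to a DPDK-capable driver
# (inside a VM without a vIOMMU, vfio-pci may need noiommu mode)
modprobe vfio-pci
dpdk-devbind.py --bind=vfio-pci 0000:00:04.0 0000:00:05.0

# Forward between the two ports, swapping src/dst MACs so the
# switch's MAC learning table stays populated
dpdk-testpmd -l 0-3 -n 4 -- \
    --nb-cores=3 --rxq=4 --txq=4 \
    --forward-mode=macswap --auto-start
```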
What kind of VM (vCPU/memory) are people using to reach 11 million pps?
2-4 vCPUs with 2G of RAM. If DPDK is compiled properly and functioning, you don't need a lot of cores, although you will need to use CPU pinning and hugepages for the VM, and within the VM you will also need hugepages if you are using DPDK there too.
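Setting up hugepages inside the guest is straightforward; a minimal sketch (the page count is an example sizing, not a recommendation):

```shell
# Reserve 1024 x 2MB hugepages inside the guest for the DPDK app
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# Mount hugetlbfs so DPDK can map the pages
mkdir -p /dev/hugepages
mount -t hugetlbfs nodev /dev/hugepages
```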
I am sticking to 8 vCPUs because the majority of my servers use an 8-core VM size, so I am trying to get the most performance out of that.
If you have your load-test scenario or tools available, then please share some information and I will try to mimic that in my environment. Thank you for the reply.
I think you need to start with getting Trex to actually hit 10G line rate with small packets. As I said, you should not need more than about 2 cores and 1-2G of hugepages to do that.
Once you have that working you can move on to the rest, but you need to ensure proper MAC learning happens and ARPs are sent and replied to before starting the traffic generator, so that flooding does not happen. Can you also provide the output of:
sudo ovs-vsctl list Open_vSwitch .
sudo ovs-vsctl show
sudo ovs-vsctl list bridge
sudo ovs-vsctl list port
sudo ovs-vsctl list interface
I just want to confirm that you have properly configured ovs-dpdk to use DPDK.
I don't work with DPDK that often any more, but I generally used testpmd in the guest with an Ixia hardware traffic generator to do performance measurements. I have used Trex, and it can hit line rate, so I'm not sure why you are seeing such low performance.
~S
On Thu, Nov 26, 2020 at 8:14 PM Sean Mooney <smooney@redhat.com> wrote:
Folks,
I am playing with DPDK on my OpenStack with the NIC model 82599 and seeing poor performance. I may be wrong with my numbers, so I want to see what the community thinks about these results.
Compute node hardware:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Memory: 64G
NIC: Intel 82599 (dual 10G port)
[root@compute-lxb-3 ~]# ovs-vswitchd --version
ovs-vswitchd (Open vSwitch) 2.13.2
DPDK 19.11.3
VM dpdk (DUT): 8vCPU / 8GB memory
I have configured my compute node with all the best practices available on the internet to get more performance out of it:
1. Used isolcpus to isolate CPUs
2. 4 dedicated cores for PMD
3. echo isolated_cores=1,9,25,33 >> /etc/tuned/cpu-partitioning-variables.conf
4. Huge pages
5. CPU pinning for VM
6. Increased rx queues (ovs-vsctl set interface dpdk-1 options:n_rxq=4)
7. VM virtio_ring = 1024
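For what it's worth, the PMD pinning for those isolated cores can be expressed as a CPU mask; a sketch (assuming cores 1, 9, 25 and 33, matching the tuned config above):

```shell
# Build the PMD cpu mask for cores 1, 9, 25 and 33
mask=$(printf '0x%x' $(( (1 << 1) | (1 << 9) | (1 << 25) | (1 << 33) )))
echo "$mask"   # 0x202000202

# Apply it to OVS (requires a running ovs-vswitchd with DPDK enabled)
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask="$mask"
```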
After doing all of the above, I am getting the following result using the Trex packet generator with a 64B UDP stream (Total-PPS: 391.93 Kpps). Do you think this is an acceptable result, or should it be higher on this NIC model?
On Thu, 2020-11-26 at 16:56 -0500, Satish Patel wrote:

That is one of Intel's oldest-generation 10G NICs supported by DPDK,
but it should still get to about 11 million packets per second with 1-2 cores.
My guess would be that the VM or traffic generator is not sending and receiving MAC learning frames like ARP properly, and as a result the packets are flooding, which will severely reduce performance.
On the internet folks say it should do a million packets per second, so I am not sure how those people got there, or what I am missing in my load-test profile.
Even kernel OVS will break a million packets per second, so 400 Kpps is far too low; something is misconfigured, but I'm not sure what specifically from what you have shared. As I said, my best guess would be that the packets are flooding because the VM is not responding to ARP, and the NORMAL action cannot learn the MAC address.
You could rule that out by adding hardcoded flow rules, but you could also check the flow tables to confirm.
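A couple of commands that would confirm or rule out flooding (the bridge name br-vlan is taken from the dpif/show output in this thread; both need a running ovs-vswitchd):

```shell
# Dump datapath flows with extra detail: flooded traffic shows up as
# flows whose actions list several output ports instead of a single one
ovs-appctl dpctl/dump-flows -m

# MAC learning table of the physical bridge; the traffic generator's
# and the VM's MAC addresses should both appear here
ovs-appctl fdb/show br-vlan
```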
Notes: I am using 8 vCPU cores on the VM. Do you think adding more cores will help, or should I add more PMDs?
Cpu Utilization : 2.2 %  1.8 Gb/core
Platform_factor : 1.0
Total-Tx        : 200.67 Mbps
Total-Rx        : 200.67 Mbps
Total-PPS       : 391.93 Kpps
Total-CPS       : 391.89 Kcps

Expected-PPS    : 700.00 Kpps
Expected-CPS    : 700.00 Kcps
Expected-BPS    : 358.40 Mbps
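Quantifying the gap between the achieved and expected rates from those two numbers:

```shell
# Achieved vs. expected packet rate from the Trex summary above
awk 'BEGIN { printf "%.1f%% of the expected 700 Kpps achieved\n", 100 * 391.93 / 700.00 }'
# prints: 56.0% of the expected 700 Kpps achieved
```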
This is my all configuration:
grub.conf:
GRUB_CMDLINE_LINUX="vmalloc=384M crashkernel=auto rd.lvm.lv=rootvg01/lv01 console=ttyS1,118200 rhgb quiet intel_iommu=on iommu=pt spectre_v2=off nopti pti=off nospec_store_bypass_disable spec_store_bypass_disable=off l1tf=off default_hugepagesz=1GB hugepagesz=1G hugepages=60 transparent_hugepage=never selinux=0 isolcpus=2,3,4,5,6,7,10,11,12,13,14,15,26,27,28,29,30,31,34,35,36,37,38,39"
[root@compute-lxb-3 ~]# ovs-appctl dpif/show
netdev@ovs-netdev: hit:605860720 missed:2129
  br-int:
    br-int 65534/3: (tap)
    int-br-vlan 1/none: (patch: peer=phy-br-vlan)
    patch-tun 2/none: (patch: peer=patch-int)
    vhu1d64ea7d-d9 5/6: (dpdkvhostuserclient: configured_rx_queues=8, configured_tx_queues=8, mtu=1500, requested_rx_queues=8, requested_tx_queues=8)
    vhu9c32faf6-ac 6/7: (dpdkvhostuserclient: configured_rx_queues=8, configured_tx_queues=8, mtu=1500, requested_rx_queues=8, requested_tx_queues=8)
  br-tun:
    br-tun 65534/4: (tap)
    patch-int 1/none: (patch: peer=patch-tun)
    vxlan-0a410071 2/5: (vxlan: egress_pkt_mark=0, key=flow, local_ip=10.65.0.114, remote_ip=10.65.0.113)
  br-vlan:
    br-vlan 65534/1: (tap)
    dpdk-1 2/2: (dpdk: configured_rx_queues=4, configured_rxq_descriptors=2048, configured_tx_queues=5, configured_txq_descriptors=2048, lsc_interrupt_mode=false, mtu=1500, requested_rx_queues=4, requested_rxq_descriptors=2048, requested_tx_queues=5, requested_txq_descriptors=2048, rx_csum_offload=true, tx_tso_offload=false)
    phy-br-vlan 1/none: (patch: peer=int-br-vlan)
[root@compute-lxb-3 ~]# ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 0 core_id 1:
  isolated : false
  port: dpdk-1            queue-id:  0 (enabled)   pmd usage:  0 %
  port: vhu1d64ea7d-d9    queue-id:  3 (enabled)   pmd usage:  0 %
  port: vhu1d64ea7d-d9    queue-id:  4 (enabled)   pmd usage:  0 %
  port: vhu9c32faf6-ac    queue-id:  3 (enabled)   pmd usage:  0 %
  port: vhu9c32faf6-ac    queue-id:  4 (enabled)   pmd usage:  0 %
pmd thread numa_id 0 core_id 9:
  isolated : false
  port: dpdk-1            queue-id:  1 (enabled)   pmd usage:  0 %
  port: vhu1d64ea7d-d9    queue-id:  2 (enabled)   pmd usage:  0 %
  port: vhu1d64ea7d-d9    queue-id:  5 (enabled)   pmd usage:  0 %
  port: vhu9c32faf6-ac    queue-id:  2 (enabled)   pmd usage:  0 %
  port: vhu9c32faf6-ac    queue-id:  5 (enabled)   pmd usage:  0 %
pmd thread numa_id 0 core_id 25:
  isolated : false
  port: dpdk-1            queue-id:  3 (enabled)   pmd usage:  0 %
  port: vhu1d64ea7d-d9    queue-id:  0 (enabled)   pmd usage:  0 %
  port: vhu1d64ea7d-d9    queue-id:  7 (enabled)   pmd usage:  0 %
  port: vhu9c32faf6-ac    queue-id:  0 (enabled)   pmd usage:  0 %
  port: vhu9c32faf6-ac    queue-id:  7 (enabled)   pmd usage:  0 %
pmd thread numa_id 0 core_id 33:
  isolated : false
  port: dpdk-1            queue-id:  2 (enabled)   pmd usage:  0 %
  port: vhu1d64ea7d-d9    queue-id:  1 (enabled)   pmd usage:  0 %
  port: vhu1d64ea7d-d9    queue-id:  6 (enabled)   pmd usage:  0 %
  port: vhu9c32faf6-ac    queue-id:  1 (enabled)   pmd usage:  0 %
  port: vhu9c32faf6-ac    queue-id:  6 (enabled)   pmd usage:  0 %