OVS-DPDK poor performance with Intel 82599

Satish Patel satish.txt at gmail.com
Mon Nov 30 14:38:31 UTC 2020


Good morning Sean,

Yes, I do have multi-queue enabled, and on the guest I am seeing 8
queues by default. See the following output.

[root at c8-dpdk ~]# ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX: 0
TX: 0
Other: 0
Combined: 8
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 8
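
(For reference: if the current channel count had been lower than the pre-set
maximum, it could be raised from inside the guest with something like the
command below. In my case all 8 channels are already active, so it is not
needed here.)

ethtool -L eth0 combined 8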

As you can see, I have 8 queues on the VM, but only 4 of them are active and
the other 4 are doing nothing. I believe that is because I have only 4 PMD
threads and my vswitch is configured with configured_rx_queues=4 (see the
sketch below the queue stats):

[root at compute-lxb-3 ~]# ovs-vsctl set interface dpdk-1 options:n_rxq=4

Here are the real-time VM queue stats:

[root at c8-dpdk ~]# ethtool -S eth0
NIC statistics:
     rx_queue_0_packets: 0
     rx_queue_0_bytes: 0
     rx_queue_1_packets: 369536869
     rx_queue_1_bytes: 262805118523
     rx_queue_2_packets: 106161912
     rx_queue_2_bytes: 18047524850
     rx_queue_3_packets: 106111697
     rx_queue_3_bytes: 18038988269
     rx_queue_4_packets: 368173238
     rx_queue_4_bytes: 262783235737
     rx_queue_5_packets: 0
     rx_queue_5_bytes: 0
     rx_queue_6_packets: 0
     rx_queue_6_bytes: 0
     rx_queue_7_packets: 0
     rx_queue_7_bytes: 0
     tx_queue_0_packets: 48
     tx_queue_0_bytes: 7225
     tx_queue_1_packets: 277191537
     tx_queue_1_bytes: 64323249610
     tx_queue_2_packets: 308945384
     tx_queue_2_bytes: 66342845029
     tx_queue_3_packets: 116130494
     tx_queue_3_bytes: 31497220898
     tx_queue_4_packets: 4599
     tx_queue_4_bytes: 280111
     tx_queue_5_packets: 1068
     tx_queue_5_bytes: 89631
     tx_queue_6_packets: 1106
     tx_queue_6_bytes: 83642
     tx_queue_7_packets: 108513710
     tx_queue_7_bytes: 31040692056
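
One option I am considering, to get all 8 guest queues doing work, is giving
OVS more PMD threads and more physical rx queues so that every vhost queue
gets serviced. This is only a rough sketch; the core IDs behind the mask value
are illustrative and would have to match cores actually isolated on the host:

# illustrative mask, e.g. cores 1,2,9,10,25,26,33,34 dedicated to PMDs
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x606000606
# raise the physical port rx queues to match
ovs-vsctl set interface dpdk-1 options:n_rxq=8
# redistribute rx queues across the PMD threads and verify the assignment
ovs-appctl dpif-netdev/pmd-rxq-rebalance
ovs-appctl dpif-netdev/pmd-rxq-show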

I am noticing that during the load test the CPU interrupt rate goes very high
on the guest VM, so I believe the CPU is running out of gas (on the VM I am
not running a DPDK application, so every single packet is processed by the
kernel; do you think that could be the issue?).

[root at c8-dpdk ~]# vmstat -n 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 3244980   2668 400820    0    0     0     0   12    5  0  0 100  0  0
 0  0      0 3243828   2668 400820    0    0     0     0 95518 2990  0 24 76  0  0
 0  0      0 3243988   2668 400820    0    0     0     0 95417 2986  0 24 76  0  0
 0  0      0 3244884   2668 400836    0    0     0     0 90471 2901  0 23 77  0  0
 0  0      0 3244756   2668 400836    0    0     0     0 84658 2733  0 24 76  0  0
 0  0      0 3244468   2668 400836    0    0     0     0 95634 2974  0 24 76  0  0
 0  0      0 3243732   2668 400836    0    0     0     0 95033 3010  0 24 76  0  0
 0  0      0 3244276   2668 400836    0    0     0     0 94840 2986  0 24 76  0  0
 0  0      0 3244596   2668 400836    0    0     0     0 98760 3114  0 24 76  0  0
 0  0      0 3244284   2668 400836    0    0     0     0 88780 2795  0 24 76  0  0
 0  0      0 3244956   2668 400836    0    0     0     0 87878 2756  0 24 76  0  0
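
To see whether a single vCPU's softirq handling is what saturates (vmstat only
shows an aggregate), something like the following inside the guest should show
the per-CPU breakdown (assuming the sysstat package is installed):

mpstat -P ALL 1                          # watch for one vCPU near 100% in %soft/%irq
grep -E 'NET_RX|NET_TX' /proc/softirqs   # per-CPU softirq counters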

I am getting the same result on the SR-IOV guest VM (which has nothing to do
with DPDK), so I believe it is a NIC or kernel limitation. The Intel 82599
only supports 2 queues per VF; see this link:
https://community.intel.com/t5/Ethernet-Products/Intel-NIC-82599-EB-enable-SR-IOV-and-multiqueue/td-p/387696

Currently my Trex is sending packets to port0 and receiving packets from
port1 (my guest VM is just forwarding packets from eth0 to eth1). Do you
think I should use testpmd on the guest VM to speed up packet forwarding
from interface A to B? A rough sketch of what I mean is below.
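
This is only a sketch of the in-guest testpmd setup I have in mind; the PCI
addresses, core list and queue counts are illustrative, and it assumes
hugepages are configured inside the guest as well:

modprobe vfio enable_unsafe_noiommu_mode=1
modprobe vfio-pci
dpdk-devbind.py --bind=vfio-pci 0000:00:04.0 0000:00:05.0   # the guest's eth0/eth1
testpmd -l 0-3 --socket-mem 1024 -- \
    --forward-mode=mac --rxq=8 --txq=8 --nb-cores=3 --auto-start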




On Mon, Nov 30, 2020 at 7:07 AM Sean Mooney <smooney at redhat.com> wrote:
>
> On Fri, 2020-11-27 at 18:19 -0500, Satish Patel wrote:
> > Sean,
> >
> > Here is the full list of requested output :
> > http://paste.openstack.org/show/800515/
> >
> > In the above output I am noticing ovs_tx_failure_drops=38000, and at
> > the same time Trex is also showing me the same number of drops in its
> > result. I will try to dig into the ARP flood; you are saying the ARP
> > flood would be inside the OVS switch, right?
> >
> Looking at the packet stats for the other interfaces, the switch is not flooding the
> traffic.
>
> The vhu port stats for vhu8710dd6b-e0 indicate that the VM is not receiving packets fast enough,
> so the vswitch (ovs-dpdk) has to drop the packets.
> {ovs_rx_qos_drops=0, ovs_tx_failure_drops=38000, ovs_tx_invalid_hwol_drops=0, ovs_tx_mtu_exceeded_drops=0, ovs_tx_qos_drops=0, ovs_tx_retries=91,
> rx_1024_to_1522_packets=2, rx_128_to_255_packets=20, rx_1523_to_max_packets=0, rx_1_to_64_packets=27, rx_256_to_511_packets=24,
> rx_512_to_1023_packets=1, rx_65_to_127_packets=2379, rx_bytes=174105, rx_dropped=0, rx_errors=0, rx_packets=2453, tx_bytes=1088003898,
> tx_dropped=38000, tx_packets=17991999}
>
> Based on num_of_vrings="16" I assume you have enabled virtio multi-queue.
> Have you also enabled that within the guest?
>
> Older versions of DPDK did not negotiate which queues to enable, so if you did not enable all the queues in the guest it would cause really poor
> performance.
>
> In this case you have 8 rx queues and 8 tx queues, so you will want to ensure that trex in the guest is using all 8 queues.
>
> Dropping packets in DPDK is very expensive, so even moderate drop rates negatively impact its performance more than you would expect.
> Each drop is basically a branch predictor miss and causes the processor to roll back all its speculative execution; it is also likely a cache miss, as
> the drop code is marked as cold in the code, so drops in DPDK are expensive as a result.
>
> >  Any command or any specific thing I
> > should look for?
> >
> > On Fri, Nov 27, 2020 at 8:49 AM Sean Mooney <smooney at redhat.com> wrote:
> > >
> > > On Thu, 2020-11-26 at 22:10 -0500, Satish Patel wrote:
> > > > Sean,
> > > >
> > > > Let me say "Happy Thanksgiving to you and your family". Thank you for
> > > > taking the time to reply; for the last 2 days I was trying to find you on IRC
> > > > to discuss this issue. Let me explain what I have done so far.
> > > >
> > > > * First I did load testing on my bare-metal compute node to see how
> > > > far my Trex can go, and I found it hit 2 million packets per second (not
> > > > sure if this is a good result or not, but it proves that I can hit at
> > > > least 1 million pps).
> > > For 64-byte packets on that NIC it should be hitting about 11 Mpps on one core.
> > > That said, I have not validated that in a year or two, but it could easily saturate 10G line rate with 64-byte
> > > packets with 1 core in the past.
> > > >
> > > > * Then I created an SR-IOV VM on that compute node (8 vCPU/8GB mem)
> > > > and re-ran Trex; my max result was 323 kpps without dropping
> > > > packets. (I found the Intel 82599 NIC VF only supports 2 rx/tx queues, and
> > > > that could be the bottleneck.)
> > > A VF can fully saturate the NIC and hit 14.4 Mpps if your CPU clock rate is fast enough,
> > >
> > > i.e. >3.2-3.5GHz. On a 2.5GHz CPU you probably won't hit that with 1 core, but you should get >10 Mpps.
> > >
> > > >
> > > > * Finally I decided to build a DPDK VM on it and see how Trex behaved,
> > > > and I found it hit a max of ~400 kpps with 4 PMD cores (a little better
> > > > than SR-IOV because now I have 4 rx/tx queues thanks to the 4 PMD cores).
> > >
> > > Yeah, these numbers are far too low for a correctly compiled and functioning Trex binary.
> > >
> > > >
> > > > For the Trex load test I statically assigned ARP entries, because using
> > > > static ARP is part of the Trex process.
> > > >
> > > That won't work properly. If you do that, the OVS bridge will not have its MAC learning
> > > table populated, so it will flood packets.
> > > > You are saying it should hit
> > > > 11 million pps, but the question is what tools you are using to hit that
> > > > number. I didn't see anyone using Trex for DPDK testing; most people
> > > > use testpmd.
> > > Trex is a traffic generator originally created by Cisco, I think.
> > > It is often used in combination with testpmd. testpmd was designed to test the poll
> > > mode drivers, as the name implies, but it is also used as a replacement for a device/application
> > > under test to measure the low-level performance in l2/l3 forwarding modes or basic MAC swap mode.
> > >
> > >
> > > >
> > > > What kind of VM (vCPU/memory) are people using to reach 11 million
> > > > pps?
> > > >
> > >
> > > 2-4 vCPUs with 2G of RAM.
> > > If DPDK is compiled properly and functioning you don't need a lot of cores, although you will need
> > > to use CPU pinning and hugepages for the VM, and within the VM you will also need hugepages if you are using DPDK there too.
> > >
> > > > I am sticking to 8 vCPUs because the majority of my servers use an 8-core VM
> > > > size, so I am trying to get the most performance out of it.
> > > >
> > > > If you have your load-test scenario or tools available, then
> > > > please share some information so I can try to mimic that in my
> > > > environment. Thank you for the reply.
> > >
> > > I think you need to start with getting Trex to actually hit 10G line rate with small packets.
> > > As I said, you should not need more than about 2 cores to do that and 1-2G of hugepages.
> > >
> > > Once you have that working you can move on to the rest, but you need to ensure proper MAC learning happens and ARPs are sent and replied
> > > to before starting the traffic generator so that flooding does not happen.
> > > Can you also provide the output of
> > >
> > > sudo ovs-vsctl list Open_vSwitch .
> > > and the output of
> > > sudo ovs-vsctl show, sudo ovs-vsctl list bridge, sudo ovs-vsctl list port and sudo ovs-vsctl list interface
> > >
> > > I just want to confirm that you have properly configured ovs-dpdk to use DPDK.
> > >
> > > I don't work with DPDK that often any more, but I generally used testpmd in the guest with an Ixia hardware traffic generator
> > > to do performance measurements. I have used Trex and it can hit line rate, so I'm not sure why you are seeing such low performance.
> > >
> > > >
> > > > ~S
> > > >
> > > >
> > > > On Thu, Nov 26, 2020 at 8:14 PM Sean Mooney <smooney at redhat.com> wrote:
> > > > >
> > > > > On Thu, 2020-11-26 at 16:56 -0500, Satish Patel wrote:
> > > > > > Folks,
> > > > > >
> > > > > > I am playing with DPDK on my OpenStack with NIC model 82599 and seeing
> > > > > > poor performance. I may be wrong with my numbers, so I want to see what
> > > > > > the community thinks about these results.
> > > > > >
> > > > > > Compute node hardware:
> > > > > >
> > > > > > CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
> > > > > > Memory: 64G
> > > > > > NIC: Intel 82599 (dual 10G port)
> > > > > >
> > > > > > [root at compute-lxb-3 ~]# ovs-vswitchd --version
> > > > > > ovs-vswitchd (Open vSwitch) 2.13.2
> > > > > > DPDK 19.11.3
> > > > > >
> > > > > > VM dpdk (DUT):
> > > > > > 8vCPU / 8GB memory
> > > > > >
> > > > > > I have configured my compute node with all the best practices available on
> > > > > > the internet to get more performance out of it.
> > > > > >
> > > > > > 1. Used isolcpus to isolate CPUs
> > > > > > 2. 4 dedicated core for PMD
> > > > > > 3. echo isolated_cores=1,9,25,33 >> /etc/tuned/cpu-partitioning-variables.conf
> > > > > > 4. Huge pages
> > > > > > 5. CPU pinning for VM
> > > > > > 6. increase  ( ovs-vsctl set interface dpdk-1 options:n_rxq=4 )
> > > > > > 7. VM virtio_ring = 1024
> > > > > >
> > > > > > After doing all of the above I am getting the following result using the Trex
> > > > > > packet generator with a 64B UDP stream (Total-PPS       :     391.93
> > > > > > Kpps). Do you think this is an acceptable result, or should it be higher
> > > > > > on this NIC model?
> > > > > That is one of Intel's oldest-generation 10G NICs that is supported by DPDK,
> > > > >
> > > > > but it should still get to about 11 million packets per second with 1-2 cores.
> > > > >
> > > > > My guess would be that the VM or the traffic generator is not sending and receiving MAC learning
> > > > > frames like ARP properly, and as a result the packets are flooding, which will severely
> > > > > reduce performance.
> > > > > >
> > > > > > On the internet folks say it should be a million packets per second, so
> > > > > > I am not sure how those people got there or whether I am missing
> > > > > > something in my load test profile.
> > > > >
> > > > > Even kernel OVS will break a million packets per second, so 400 Kpps is far too low.
> > > > > There is something misconfigured, but I am not sure what specifically from what you have shared.
> > > > > As I said, my best guess would be that the packets are flooding because the VM is not
> > > > > responding to ARP and the NORMAL action has not learned the MAC address.
> > > > >
> > > > > You could rule that out by adding hardcoded rules, but you could also check the flow tables to confirm.
> > > > > >
> > > > > > Notes: I am using 8 vCPU cores on the VM; do you think adding more cores will
> > > > > > help? Or should I add more PMDs?
> > > > > >
> > > > > > Cpu Utilization : 2.2  %  1.8 Gb/core
> > > > > >  Platform_factor : 1.0
> > > > > >  Total-Tx        :     200.67 Mbps
> > > > > >  Total-Rx        :     200.67 Mbps
> > > > > >  Total-PPS       :     391.93 Kpps
> > > > > >  Total-CPS       :     391.89 Kcps
> > > > > >
> > > > > >  Expected-PPS    :     700.00 Kpps
> > > > > >  Expected-CPS    :     700.00 Kcps
> > > > > >  Expected-BPS    :     358.40 Mbps
> > > > > >
> > > > > >
> > > > > > This is my all configuration:
> > > > > >
> > > > > > grub.conf:
> > > > > > GRUB_CMDLINE_LINUX="vmalloc=384M crashkernel=auto
> > > > > > rd.lvm.lv=rootvg01/lv01 console=ttyS1,118200 rhgb quiet intel_iommu=on
> > > > > > iommu=pt spectre_v2=off nopti pti=off nospec_store_bypass_disable
> > > > > > spec_store_bypass_disable=off l1tf=off default_hugepagesz=1GB
> > > > > > hugepagesz=1G hugepages=60 transparent_hugepage=never selinux=0
> > > > > > isolcpus=2,3,4,5,6,7,10,11,12,13,14,15,26,27,28,29,30,31,34,35,36,37,38,39"
> > > > > >
> > > > > >
> > > > > > [root at compute-lxb-3 ~]# ovs-appctl dpif/show
> > > > > > netdev at ovs-netdev: hit:605860720 missed:2129
> > > > > >   br-int:
> > > > > >     br-int 65534/3: (tap)
> > > > > >     int-br-vlan 1/none: (patch: peer=phy-br-vlan)
> > > > > >     patch-tun 2/none: (patch: peer=patch-int)
> > > > > >     vhu1d64ea7d-d9 5/6: (dpdkvhostuserclient: configured_rx_queues=8,
> > > > > > configured_tx_queues=8, mtu=1500, requested_rx_queues=8,
> > > > > > requested_tx_queues=8)
> > > > > >     vhu9c32faf6-ac 6/7: (dpdkvhostuserclient: configured_rx_queues=8,
> > > > > > configured_tx_queues=8, mtu=1500, requested_rx_queues=8,
> > > > > > requested_tx_queues=8)
> > > > > >   br-tun:
> > > > > >     br-tun 65534/4: (tap)
> > > > > >     patch-int 1/none: (patch: peer=patch-tun)
> > > > > >     vxlan-0a410071 2/5: (vxlan: egress_pkt_mark=0, key=flow,
> > > > > > local_ip=10.65.0.114, remote_ip=10.65.0.113)
> > > > > >   br-vlan:
> > > > > >     br-vlan 65534/1: (tap)
> > > > > >     dpdk-1 2/2: (dpdk: configured_rx_queues=4,
> > > > > > configured_rxq_descriptors=2048, configured_tx_queues=5,
> > > > > > configured_txq_descriptors=2048, lsc_interrupt_mode=false, mtu=1500,
> > > > > > requested_rx_queues=4, requested_rxq_descriptors=2048,
> > > > > > requested_tx_queues=5, requested_txq_descriptors=2048,
> > > > > > rx_csum_offload=true, tx_tso_offload=false)
> > > > > >     phy-br-vlan 1/none: (patch: peer=int-br-vlan)
> > > > > >
> > > > > >
> > > > > > [root at compute-lxb-3 ~]# ovs-appctl dpif-netdev/pmd-rxq-show
> > > > > > pmd thread numa_id 0 core_id 1:
> > > > > >   isolated : false
> > > > > >   port: dpdk-1            queue-id:  0 (enabled)   pmd usage:  0 %
> > > > > >   port: vhu1d64ea7d-d9    queue-id:  3 (enabled)   pmd usage:  0 %
> > > > > >   port: vhu1d64ea7d-d9    queue-id:  4 (enabled)   pmd usage:  0 %
> > > > > >   port: vhu9c32faf6-ac    queue-id:  3 (enabled)   pmd usage:  0 %
> > > > > >   port: vhu9c32faf6-ac    queue-id:  4 (enabled)   pmd usage:  0 %
> > > > > > pmd thread numa_id 0 core_id 9:
> > > > > >   isolated : false
> > > > > >   port: dpdk-1            queue-id:  1 (enabled)   pmd usage:  0 %
> > > > > >   port: vhu1d64ea7d-d9    queue-id:  2 (enabled)   pmd usage:  0 %
> > > > > >   port: vhu1d64ea7d-d9    queue-id:  5 (enabled)   pmd usage:  0 %
> > > > > >   port: vhu9c32faf6-ac    queue-id:  2 (enabled)   pmd usage:  0 %
> > > > > >   port: vhu9c32faf6-ac    queue-id:  5 (enabled)   pmd usage:  0 %
> > > > > > pmd thread numa_id 0 core_id 25:
> > > > > >   isolated : false
> > > > > >   port: dpdk-1            queue-id:  3 (enabled)   pmd usage:  0 %
> > > > > >   port: vhu1d64ea7d-d9    queue-id:  0 (enabled)   pmd usage:  0 %
> > > > > >   port: vhu1d64ea7d-d9    queue-id:  7 (enabled)   pmd usage:  0 %
> > > > > >   port: vhu9c32faf6-ac    queue-id:  0 (enabled)   pmd usage:  0 %
> > > > > >   port: vhu9c32faf6-ac    queue-id:  7 (enabled)   pmd usage:  0 %
> > > > > > pmd thread numa_id 0 core_id 33:
> > > > > >   isolated : false
> > > > > >   port: dpdk-1            queue-id:  2 (enabled)   pmd usage:  0 %
> > > > > >   port: vhu1d64ea7d-d9    queue-id:  1 (enabled)   pmd usage:  0 %
> > > > > >   port: vhu1d64ea7d-d9    queue-id:  6 (enabled)   pmd usage:  0 %
> > > > > >   port: vhu9c32faf6-ac    queue-id:  1 (enabled)   pmd usage:  0 %
> > > > > >   port: vhu9c32faf6-ac    queue-id:  6 (enabled)   pmd usage:  0 %
> > > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> >
>
>


