OpenStack network tap interface random rxdrops for some VMs

Satish Patel satish.txt at gmail.com
Wed Jul 21 20:02:20 UTC 2021


I used the iptraf-ng tool to find out how much PPS my VM was handling.

You can use any packet generator tool and blast traffic at the VM.
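
If you do not have iptraf-ng handy, a rough sketch of the same measurement
using only the kernel counters (the tap name below is hypothetical; use the
one attached to your instance):

TAP=tap0a1b2c3d-4e   # hypothetical tap device name on the hypervisor
BEFORE=$(cat /sys/class/net/$TAP/statistics/rx_packets)
sleep 10
AFTER=$(cat /sys/class/net/$TAP/statistics/rx_packets)
echo "rx pps: $(( (AFTER - BEFORE) / 10 ))"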

Sent from my iPhone

> On Jul 21, 2021, at 3:54 PM, hai wu <haiwu.us at gmail.com> wrote:
> 
> SR-IOV could be the final big hammer to try if we still see drops
> after checking everything else. How do you measure the PPS rate for
> existing VM instances? We are only using OpenStack provider networking,
> not Open vSwitch, so we can't use its commands.
> 
>> On Wed, Jul 21, 2021 at 12:07 PM Satish Patel <satish.txt at gmail.com> wrote:
>> 
>> I had the same issue 3 years ago when we launched our first OpenStack
>> private cloud without knowing all the details of virtual network
>> performance. I did all the tuning you mention above, but none of it
>> helped. The issue is that tap interfaces (virtio) can't handle a high
>> PPS rate because they run in kernel space; when a high packet rate
>> hits your tap interface, all processing happens in the kernel, which
>> drives a high interrupt rate and causes packets to get dropped.
>> 
>> To solve the packet drop issue I migrated my cloud to run on SR-IOV
>> (because at that time DPDK was new).  I would say find out what the
>> PPS rate is on the instance that is dropping packets.  In my benchmark
>> we found 100 kpps was a hard limit; beyond that we started seeing
>> drops in RX.
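>>
>> A rough way to reproduce that kind of benchmark (the address, flags and
>> tap name below are only examples): blast small UDP packets at the VM
>> from another host, e.g. "iperf3 -c <vm-ip> -u -l 64 -b 0 -t 60", while
>> watching the drop counter on the instance's tap device on the
>> hypervisor:
>>
>> TAP=tapXXXXXXXX-XX   # the instance's tap device
>> watch -n 1 "cat /sys/class/net/$TAP/statistics/rx_dropped"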
>> 
>> 
>>> On Tue, Jul 20, 2021 at 12:22 PM hai wu <haiwu.us at gmail.com> wrote:
>>> 
>>> We use 'train' here, with the OpenStack Debian distribution. Do you
>>> know why tx_queue_size is being ignored here? rx_queue_size is
>>> working fine.
>>> 
>>> On Tue, Jul 20, 2021 at 7:11 AM Sean Mooney <smooney at redhat.com> wrote:
>>>> 
>>>> On Mon, 2021-07-19 at 18:13 -0500, hai wu wrote:
>>>>> Thanks. But this Red Hat KB article also suggests setting the tap
>>>>> txqueuelen to 10000:
>>>>> https://access.redhat.com/solutions/2785881
>>>>> 
>>>>> Also it seems I might be hitting a known OpenStack bug. nova only
>>>>> updates the relevant VM libvirt XML with rx_queue_size=1024, and it
>>>>> consistently ignores tx_queue_size=1024, even though that is
>>>>> configured in nova.conf and nova-compute has already been restarted
>>>>> with systemctl. Maybe I am hitting this known bug?
>>>>> https://github.com/openstack/nova/commit/7ee4fefa2f6cc98dbd7b3d6636949498a6a23dd5
>>>> That was fixed a long time ago, and if you set both the tx and rx queue size it would still have worked before the fix.
>>>> The bug only manifested if you did not set both, so if you have set both that should not be a factor.
>>>>
>>>> Can you confirm what version of OpenStack you are deploying? Is it Train? Is it OSP or RDO, or something else?
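>>>>
>>>> (One quick way to check on a Debian-packaged deployment, assuming the
>>>> distro package name:)
>>>>
>>>> dpkg -l nova-compute    # the package version maps to the release
>>>> nova-manage --version   # prints the nova version number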
>>>>> 
>>>>> On Mon, Jul 19, 2021 at 5:39 PM Sean Mooney <smooney at redhat.com> wrote:
>>>>>> 
>>>>>> On Mon, 2021-07-19 at 13:54 -0500, hai wu wrote:
>>>>>>> Hmm, for txqueuelen I actually followed the recommendation from here:
>>>>>>> https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/ovs-dpdk_end_to_end_troubleshooting_guide/high_packet_loss_in_the_tx_queue_of_the_instance_s_tap_interface,
>>>>>>> which suggests doing this:
>>>>>>> 
>>>>>>> cat <<'EOF'>/etc/udev/rules.d/71-net-txqueuelen.rules
>>>>>>> SUBSYSTEM=="net", ACTION=="add", KERNEL=="tap*", ATTR{tx_queue_len}="10000"
>>>>>>> EOF
>>>>>>> 
>>>>>>> or
>>>>>>> 
>>>>>>> /sbin/ip link set tap<uuid> txqueuelen 10000
>>>>>> 
>>>>>> Thanks for bringing that document to my attention; I will escalate it internally to ensure it is removed, as the content is incorrect.
>>>>>> That document is part of the ovs-dpdk end-to-end troubleshooting guide, but we do not support the use of tap devices with ovs-dpdk.
>>>>>>
>>>>>> I have filed a downstream bug to correct this: https://bugzilla.redhat.com/show_bug.cgi?id=1983828
>>>>>>
>>>>>> The use of a tap device with ovs-dpdk is highly inefficient, as the tap device is not DPDK-accelerated and is instead handled on the main thread of the ovs-vswitchd process.
>>>>>> This is severely limited in performance, and under heavy traffic load it can cause issues with programming OpenFlow rules.
>>>>>> 
>>>>>>> 
>>>>>>> I will try to bump both libvirt/rx_queue_size and
>>>>>>> libvirt/tx_queue_size up to 1024; I'm just not sure about the
>>>>>>> difference between the udev approach above and the corresponding
>>>>>>> libvirt options.
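>>>>>>>
>>>>>>> For reference, this is the shape of the nova.conf change I am
>>>>>>> planning (paths and values as I understand them, please correct me
>>>>>>> if this is wrong):
>>>>>>>
>>>>>>> # /etc/nova/nova.conf on the compute node
>>>>>>> [libvirt]
>>>>>>> rx_queue_size = 1024
>>>>>>> tx_queue_size = 1024
>>>>>>>
>>>>>>> followed by a restart of nova-compute and a hard reboot of the
>>>>>>> instance so that the libvirt XML is regenerated.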
>>>>>>> 
>>>>>> Setting aside the fact that setting parameters on tap devices via udev would be unsupported in vendor distributions of OpenStack, the main
>>>>>> difference is that doing it correctly with the config option will work on vhost-user ports and any virtio backend that supports them, whereas
>>>>>> the udev approach will only work with tap devices.
>>>>>> 
>>>>>> The udev rule is altering the parameters of the tap device:
>>>>>> 
>>>>>> cat <<'EOF'>/etc/udev/rules.d/71-net-txqueuelen.rules
>>>>>> SUBSYSTEM=="net", ACTION=="add", KERNEL=="tap*", ATTR{tx_queue_len}="10000"
>>>>>> EOF
>>>>>> 
>>>>>> but I'm not sure those changes will be visible in the guest, as it is not clear to me that they will alter the virtio-net-pci device frontend created by QEMU,
>>>>>> which is what is presented to the guest.  Setting the values in nova.conf will update the contents of the libvirt XML and will ensure that both are set correctly.
>>>>>> 
>>>>>> A queue length of 10000 is not one of the lengths supported by QEMU, so I'm not sure it will actually help. At most I suspect
>>>>>> it will add additional buffering, but as far as I am aware the maximum queue length supported by QEMU is 1024.
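>>>>>>
>>>>>> As a quick sanity check (the instance name below is just an example),
>>>>>> you can confirm what actually ended up in the libvirt XML on the
>>>>>> compute node:
>>>>>>
>>>>>> virsh dumpxml instance-0000abcd | grep -E 'rx_queue_size|tx_queue_size'
>>>>>> # expected on the interface's <driver> element, e.g.:
>>>>>> #   <driver name='vhost' rx_queue_size='1024' tx_queue_size='1024'/>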
>>>>>> 
>>>>>>> Also it seems by
>>>>>>> default rx_queue_size and tx_queue_size would be None, which means
>>>>>>> bumping them up to 1024 should help with packet drops?
>>>>>>> 
>>>>>>> On Mon, Jul 19, 2021 at 1:18 PM Sean Mooney <smooney at redhat.com> wrote:
>>>>>>>> 
>>>>>>>> On Mon, 2021-07-19 at 19:16 +0100, Sean Mooney wrote:
>>>>>>>>> On Mon, 2021-07-19 at 12:54 -0500, hai wu wrote:
>>>>>>>>>> I already increased txqueuelen for that VM's tap interface and enabled
>>>>>>>>>> multi-queue for the VM, but its tap rxdrop counter still keeps
>>>>>>>>>> increasing randomly, dropping one packet every 10 or 20 or 60 seconds
>>>>>>>>>> (which makes sense to me, since those would only help with txdrops,
>>>>>>>>>> not rxdrops, per my understanding).
>>>>>>>>>> 
>>>>>>>> Multi-queue should help with both tx and rx drops, by the way.
>>>>>>>>
>>>>>>>> When you enable multi-queue we allocate one rx and one tx queue per VM CPU,
>>>>>>>> which should allow the network backend to process more packets from the VM in parallel.
>>>>>>>> If the network backend is overloaded and cannot process any more packets, then adding more rx queues won't help,
>>>>>>>> but provided the network backend is not the bottleneck it will.
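>>>>>>>>
>>>>>>>> For completeness, a rough sketch of how multi-queue is normally enabled
>>>>>>>> and verified (the image and interface names are only examples):
>>>>>>>>
>>>>>>>> # on the controller: tag the image, then rebuild/boot the instance
>>>>>>>> openstack image set --property hw_vif_multiqueue_enabled=true my-image
>>>>>>>>
>>>>>>>> # inside the guest: check and raise the number of combined queues
>>>>>>>> ethtool -l eth0
>>>>>>>> ethtool -L eth0 combined 4   # up to the number of vCPUs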
>>>>>>>>>> This is an idle VM. After migrating
>>>>>>>>>> this idle VM to another idle OpenStack hypervisor, so that it is the only
>>>>>>>>>> VM running on a dedicated physical OpenStack hypervisor, it is now
>>>>>>>>>> dropping about 0.03 rx packets/s measured over 10-minute intervals.
>>>>>>>>>>
>>>>>>>>>> I am not aware of any way to configure rxqueuelen for a tap interface.
>>>>>>>>>> 
>>>>>>>>> You configure the rx queue length the same way you configure the tx queue length:
>>>>>>>>> https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.rx_queue_size
>>>>>>>>> The tx queue length is configured by https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.tx_queue_size, which you presumably already have set.
>>>>>>>>> 
>>>>>>>>>> I
>>>>>>>>>> assume tap RX -> Hypervisor RX -> VM TX, correct? How to tune
>>>>>>>>> RX on the tap is TX from the VM, yes.
>>>>>>>>> RX drops normally mean the packets were dropped by the network backend for some reason,
>>>>>>>>> e.g. OVS or Linux bridge is discarding packets.
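>>>>>>>>>
>>>>>>>>> A quick way to look at those counters on the hypervisor (the tap name
>>>>>>>>> below is only an example):
>>>>>>>>>
>>>>>>>>> # show detailed RX/TX statistics, including the dropped columns
>>>>>>>>> ip -s -s link show dev tapXXXXXXXX-XX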
>>>>>>>>>> rxqueuelen for a tap interface? If I log into this idle Linux test VM,
>>>>>>>>>> I do not see any drops in either its RX or TX.
>>>>>>>>>> 
>>>>>>>>>> On Mon, Jul 19, 2021 at 12:03 PM Sean Mooney <smooney at redhat.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, 2021-07-19 at 11:39 -0500, hai wu wrote:
>>>>>>>>>>>> There are random, very slow rxdrops on the network tap interfaces of
>>>>>>>>>>>> certain OpenStack VMs. Please note that these are rxdrops, NOT txdrops.
>>>>>>>>>>>> 
>>>>>>>>>>>> I know we could tune txqueuelen and multi-queue for a tap interface
>>>>>>>>>>>> txdrop issue, but is there any known way to tune for this tap
>>>>>>>>>>>> interface rxdrop issue?
>>>>>>>>>>> I think an rx drop typically means the vswitch/kernel is dropping packets, so I think
>>>>>>>>>>> any tuning you apply would have to be on the kernel side.
>>>>>>>>>>>
>>>>>>>>>>> With that said, you can configure the rx queue length, and enabling multi-queue will also result in
>>>>>>>>>>> additional rx queues, so it may help; but no, I don't know of any one fix for this. You will have
>>>>>>>>>>> to see what tuning works in your environment for your given traffic profile.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Hai
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 


