On Mon, 2021-07-19 at 18:13 -0500, hai wu wrote:
Thanks. But this Red Hat KB article also suggests setting the tap txqueuelen to 10000: https://access.redhat.com/solutions/2785881
Also, it seems I might be hitting a known OpenStack bug. nova only updates the relevant VM's libvirt XML with rx_queue_size=1024 and consistently ignores tx_queue_size=1024, even though it is configured in nova.conf and nova-compute has already been restarted with systemctl. Maybe I am hitting this known bug? https://github.com/openstack/nova/commit/7ee4fefa2f6cc98dbd7b3d6636949498a6a...
That was fixed a long time ago, and if you set both the tx and rx queue size it would still have worked before the fix. The bug would only have manifested if you had not set both, so if you have set both, that should not be a factor. Can you confirm what version of OpenStack you are deploying? Is it Train? Is it OSP, RDO, or something else?
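Since the question above hinges on whether both options are really set, here is a minimal shell sketch for checking that. It assumes the options live in the [libvirt] section of a nova.conf-style INI file; the helper names libvirt_opt and check_queue_opts are made up for illustration and are not part of nova.

```shell
#!/bin/sh
# Print the value of option $2 from the [libvirt] section of the
# nova.conf-style file $1. Prints nothing if the option is not set there.
# (Hypothetical helper, not a nova tool.)
libvirt_opt() {
    awk -v key="$2" '
        # Track whether the current section header is [libvirt].
        /^[ \t]*\[/ { in_sec = ($0 ~ /^[ \t]*\[libvirt\][ \t]*$/); next }
        # Inside [libvirt], skip comments and parse "key = value" lines.
        in_sec && $0 !~ /^[ \t]*[#;]/ {
            n = index($0, "=")
            if (n == 0) next
            k = substr($0, 1, n - 1)
            v = substr($0, n + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            if (k == key) val = v
        }
        END { if (val != "") print val }
    ' "$1"
}

# Report both queue-size options; return non-zero if either is missing.
check_queue_opts() {
    rc=0
    for opt in rx_queue_size tx_queue_size; do
        val=$(libvirt_opt "$1" "$opt")
        if [ -n "$val" ]; then
            echo "$opt=$val"
        else
            echo "$opt is not set"
            rc=1
        fi
    done
    return "$rc"
}
```

Running check_queue_opts /etc/nova/nova.conf on the compute node (the path is an assumption, deployments differ) would show whether both values are present in the section nova actually reads.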
On Mon, Jul 19, 2021 at 5:39 PM Sean Mooney <smooney@redhat.com> wrote:
On Mon, 2021-07-19 at 13:54 -0500, hai wu wrote:
Hmm, for txqueuelen I actually followed the recommendation from here: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/..., which suggests doing this:
cat <<'EOF'>/etc/udev/rules.d/71-net-txqueuelen.rules
SUBSYSTEM=="net", ACTION=="add", KERNEL=="tap*", ATTR{tx_queue_len}="10000"
EOF
or
/sbin/ip link set tap<uuid> txqueuelen 10000
Thanks for bringing that document to my attention. I will escalate it internally to ensure it is removed, as the content is incorrect. That document is part of the OVS-DPDK end-to-end troubleshooting guide, but we do not support the use of tap devices with OVS-DPDK.
I have filed a downstream bug to correct this: https://bugzilla.redhat.com/show_bug.cgi?id=1983828
The use of a tap device with OVS-DPDK is highly inefficient, as the tap device is not DPDK accelerated and is instead handled on the main thread of the ovs-vswitchd process. This severely limits performance, and under heavy traffic load it can cause issues with programming OpenFlow rules.
I will try to bump up both libvirt/rx_queue_size and libvirt/tx_queue_size to 1024. I am just not sure about the difference between the above and the corresponding libvirt setting.
Ignoring that setting parameters on tap devices via udev would be unsupported in vendor distributions of OpenStack, the main difference is that doing it correctly with the config option will work on vhost-user ports and any virtio backend that supports them, whereas the udev approach will only work with tap devices.
The udev rule is altering the parameters of the tap device
cat <<'EOF'>/etc/udev/rules.d/71-net-txqueuelen.rules
SUBSYSTEM=="net", ACTION=="add", KERNEL=="tap*", ATTR{tx_queue_len}="10000"
EOF
but I'm not sure those changes will be present in the guest, as it is not clear to me that they will alter the virtio-net-pci device frontend created by qemu, which is what is presented to the guest. Setting the values in nova.conf will update the contents of the libvirt XML and will ensure that both are set correctly.
A queue length of 10000 is not one of the lengths supported by qemu, so I'm not sure that it will actually help. At most I suspect that it will add additional buffering, but as far as I am aware the max queue length supported by qemu is 1024.
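To see what actually landed in the libvirt XML, one option is to pull the queue-size attributes out of the domain definition. This is a rough sketch that assumes the values appear as rx_queue_size and tx_queue_size attributes on the interface's driver element; the function name queue_attrs is hypothetical.

```shell
#!/bin/sh
# Read libvirt domain XML on stdin and print every rx_queue_size /
# tx_queue_size attribute found, one per line, with quotes stripped.
# (Hypothetical helper; relies on grep -o, available in GNU and BSD grep.)
queue_attrs() {
    grep -o "[rt]x_queue_size=[\"'][0-9]*[\"']" | tr -d "\"'"
}
```

It could be fed from virsh, for example virsh dumpxml <domain> | queue_attrs. If only rx_queue_size is printed after the instance has been hard rebooted or re-created, the tx setting did not make it into the XML.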
Also it seems by default rx_queue_size and tx_queue_size would be None, which means bumping them up to 1024 should help with packet drops?
On Mon, Jul 19, 2021 at 1:18 PM Sean Mooney <smooney@redhat.com> wrote:
On Mon, 2021-07-19 at 19:16 +0100, Sean Mooney wrote:
On Mon, 2021-07-19 at 12:54 -0500, hai wu wrote:
I already set txqueuelen for that VM's tap interface and enabled multi-queue for the VM, but its tap rxdrop count still kept increasing at random, dropping one packet every 10, 20, or 60 seconds or so. (That makes sense to me, since those settings would only help with txdrops, not rxdrops, per my understanding.)
Multi-queue should help with both tx and rx drops, by the way.
When you enable multi-queue we allocate 1 rx and tx queue per VM CPU, which should allow the network backend to process more packets in parallel from the VM. If the network backend is overloaded and cannot process any more packets, then adding more rx queues won't help, but provided your network backend is not the bottleneck, it will.
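One rough way to check how many queues an interface exposes is to count the rx-N and tx-N entries under its sysfs queues directory. The sketch below assumes the standard /sys/class/net layout; whether the count reflects the queues actually in use depends on the driver, so treat it as a hint rather than proof. The function name count_queues is made up for illustration.

```shell
#!/bin/sh
# Count the rx and tx queues that sysfs lists for a network device.
# $1: device name, $2: sysfs root (defaults to /sys/class/net).
# (Hypothetical helper for illustration.)
count_queues() {
    dev=$1
    root=${2:-/sys/class/net}
    rx=0
    tx=0
    for q in "$root/$dev"/queues/rx-*; do
        if [ -d "$q" ]; then rx=$((rx + 1)); fi
    done
    for q in "$root/$dev"/queues/tx-*; do
        if [ -d "$q" ]; then tx=$((tx + 1)); fi
    done
    echo "rx=$rx tx=$tx"
}
```

Inside the guest, count_queues eth0 (interface name assumed) can be compared with the number of vCPUs; ethtool -l on the same interface shows the channel counts the driver reports.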
This is an idle VM. After migrating this idle VM to another idle OpenStack hypervisor, so it is only one VM running on one dedicated physical OpenStack hypervisor, and now it is dropping 0.03 rxdrops/s every 10 minutes.
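For a drop rate this low it can help to sample the counter directly rather than eyeball it. A small sketch, assuming the tap's counter is exposed at /sys/class/net/<dev>/statistics/rx_dropped; the function names are hypothetical.

```shell
#!/bin/sh
# Read one interface counter from sysfs.
# $1: device, $2: counter name, $3: sysfs root (default /sys/class/net).
read_stat() {
    cat "${3:-/sys/class/net}/$1/statistics/$2"
}

# Drops per second between two counter samples.
# $1: first count, $2: second count, $3: seconds between the samples.
drop_rate() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.3f\n", (b - a) / s }'
}

# Sample rx_dropped twice, $2 seconds apart (default 10), and print the rate.
sample_rx_drop_rate() {
    dev=$1
    interval=${2:-10}
    root=${3:-/sys/class/net}
    first=$(read_stat "$dev" rx_dropped "$root")
    sleep "$interval"
    second=$(read_stat "$dev" rx_dropped "$root")
    drop_rate "$first" "$second" "$interval"
}
```

For example, sample_rx_drop_rate tap<uuid> 600 on the hypervisor gives the average over ten minutes, which can be compared before and after a tuning change.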
I am not aware of any way to configure rxqueuelen for tap interface.
You configure the rx queue length the same way you configure the tx queue length: https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.rx_... and the tx queue length is configured by https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.tx_... which you presumably already have set.
I assume tap RX -> Hypervisor RX -> VM TX, correct? How to tune rxqueuelen for tap interface?
rx on the taps is tx from the VM, yes. rx drops normally mean the packets were dropped by the network backend for some reason, e.g. OVS or Linux bridge is discarding packets.
If logging into this idle Linux test VM, I am not seeing any drops, either in its RX or TX.
On Mon, Jul 19, 2021 at 12:03 PM Sean Mooney <smooney@redhat.com> wrote:
>
> On Mon, 2021-07-19 at 11:39 -0500, hai wu wrote:
> > There are random very slow rxdrops for certain OpenStack VMs for their
> > network tap interfaces. Please note that this is rxdrop, NOT txdrop.
> >
> > I know we could tune txqueuelen and multi-queue for tap network
> > interface txdrop issue, but is there any known way to tune for this
> > tap network interface rxdrop issue?
> I think an rx drop typically means the vswitch/kernel is dropping packets, so I think
> any tuning you applied would have to be on the kernel side.
>
> With that said, you can configure the rx queue length, and enabling multi-queue will also result in
> additional rx queues, so it may help. But no, I don't know of any one fix for this; you will have
> to see what tuning works in your env for your given traffic profile.
> >
> > Thanks,
> > Hai
> >
> >