OpenStack network tap interface random rxdrops for some VMs
hai wu
haiwu.us at gmail.com
Wed Jul 21 19:54:11 UTC 2021
SRIOV could be the final big hammer to try, if we still see drops
after all the other tuning. How do you measure the PPS rate for
existing VM instances? We only use OpenStack provider networking, not
Open vSwitch, so we can't use its commands.
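
For now I am planning to just sample the tap interface counters on the
hypervisor, something along these lines (the tap device name below is
only an example):

  TAP=tap12345678-ab   # example tap device name on the hypervisor
  RX1=$(cat /sys/class/net/$TAP/statistics/rx_packets)
  TX1=$(cat /sys/class/net/$TAP/statistics/tx_packets)
  sleep 10
  RX2=$(cat /sys/class/net/$TAP/statistics/rx_packets)
  TX2=$(cat /sys/class/net/$TAP/statistics/tx_packets)
  echo "rx pps: $(( (RX2 - RX1) / 10 ))  tx pps: $(( (TX2 - TX1) / 10 ))"

Is that roughly how you measured it, or is there a better way?
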
On Wed, Jul 21, 2021 at 12:07 PM Satish Patel <satish.txt at gmail.com> wrote:
>
> I had the same issue 3 years ago when we launched our first openstack
> private cloud without knowing all the details of virtual network
> performance. I did all the kinds of tuning you mentioned above, but none
> of them helped. The issue is that tap interfaces (i.e. virtio) can't
> handle a high PPS rate because they run in kernel space, and when you
> have a high packet rate hitting your tap interface all the processing
> happens in the kernel, which drives a high interrupt load and causes
> packets to get dropped.
>
> To solve the packet drop issue I migrated my cloud to run on
> SRIOV (because at that time dpdk was new). I would say find out what
> the PPS rate is on the instance that drops the packets. In my
> benchmark we found 100kpps was a hard limit; after that we started
> seeing drops in RX.
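>
> A quick way to see whether the tap itself is the drop point is to
> watch its counters on the hypervisor, roughly like this (the tap
> device name is only an example):
>
>   # detailed RX/TX stats, including dropped counts, for one tap device
>   ip -s -s link show dev tap12345678-ab
>   # refresh every second and highlight the counters that are moving
>   watch -d -n 1 'ip -s link show dev tap12345678-ab'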
>
>
> On Tue, Jul 20, 2021 at 12:22 PM hai wu <haiwu.us at gmail.com> wrote:
> >
> > We use 'train' here, with the OpenStack Debian distribution. Do
> > you know why tx_queue_size is being ignored here? rx_queue_size is
> > working fine.
> >
> > On Tue, Jul 20, 2021 at 7:11 AM Sean Mooney <smooney at redhat.com> wrote:
> > >
> > > On Mon, 2021-07-19 at 18:13 -0500, hai wu wrote:
> > > > Thanks. But this Red Hat KB article also suggests setting the tap
> > > > txqueuelen to 10000:
> > > > https://access.redhat.com/solutions/2785881
> > > >
> > > > Also it seems I might be hitting some known openstack bugs. Nova would
> > > > only update the relevant VM libvirt XML with rx_queue_size=1024, and it
> > > > would consistently ignore tx_queue_size=1024, even though both are
> > > > configured in nova.conf and nova-compute has already been restarted
> > > > with systemctl. Maybe I am hitting this known bug?
> > > > https://github.com/openstack/nova/commit/7ee4fefa2f6cc98dbd7b3d6636949498a6a23dd5
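> > > >
> > > > For reference, this is how I am checking what nova actually puts
> > > > into the libvirt XML (the instance name below is only an example):
> > > >
> > > >   virsh dumpxml instance-0000abcd | grep -E 'rx_queue_size|tx_queue_size'
> > > >
> > > > rx_queue_size='1024' shows up on the interface <driver> element,
> > > > but tx_queue_size never does.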
> > > That has been fixed a long time ago, and if you set both the tx and rx queue sizes it would have still worked even before the fix.
> > > The bug would only have manifested if you set just one of them, so if you have set both it should not be a factor.
> > >
> > > Can you confirm what version of openstack you are deploying? Is it train? Is it OSP or RDO, or something else?
> > > >
> > > > On Mon, Jul 19, 2021 at 5:39 PM Sean Mooney <smooney at redhat.com> wrote:
> > > > >
> > > > > On Mon, 2021-07-19 at 13:54 -0500, hai wu wrote:
> > > > > > hmm for txqueuelen, I actually followed the recommendation from here:
> > > > > > https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/ovs-dpdk_end_to_end_troubleshooting_guide/high_packet_loss_in_the_tx_queue_of_the_instance_s_tap_interface,
> > > > > > where it suggests doing this:
> > > > > >
> > > > > > cat <<'EOF'>/etc/udev/rules.d/71-net-txqueuelen.rules
> > > > > > SUBSYSTEM=="net", ACTION=="add", KERNEL=="tap*", ATTR{tx_queue_len}="10000"
> > > > > > EOF
> > > > > >
> > > > > > or
> > > > > >
> > > > > > /sbin/ip link set tap<uuid> txqueuelen 10000
> > > > >
> > > > > Thanks for bringing that document to my attention. I will escalate it internally to ensure it is removed, as the content is incorrect.
> > > > > That document is part of the OVS-DPDK end-to-end troubleshooting guide, but we do not support the use of tap devices with OVS-DPDK.
> > > > >
> > > > > I have filed a downstream bug to correct this: https://bugzilla.redhat.com/show_bug.cgi?id=1983828
> > > > >
> > > > > The use of a tap device with OVS-DPDK is highly inefficient, as the tap device is not DPDK accelerated and is instead handled on the main thread of the ovs-vswitchd process.
> > > > > This is severely limited in performance, and under heavy traffic load it can cause issues with programming OpenFlow rules.
> > > > >
> > > > > >
> > > > > > I will try to bump up both libvirt/rx_queue_size and
> > > > > > libvirt/tx_queue_size to 1024; I am just not sure about the difference
> > > > > > between the udev/ip approach above and the corresponding libvirt option.
> > > > > >
> > > > > Ignoring the fact that setting parameters on tap devices via udev would be unsupported in vendor distributions of openstack, the main
> > > > > difference is that doing it correctly with the config options will work on vhost-user ports and any virtio backend that supports them, whereas
> > > > > the udev approach will only work with tap devices.
> > > > >
> > > > > The udev rule is altering the parameters of the tap device:
> > > > >
> > > > > cat <<'EOF'>/etc/udev/rules.d/71-net-txqueuelen.rules
> > > > > SUBSYSTEM=="net", ACTION=="add", KERNEL=="tap*", ATTR{tx_queue_len}="10000"
> > > > > EOF
> > > > >
> > > > > but I'm not sure whether those changes will be visible in the guest, as it is not clear to me that they will alter the virtio-net-pci device frontend created by QEMU,
> > > > > which is what is presented to the guest. Setting the values in nova.conf will update the contents of the libvirt XML, and that will ensure both are set correctly.
> > > > >
> > > > > A queue length of 10000 is not one of the lengths supported by QEMU, so I'm not sure it will actually help. At most I suspect
> > > > > it will add additional buffering, but as far as I am aware the maximum queue length supported by QEMU is 1024.
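> > > > >
> > > > > So as a rough sketch, what I would set on the compute node is:
> > > > >
> > > > >   # /etc/nova/nova.conf
> > > > >   [libvirt]
> > > > >   rx_queue_size = 1024
> > > > >   tx_queue_size = 1024
> > > > >
> > > > >   # then restart the compute service; I believe running guests only
> > > > >   # pick up the new sizes after a hard reboot or a migration
> > > > >   systemctl restart nova-compute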
> > > > >
> > > > > > Also it seems by
> > > > > > default rx_queue_size and tx_queue_size would be None, which means
> > > > > > bumping them up to 1024 should help with packet drops?
> > > > > >
> > > > > > On Mon, Jul 19, 2021 at 1:18 PM Sean Mooney <smooney at redhat.com> wrote:
> > > > > > >
> > > > > > > On Mon, 2021-07-19 at 19:16 +0100, Sean Mooney wrote:
> > > > > > > > On Mon, 2021-07-19 at 12:54 -0500, hai wu wrote:
> > > > > > > > > I already set txqueuelen for that VM's tap interface, and enabled
> > > > > > > > > multi-queue for the VM, but its tap rxdrop count still randomly keeps
> > > > > > > > > increasing, dropping one packet every 10, 20 or 60 seconds (which
> > > > > > > > > makes sense to me, since those changes would only help with txdrops,
> > > > > > > > > not rxdrops, per my understanding).
> > > > > > > > >
> > > > > > > Multi-queue should help with both tx and rx drops, by the way.
> > > > > > >
> > > > > > > When you enable multi-queue we allocate 1 rx and 1 tx queue per VM CPU,
> > > > > > > which should allow the network backend to process more packets in parallel from the VM.
> > > > > > > If the network backend is overloaded and cannot process any more packets then adding more rx queues won't help,
> > > > > > > but provided your network backend is not the bottleneck it will.
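> > > > > > >
> > > > > > > For completeness, a rough sketch of how multi-queue is normally
> > > > > > > turned on (the image name, guest NIC name and queue count are
> > > > > > > only examples), though it sounds like you already have this:
> > > > > > >
> > > > > > >   # tag the image so nova creates one queue pair per vCPU
> > > > > > >   openstack image set --property hw_vif_multiqueue_enabled=true my-image
> > > > > > >   # inside the guest, spread traffic across the extra queues
> > > > > > >   ethtool -L eth0 combined 8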
> > > > > > > > > This is an idle VM. After migrating
> > > > > > > > > this idle VM to another idle OpenStack hypervisor, so it is the only
> > > > > > > > > VM running on one dedicated physical OpenStack hypervisor, it is now
> > > > > > > > > dropping 0.03 rx packets/s (averaged over 10 minutes).
> > > > > > > > >
> > > > > > > > > I am not aware of any way to configure rxqueuelen for the tap interface.
> > > > > > > > >
> > > > > > > > You configure the rx queue length the same way you configure the tx queue length:
> > > > > > > > https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.rx_queue_size
> > > > > > > > and the tx queue length is configured by https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.tx_queue_size which you presumably already have set.
> > > > > > > >
> > > > > > > > > I
> > > > > > > > > assume tap RX -> Hypervisor RX -> VM TX, correct? How to tune
> > > > > > > > RX on the tap is TX from the VM, yes.
> > > > > > > > RX drops normally mean the packets were dropped by the network backend for some reason,
> > > > > > > > e.g. OVS or the Linux bridge is discarding packets.
> > > > > > > > > rxqueuelen for the tap interface? If I log into this idle Linux test VM,
> > > > > > > > > I am not seeing any drops, either on its RX or TX.
> > > > > > > > >
> > > > > > > > > On Mon, Jul 19, 2021 at 12:03 PM Sean Mooney <smooney at redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon, 2021-07-19 at 11:39 -0500, hai wu wrote:
> > > > > > > > > > > There are random, very slow rxdrops on the network tap interfaces of
> > > > > > > > > > > certain OpenStack VMs. Please note that this is rxdrop, NOT txdrop.
> > > > > > > > > > >
> > > > > > > > > > > I know we can tune txqueuelen and multi-queue for the tap network
> > > > > > > > > > > interface txdrop issue, but is there any known way to tune for this
> > > > > > > > > > > tap network interface rxdrop issue?
> > > > > > > > > > I think an rx drop typically means the vswitch/kernel is dropping packets, so I think
> > > > > > > > > > any tuning you apply would have to be on the kernel side.
> > > > > > > > > >
> > > > > > > > > > With that said, you can configure the rx queue length, and enabling multi-queue will also result in
> > > > > > > > > > additional rx queues, so it may help, but no, I don't know of any one fix for this. You will have
> > > > > > > > > > to see what tuning works in your env for your given traffic profile.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Hai
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> >