netdev - Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)

Message-ID: <96ff3930904300207l4ecfe90byd6cce3f56ce4e113@mail.gmail.com>
Date:	Thu, 30 Apr 2009 11:07:35 +0200
From:	Jens Låås <jelaas@...il.com>
To:	Andrew Dickinson <andrew@...dna.net>, netdev@...r.kernel.org
Subject: Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)

2009/4/30, Andrew Dickinson <andrew@...dna.net>:
> Howdy list,
>
>  Background...
>  I'm trying to evaluate a new system for routing performance for some
>  custom packet modification that we do.  To start, I'm trying to get a
>  high-water mark of routing performance without our custom cruft in the
>  middle.  The hardware setup is a dual-package Nehalem box (X5550,
>  Hyper-Threading disabled) with a dual 10G intel card (pci-id:
>  8086:10fb).  Because this NIC is freakishly new, I'm running the
>  latest torvalds kernel in order to get the ixgbe driver to identify it
>  (<sigh>).  With HT off, I've got 8 cores in the system.  For the sake
>  of reducing the number of variables that I'm dealing with, I'm only
>  using one of the NICs to start with and simply routing packets back
>  out the single 10G NIC.

OK.

We have done quite a bit of 10G testing.
I'll comment based on our experience.

>
>  Interrupts...
>  I've disabled irqbalance and I'm explicitly pinning interrupts, one
>  per core, as follows:

Yes, setting affinity is a must for high performance.

It is also important that TX affinity matches RX affinity, so that
TX completion runs on the same CPU as RX.
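
A minimal sketch of that pairing, assuming the IRQ layout from the report (eth2-rx-0..7 on IRQs 57-64, eth2-tx-0..7 on IRQs 65-72) and one queue pair per CPU. The helper names are made up for illustration, and the script only prints the writes it would perform:

```shell
# Hypothetical helper: hex smp_affinity mask selecting a single CPU.
cpu_mask() {
    printf '%x\n' $((1 << $1))
}

# Hypothetical helper: prints "IRQ MASK" lines that pin RX queue N and
# TX queue N to the same CPU N.
# usage: paired_affinity_plan RX_IRQ_BASE TX_IRQ_BASE NUM_QUEUES
paired_affinity_plan() {
    q=0
    while [ "$q" -lt "$3" ]; do
        m=$(cpu_mask "$q")
        echo "$(( $1 + q )) $m"
        echo "$(( $2 + q )) $m"
        q=$((q + 1))
    done
}

# Dry run: show the commands instead of writing to /proc.
paired_affinity_plan 57 65 8 | while read -r irq mask; do
    printf 'echo %s > /proc/irq/%s/smp_affinity\n' "$mask" "$irq"
done
```

The IRQ numbers differ between boots and machines, so they should be read from /proc/interrupts rather than hard-coded.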

>
>  -bash-3.2# for x in 57 65; do for i in `seq 0 7`; do echo $i | awk
>  '{printf("%X", (2^$1));}' > /proc/irq/$(($i + $x))/smp_affinity; done;
>  done
>
>  -bash-3.2# for i in `seq 57 72`; do cat /proc/irq/$i/smp_affinity; done
>  0001
>  0002
>  0004
>  0008
>  0010
>  0020
>  0040
>  0080
>  0001
>  0002
>  0004
>  0008
>  0010
>  0020
>  0040
>  0080
>
>  -bash-3.2# cat /proc/interrupts  | grep eth2
>   57:      77941          0          0          0          0          0          0          0   PCI-MSI-edge      eth2-rx-0
>   58:         92      59682          0          0          0          0          0          0   PCI-MSI-edge      eth2-rx-1
>   59:         92          0      21716          0          0          0          0          0   PCI-MSI-edge      eth2-rx-2
>   60:         92          0          0      14356          0          0          0          0   PCI-MSI-edge      eth2-rx-3
>   61:         92          0          0          0      91483          0          0          0   PCI-MSI-edge      eth2-rx-4
>   62:         92          0          0          0          0      19495          0          0   PCI-MSI-edge      eth2-rx-5
>   63:         92          0          0          0          0          0         24          0   PCI-MSI-edge      eth2-rx-6
>   64:         92          0          0          0          0          0          0      19605   PCI-MSI-edge      eth2-rx-7
>   65:      94709          0          0          0          0          0          0          0   PCI-MSI-edge      eth2-tx-0
>   66:         92         24          0          0          0          0          0          0   PCI-MSI-edge      eth2-tx-1
>   67:         98          0         24          0          0          0          0          0   PCI-MSI-edge      eth2-tx-2
>   68:         92          0          0     100208          0          0          0          0   PCI-MSI-edge      eth2-tx-3
>   69:         92          0          0          0         24          0          0          0   PCI-MSI-edge      eth2-tx-4
>   70:         92          0          0          0          0         24          0          0   PCI-MSI-edge      eth2-tx-5
>   71:         92          0          0          0          0          0     144566          0   PCI-MSI-edge      eth2-tx-6
>   72:         92          0          0          0          0          0          0         24   PCI-MSI-edge      eth2-tx-7
>   73:          2          0          0          0          0          0          0          0   PCI-MSI-edge      eth2:lsc
>
>  The output of /proc/interrupts is hinting at the problem that I'm
>  having...  The TX queues which are being chosen are only 0, 3, and 6.
>  The flow of traffic that I'm generating is random source/dest pairs,
>  each within a /24, so I don't think that I'm sending data that should
>  be breaking the skb_tx_hash() routine.

The RX side looks good. The TX side looks like what we also got with vanilla Linux.

What we do is patch all drivers with a custom select_queue function
that selects the same outgoing queue as the incoming queue. With a
one-to-one mapping of queues to CPUs you can also use the processor ID.

This is how we get good performance.
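
To make the mapping concrete, here is a small userspace model of the selection rule. It is not the kernel patch itself: select_tx_queue is a hypothetical name, and folding the index with a modulo is my own assumption for the case where there are fewer TX queues than RX queues.

```shell
# Userspace model of the selection rule, not kernel code.
# select_tx_queue is a hypothetical name for illustration.
# usage: select_tx_queue INDEX NUM_TX_QUEUES
#   INDEX is the RX queue the packet arrived on, or the current CPU id
#   when queues map one-to-one onto CPUs.
select_tx_queue() {
    # Assumption: fold the index into range if TX queues are fewer.
    echo $(( $1 % $2 ))
}

# With 8 RX and 8 TX queues, every RX queue gets its own TX queue.
rx=0
while [ "$rx" -lt 8 ]; do
    echo "rx-$rx -> tx-$(select_tx_queue "$rx" 8)"
    rx=$((rx + 1))
done
```

Because the mapping does not depend on a hash of the packet headers, traffic that is spread evenly over the RX queues stays spread evenly over the TX queues.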

Another approach we are looking at is an abstraction to help with
the queue mapping (we call it 'flowtrunk'), which is then configurable
from userspace.


>
>  Further, when I run top, I see that almost all of the interrupt
>  processing is happening on a single cpu.
>  Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>  Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>  Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
>  Cpu3  :  0.0%us,  0.0%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.3%hi,  0.7%si,  0.0%st
>  Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>  Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>  Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa, 19.3%hi, 80.7%si,  0.0%st
>  Cpu7  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
>
>  This appears to be due to 'tx'-based activity... if I change my route
>  table to blackhole the traffic, the CPUs are nearly idle.
>
>  My next thought was to try multiqueue...
>  -bash-3.2# ./tc/tc qdisc add dev eth2 root handle 1: multiq
>  -bash-3.2# ./tc/tc qdisc show dev eth2
>  qdisc multiq 1: root refcnt 128 bands 8/128
>
>  With multiq scheduling, the CPU load evens out a bunch, but I still
>  have a soft-interrupt hot-spot (see CPU3 here.  Also note that only
>  CPUs 0, 3, and 6 are handling hardware interrupts.):
>  Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 69.9%id,  0.0%wa,  0.3%hi, 29.8%si,  0.0%st
>  Cpu1  :  0.0%us,  0.0%sy,  0.0%ni, 64.8%id,  0.0%wa,  0.0%hi, 35.2%si,  0.0%st
>  Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 76.5%id,  0.0%wa,  0.0%hi, 23.5%si,  0.0%st
>  Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,  4.8%id,  0.0%wa,  2.6%hi, 92.6%si,  0.0%st
>  Cpu4  :  0.3%us,  0.3%sy,  0.0%ni, 76.2%id,  0.3%wa,  0.0%hi, 22.8%si,  0.0%st
>  Cpu5  :  0.0%us,  0.0%sy,  0.0%ni, 49.4%id,  0.0%wa,  0.0%hi, 50.6%si,  0.0%st
>  Cpu6  :  0.0%us,  0.0%sy,  0.0%ni, 56.8%id,  0.0%wa,  1.0%hi, 42.3%si,  0.0%st
>  Cpu7  :  0.0%us,  0.0%sy,  0.0%ni, 51.6%id,  0.0%wa,  0.0%hi, 48.4%si,  0.0%st
>
>  However, what I see with multiqueue enabled is that I'm dropping 80%
>  of my traffic (which appears to be due to a large number of
>  'rx_missed_errors').
>
>  Any thoughts on what I'm doing wrong or where I should continue to look?

Changing the qdisc won't help, since every qdisc except pfifo_fast
serializes all CPUs onto one qdisc. pfifo_fast creates a separate qdisc
per TX queue.

If you don't want to patch the kernel, you can try increasing the queue
length of the pfifo_fast qdisc.
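
A minimal sketch of one way to do that, assuming the pfifo_fast limit follows the device's TX queue length as exposed in the standard sysfs attribute /sys/class/net/<dev>/tx_queue_len. The helper names and the SYSFS_NET override are made up for illustration:

```shell
# Assumption: the standard sysfs attribute /sys/class/net/<dev>/tx_queue_len.
# SYSFS_NET can be overridden, e.g. to try the helpers on a scratch directory.
: "${SYSFS_NET:=/sys/class/net}"

# Hypothetical helper: print the current TX queue length of a device.
get_txqueuelen() {
    cat "$SYSFS_NET/$1/tx_queue_len"
}

# Hypothetical helper: set the TX queue length of a device (needs root
# when pointed at the real sysfs tree).
set_txqueuelen() {
    echo "$2" > "$SYSFS_NET/$1/tx_queue_len"
}
```

As root, `set_txqueuelen eth2 10000` should have the same effect as `ip link set dev eth2 txqueuelen 10000`. The value 10000 is only an example, not a tested recommendation.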

Cheers,
Jens

>
>  -Andrew
>
> --
>  To unsubscribe from this list: send the line "unsubscribe netdev" in
>  the body of a message to majordomo@...r.kernel.org
>  More majordomo info at  http://vger.kernel.org/majordomo-info.html
>