netdev - RE: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel

Message-ID: <D12839161ADD3A4B8DA63D1A134D084026E48BA6B0@ESGSCCMS0001.eapac.ericsson.se>
Date:	Sat, 9 Apr 2011 11:51:23 +0800
From:	Wei Gu <wei.gu@...csson.com>
To:	Stephen Hemminger <shemminger@...tta.com>
CC:	Eric Dumazet <eric.dumazet@...il.com>,
	Alexander Duyck <alexander.h.duyck@...el.com>,
	netdev <netdev@...r.kernel.org>,
	"Kirsher, Jeffrey T" <jeffrey.t.kirsher@...el.com>
Subject: RE: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel

Hi Stephen,
Thanks for your reply.
I am aware that a local bus and local CPU socket would gain +20% performance, which is why I put eth10 on cores 24-31 (socket 3) and NUMA node 2/3.

Thanks
WeiGu

-----Original Message-----
From: Stephen Hemminger [mailto:shemminger@...tta.com]
Sent: Friday, April 08, 2011 10:49 PM
To: Wei Gu
Cc: Eric Dumazet; Alexander Duyck; netdev; Kirsher, Jeffrey T
Subject: Re: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel

On Fri, 8 Apr 2011 22:10:50 +0800
Wei Gu <wei.gu@...csson.com> wrote:

> Hi,
> I see what you mean.
> But as I described before, I start eth10 with 8 rx queues and 8 tx
> queues, and then bind each of these 8 tx/rx queues to one of CPU cores
> 24-31 (NUMA3), which I think should give the best performance in my
> case (this holds on Linux 2.6.32): single queue -> single CPU.
> To describe the packet generator a little: I configured the IXIA to
> continuously increment the destination IP address towards the test
> server, so the packets were evenly distributed across the receive
> queues of eth10. According to the IXIA tools the transmit shape was
> really good, without many peaks.
>
> What I observed on Linux 2.6.38 during the test is that no ksoftirqd
> was stressed (< 03% on SI for each core (24-31)) while the packet loss
> happened, so we are not really stressing the CPU :). It looks like we
> are limited by some memory bandwidth (DMA) in this release.
>
> With the same test case on 2.6.32 there is no such problem at all. It
> runs pretty stably at > 2Mpps without rx_missed_errors. There is no HW
> limitation on this DL580.
>
>
> BTW, what are these "swapper" entries?
> +      0.80%          swapper  [ixgbe]                    [k] ixgbe_poll
> +      0.79%             perf  [ixgbe]                    [k] ixgbe_poll
> Why is ixgbe_poll attributed to swapper/perf?
>
> Thanks
> WeiGu
>
> -----Original Message-----
> From: Eric Dumazet [mailto:eric.dumazet@...il.com]
> Sent: Friday, April 08, 2011 8:57 PM
> To: Wei Gu
> Cc: Alexander Duyck; netdev; Kirsher, Jeffrey T
> Subject: RE: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel
>
> On Friday, 08 April 2011 at 20:19 +0800, Wei Gu wrote:
> > Hi again,
> > I did more testing with CONFIG_DMAR disabled, using both the ixgbe
> > shipped with 2.6.38 and the Intel-released 3.2.10/3.1.15.
> > In all these tests we can get >1Mpps with 400-byte packets, but it is
> > not stable at all: there is a huge number of missed errors with 100% CPU idle:
> > ethtool -S eth10 |grep rx_missed_errors
> >
> >         rx_missed_errors: 76832040
> >
> > SUM: 1102212 ETH8: 0  ETH10: 1102212 ETH6: 0 ETH4: 0
> > SUM: 521841 ETH8: 0  ETH10: 521841 ETH6: 0 ETH4: 0
> > SUM: 426776 ETH8: 0  ETH10: 426776 ETH6: 0 ETH4: 0
> > SUM: 927520 ETH8: 0  ETH10: 927520 ETH6: 0 ETH4: 0
> > SUM: 1171995 ETH8: 0  ETH10: 1171995 ETH6: 0 ETH4: 0
> > SUM: 855980 ETH8: 0  ETH10: 855980 ETH6: 0 ETH4: 0
> >
> >
> > Do you know if there are other options in the kernel that could cause
> > a high rate of rx_missed_errors with low CPU usage? (No problem on
> > 2.6.32 with the same test case.)
> >
> > perf  record:
> > +     69.74%          swapper  [kernel.kallsyms]          [k] poll_idle
> > +     11.62%          swapper  [kernel.kallsyms]          [k] intel_idle
> > +      0.80%          swapper  [ixgbe]                    [k] ixgbe_poll
> > +      0.79%             perf  [ixgbe]                    [k] ixgbe_poll
> > +      0.77%             perf  [kernel.kallsyms]          [k] skb_copy_bits
> > +      0.64%          swapper  [kernel.kallsyms]          [k] skb_copy_bits
> > +      0.48%             perf  [kernel.kallsyms]          [k] __kmalloc_node_track_caller
> > +      0.44%          swapper  [kernel.kallsyms]          [k] __kmalloc_node_track_caller
> > +      0.36%          swapper  [kernel.kallsyms]          [k] kmem_cache_alloc_node
> > +      0.35%          swapper  [kernel.kallsyms]          [k] kfree
> > +      0.35%             perf  [kernel.kallsyms]          [k] kmem_cache_alloc_node
> >
>
>
> Make sure enough CPUs serve interrupts, _before_ even starting your stress test.
>
> Then, make sure traffic is distributed to many different queues.
> If a single flow is used, it probably uses a single queue ->single CPU.
>
> Say you have irq affinities set to fffffffffffff  (all cpus able to
> serve IRQ X,Y,Z,T,...)
>
> Then you have a network burst (because you start your packet generator at full rate), spread across many queues.
>
> CPU0 takes hard interrupt for queue 0, eth8, and queues NAPI mode.
> CPU0 takes hard interrupt for queue 0, eth10, and queues NAPI mode.
> CPU0 takes hard interrupt for queue 1, eth8, and queues NAPI mode.
> CPU0 takes hard interrupt for queue 1, eth10, and queues NAPI mode.
> CPU0 takes hard interrupt for queue 2, eth8, and queues NAPI mode.
> CPU0 takes hard interrupt for queue 2, eth10, and queues NAPI mode.
> ...
> CPU0 takes hard interrupt for queue X, eth8, and queues NAPI mode.
> ...
>
> Then softirq can start, and only CPU0 is able to handle NAPI for all the queued devices. You are stuck, with CPU0 never leaving ksoftirqd.
>
> NAPI handling is always performed on the CPU that received the hardware interrupt, until we exit NAPI (and rearm interrupt delivery).
> It cannot migrate to an "idle cpu"
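
[Editorial sketch] One way to check the condition described above (enough CPUs serving interrupts before the test starts) is to look at the per-CPU counters in /proc/interrupts for the NIC's queue vectors. The Python below is a minimal sketch, not a tool used by anyone in this thread. It assumes the queue vectors are named like "eth10-TxRx-0"; the exact naming depends on the driver version, so treat that pattern and the helper names irq_distribution and cpus_serving as illustrative assumptions.

```python
def irq_distribution(text, device):
    """Parse /proc/interrupts-style text and return
    {vector_name: [count_on_cpu0, count_on_cpu1, ...]} for one device.

    Assumes vector names start with "<device>-" (e.g. "eth10-TxRx-0").
    """
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        fields = line.split()
        # Need "IRQ:", one count per CPU, and at least one trailing name.
        if len(fields) < ncpu + 2 or not fields[0].endswith(":"):
            continue
        name = fields[-1]
        if not name.startswith(device + "-"):
            continue
        result[name] = [int(c) for c in fields[1:1 + ncpu]]
    return result


def cpus_serving(counts):
    """Return the CPU indexes that have handled at least one interrupt."""
    return [cpu for cpu, n in enumerate(counts) if n > 0]


if __name__ == "__main__":
    with open("/proc/interrupts") as f:
        dist = irq_distribution(f.read(), "eth10")
    for name, counts in sorted(dist.items()):
        print(name, "served by CPUs", cpus_serving(counts))
```

If every queue vector reports the same single CPU, that matches the scenario above where one CPU ends up doing NAPI processing for all queues.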

For performance, you need to assign each network interrupt to a single CPU. There is no load balancing effect in the IRQ controller.

If you have a multi-socket system, it is a good idea to put the NICs' IRQs on the same socket as the bus interface. Multi-socket systems are really NUMA, and putting an IRQ on a non-local CPU has a measurable impact.
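
[Editorial sketch] The advice above (one CPU per network interrupt, on the NIC's local socket) comes down to writing a single-bit mask into /proc/irq/<N>/smp_affinity for each queue vector. The Python below is a minimal sketch that only computes the masks and prints the corresponding commands. The IRQ numbers are hypothetical placeholders and the helper names are illustrative; the CPU range 24-31 is taken from this thread. Masks are emitted as comma-separated 32-bit hex groups, the form the kernel uses for CPUs numbered 32 and above.

```python
def affinity_mask(cpu):
    """Return a hex CPU bitmask with only `cpu` set, formatted as
    comma-separated 32-bit groups (most significant group first)."""
    if cpu < 0:
        raise ValueError("cpu must be non-negative")
    mask = 1 << cpu
    groups = []
    while True:
        groups.append(f"{mask & 0xFFFFFFFF:08x}")
        mask >>= 32
        if mask == 0:
            break
    return ",".join(reversed(groups))


def plan_affinity(irqs, first_cpu):
    """Pair the i-th IRQ with CPU first_cpu + i; returns {irq: mask}."""
    return {irq: affinity_mask(first_cpu + i) for i, irq in enumerate(irqs)}


if __name__ == "__main__":
    # Hypothetical IRQ numbers for the 8 queue vectors of eth10;
    # the real ones must be read from /proc/interrupts.
    irqs = [98, 99, 100, 101, 102, 103, 104, 105]
    for irq, mask in plan_affinity(irqs, first_cpu=24).items():
        print(f"echo {mask} > /proc/irq/{irq}/smp_affinity")
```

Printing the commands rather than writing the files keeps the sketch side-effect free; a running irqbalance daemon may later overwrite manually set affinities, so it is usually stopped first.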


