Message-ID: <D12839161ADD3A4B8DA63D1A134D084026E48BA6B0@ESGSCCMS0001.eapac.ericsson.se>
Date: Sat, 9 Apr 2011 11:51:23 +0800
From: Wei Gu <wei.gu@...csson.com>
To: Stephen Hemminger <shemminger@...tta.com>
CC: Eric Dumazet <eric.dumazet@...il.com>,
Alexander Duyck <alexander.h.duyck@...el.com>,
netdev <netdev@...r.kernel.org>,
"Kirsher, Jeffrey T" <jeffrey.t.kirsher@...el.com>
Subject: RE: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel
Hi Stephen,
Thanks for your reply.
I am aware that using the local bus and the local CPU socket can gain +20% performance; that is why I put eth10 on cores 24-31 (socket 3) and NUMA node 2/3.
Thanks
WeiGu
-----Original Message-----
From: Stephen Hemminger [mailto:shemminger@...tta.com]
Sent: Friday, April 08, 2011 10:49 PM
To: Wei Gu
Cc: Eric Dumazet; Alexander Duyck; netdev; Kirsher, Jeffrey T
Subject: Re: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel
On Fri, 8 Apr 2011 22:10:50 +0800
Wei Gu <wei.gu@...csson.com> wrote:
> Hi,
> I see what you mean.
> But as I described before, I start eth10 with 8 rx queues and 8 tx
> queues, and then bind each of these 8 tx/rx queues to one of CPU cores
> 24-31 (NUMA3), which I think should give the best performance in my
> case (it does on Linux 2.6.32): single queue -> single CPU.
> To describe the packet generator a little: I configure the IXIA to
> continuously increment the destination IP address towards the test
> server, so the packets are evenly distributed across the receive
> queues of eth10. According to the IXIA tools the transmit shape was
> really good, without many peaks.
>
> What I observed on Linux 2.6.38 during the test is that no ksoftirqd
> was stressed (< 03% SI on each of cores 24-31) while the packet loss
> happened, so we are not really stressing the CPU :). It looks like we
> are limited by some memory bandwidth (DMA) on this release.
>
> With the same test case on 2.6.32 there is no such problem at all. It
> runs stably at > 2 Mpps without rx_missed_errors. There is no HW
> limitation on this DL580.
>
>
> BTW, what are these "swapper" entries?
> + 0.80% swapper [ixgbe] [k] ixgbe_poll
> + 0.79% perf [ixgbe] [k] ixgbe_poll
> Why is ixgbe_poll attributed to swapper/perf?
>
> Thanks
> WeiGu
>
> -----Original Message-----
> From: Eric Dumazet [mailto:eric.dumazet@...il.com]
> Sent: Friday, April 08, 2011 8:57 PM
> To: Wei Gu
> Cc: Alexander Duyck; netdev; Kirsher, Jeffrey T
> Subject: RE: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel
>
> On Friday, 08 April 2011 at 20:19 +0800, Wei Gu wrote:
> > Hi again,
> > I did more testing with CONFIG_DMAR disabled, using both the ixgbe
> > shipped with 2.6.38 and the Intel-released 3.2.10/3.1.15.
> > In all these tests we can get >1 Mpps with 400-byte packets, but it
> > is not stable at all: there is a huge number of missed errors while
> > the CPU is 100% idle:
> > ethtool -S eth10 |grep rx_missed_errors
> >
> > rx_missed_errors: 76832040
> >
> > SUM: 1102212 ETH8: 0 ETH10: 1102212 ETH6: 0 ETH4: 0
> > SUM: 521841 ETH8: 0 ETH10: 521841 ETH6: 0 ETH4: 0
> > SUM: 426776 ETH8: 0 ETH10: 426776 ETH6: 0 ETH4: 0
> > SUM: 927520 ETH8: 0 ETH10: 927520 ETH6: 0 ETH4: 0
> > SUM: 1171995 ETH8: 0 ETH10: 1171995 ETH6: 0 ETH4: 0
> > SUM: 855980 ETH8: 0 ETH10: 855980 ETH6: 0 ETH4: 0
> >
> >
> > Do you know if there are other options in the kernel that can cause
> > a high rate of rx_missed_errors with low CPU usage? (No problem on
> > 2.6.32 with the same test case.)
> >
> > perf record:
> > + 69.74% swapper [kernel.kallsyms] [k] poll_idle
> > + 11.62% swapper [kernel.kallsyms] [k] intel_idle
> > + 0.80% swapper [ixgbe] [k] ixgbe_poll
> > + 0.79% perf [ixgbe] [k] ixgbe_poll
> > + 0.77% perf [kernel.kallsyms] [k] skb_copy_bits
> > + 0.64% swapper [kernel.kallsyms] [k] skb_copy_bits
> > + 0.48% perf [kernel.kallsyms] [k] __kmalloc_node_track_caller
> > + 0.44% swapper [kernel.kallsyms] [k] __kmalloc_node_track_caller
> > + 0.36% swapper [kernel.kallsyms] [k] kmem_cache_alloc_node
> > + 0.35% swapper [kernel.kallsyms] [k] kfree
> > + 0.35% perf [kernel.kallsyms] [k] kmem_cache_alloc_node
> >
>
>
> Make sure enough CPUs serve interrupts, _before_ even starting your stress test.
>
> Then, make sure traffic is distributed to many different queues.
> If a single flow is used, it probably uses a single queue ->single CPU.
>
> Say you have irq affinities set to fffffffffffff (all cpus able to
> serve IRQ X,Y,Z,T,...)
>
> Then you have a network burst (because you start your packet generator at full rate), spread over many queues.
>
> CPU0 takes hard interrupt for queue 0, eth8, and queues NAPI mode.
> CPU0 takes hard interrupt for queue 0, eth10, and queues NAPI mode.
> CPU0 takes hard interrupt for queue 1, eth8, and queues NAPI mode.
> CPU0 takes hard interrupt for queue 1, eth10, and queues NAPI mode.
> CPU0 takes hard interrupt for queue 2, eth8, and queues NAPI mode.
> CPU0 takes hard interrupt for queue 2, eth10, and queues NAPI mode.
> ...
> CPU0 takes hard interrupt for queue X, eth8, and queues NAPI mode.
> ...
>
> Then softirq can start, and only CPU0 is able to handle NAPI for all the queued devices. You are stuck, with CPU0 never leaving ksoftirqd.
>
> NAPI handling is always performed on the CPU that received the hardware interrupt, until we exit NAPI (and rearm interrupt delivery).
> It cannot migrate to an "idle cpu".
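To make the scenario quoted above concrete, here is a rough Python sketch of it. It is only a toy model, not kernel code: it assumes that a queue's NAPI poll runs on whichever CPU took that queue's hard interrupt, and the function name, the queue tuples and the "queue n on CPU 24 + n" layout are made up for illustration.

```python
from collections import Counter

def napi_poll_load(irq_cpu_by_queue):
    """Count how many NAPI contexts each CPU has to poll.

    Model assumption: NAPI polling for a queue runs on the CPU that
    took its hard interrupt and stays there until NAPI exits.
    """
    return Counter(irq_cpu_by_queue.values())

# 8 queues on each of two interfaces, as in the quoted example.
queues = [(dev, q) for dev in ("eth8", "eth10") for q in range(8)]

# Burst case: affinity allows every CPU, but CPU0 takes every hard interrupt.
burst = {queue: 0 for queue in queues}

# Pinned case (hypothetical layout): eth10 queue n is served by CPU 24 + n.
pinned = {("eth10", q): 24 + q for q in range(8)}

burst_load = napi_poll_load(burst)
pinned_load = napi_poll_load(pinned)

print("burst :", dict(burst_load))
print("pinned:", dict(pinned_load))
```

In the burst case all 16 queues end up polled by CPU0, while in the pinned case no CPU polls more than one queue.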
For performance, you need to assign each network interrupt to a single CPU. There is no load balancing effect in the IRQ controller.
If you have a multi-socket system, it is a good idea to put the NIC's IRQs on the same socket as its bus interface. Multi-socket systems are really NUMA, and putting an IRQ on a non-local CPU has a measurable impact.
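
As a sketch of the one-IRQ-per-CPU assignment, the Python below builds a single-CPU mask for each queue interrupt in the hex format that /proc/irq/<N>/smp_affinity takes (comma-separated 32-bit groups). The IRQ numbers 100-107 are placeholders; the real ones have to be looked up in /proc/interrupts. The cpulist "24-31" is the core range from this thread; on a PCI NIC it can usually be read from /sys/class/net/<iface>/device/local_cpulist, but treat that path as an assumption for your system. The helper names are made up, and the code only computes the plan; it does not write anything.

```python
def parse_cpulist(cpulist):
    """Expand a kernel-style cpulist such as "24-31" or "0,2,4-6"."""
    cpus = []
    for part in cpulist.strip().split(","):
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus

def affinity_mask(cpu):
    """Hex mask with only `cpu` set, as comma-separated 32-bit groups."""
    mask = 1 << cpu
    groups = []
    while True:
        groups.append(f"{mask & 0xFFFFFFFF:08x}")
        mask >>= 32
        if mask == 0:
            break
    # Most significant group comes first.
    return ",".join(reversed(groups))

def plan_affinity(irqs, cpulist):
    """Pair each IRQ with exactly one CPU from the local cpulist."""
    irqs = list(irqs)
    cpus = parse_cpulist(cpulist)
    if len(irqs) > len(cpus):
        raise ValueError("more IRQs than local CPUs")
    return {irq: affinity_mask(cpu) for irq, cpu in zip(irqs, cpus)}

# Placeholder IRQ numbers; cores 24-31 as used for eth10 in this thread.
plan = plan_affinity(range(100, 108), "24-31")
for irq, mask in plan.items():
    # Applying it would mean writing `mask` to /proc/irq/<irq>/smp_affinity as root.
    print(f"IRQ {irq}: {mask}")
```

Any irqbalance daemon would have to be stopped first, otherwise it may rewrite the masks.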