Message-ID: <49F71A00.5090701@myri.com>
Date: Tue, 28 Apr 2009 11:00:16 -0400
From: Andrew Gallatin <gallatin@...i.com>
To: Herbert Xu <herbert@...dor.apana.org.au>
CC: David Miller <davem@...emloft.net>, brice@...i.com,
sgruszka@...hat.com, netdev@...r.kernel.org
Subject: Re: [PATCH] myri10ge: again fix lro_gen_skb() alignment
Herbert Xu wrote:
> On Mon, Apr 27, 2009 at 04:05:01PM +0800, Herbert Xu wrote:
>> On Fri, Apr 24, 2009 at 12:16:08PM -0400, Andrew Gallatin wrote:
>>> These results are indeed quite close, so the performance problem seems
>>> isolated to AMD CPUS, and perhaps due to the smaller caches.
>>> Do you have any AMD you can use as a receiver?
>> I now have an AMD with 512K cache to test this. Unfortunately
>> I'd just locked it up before I got a chance to do any serious
>> testing. So it might take a while.
>
> OK that's been fixed up. Indeed the AMD can't do wire speed.
> But still the performance seems comparable. Both of them sit
> between 6600Mb/s and 7100Mb/s. The sender is running at about
> 66% idle in either case.
It's strange: I still consistently see about 1Gb/s better performance
from LRO than GRO on this weak machine (6.5Gb/s LRO vs 5.5Gb/s GRO)
when binding everything to the same CPU (a sketch of the pinning
follows the mpstat output below). mpstat -P 0 shows roughly 10% more
time spent in "soft" when using GRO vs LRO:
GRO:
10:17:45  CPU  %user  %nice  %system  %iowait  %irq  %soft  %idle    intr/s
10:17:46    0   0.00   0.00    54.00     0.00  0.00  46.00   0.00  11754.00
10:17:47    0   0.00   0.00    54.00     0.00  1.00  45.00   0.00  11718.00
10:17:48    0   0.00   0.00    47.00     0.00  2.00  51.00   0.00  11639.00

LRO:
10:21:55  CPU  %user  %nice  %system  %iowait  %irq  %soft  %idle    intr/s
10:21:56    0   0.00   0.00    66.00     0.00  1.00  33.00   0.00  13228.00
10:21:57    0   0.00   0.00    65.35     0.00  1.98  32.67   0.00  13118.81
10:21:58    0   0.00   0.00    63.00     0.00  1.00  36.00   0.00  13238.00
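For reference, "binding everything to the same CPU" means pinning both
the NIC interrupt and the benchmark receiver to CPU 0. Here is a
minimal sketch of the process side (illustrative only, not from the
original runs; the interrupt side is just a CPU mask written to
/proc/irq/<irq>/smp_affinity):

/* pin.c: illustrative helper, not part of the original mail.
 * Pin the calling process to CPU 0, then exec the benchmark
 * receiver so it shares the CPU with the NIC interrupt. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	cpu_set_t set;

	if (argc < 2) {
		fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
		return 1;
	}
	CPU_ZERO(&set);
	CPU_SET(0, &set);			/* run on CPU 0 only */
	if (sched_setaffinity(0, sizeof(set), &set) != 0) {
		perror("sched_setaffinity");
		return 1;
	}
	execvp(argv[1], argv + 1);		/* e.g. ./pin netserver */
	perror("execvp");
	return 1;
}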
According to oprofile, the top 20 symbols by sample count when running GRO are:
CPU: AMD64 processors, speed 2050.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name   app name   symbol name
4382     30.5408  vmlinux      vmlinux    copy_user_generic_string
534       3.7218  myri10ge.ko  myri10ge   myri10ge_poll
463       3.2269  vmlinux      vmlinux    _raw_spin_lock
394       2.7460  vmlinux      vmlinux    rb_get_reader_page
382       2.6624  vmlinux      vmlinux    acpi_pm_read
356       2.4812  vmlinux      vmlinux    inet_gro_receive
293       2.0421  oprofiled    oprofiled  (no symbols)
268       1.8679  vmlinux      vmlinux    find_next_bit
268       1.8679  vmlinux      vmlinux    tg_shares_up
257       1.7912  vmlinux      vmlinux    ring_buffer_consume
247       1.7215  myri10ge.ko  myri10ge   myri10ge_alloc_rx_pages
247       1.7215  vmlinux      vmlinux    tcp_gro_receive
228       1.5891  vmlinux      vmlinux    __free_pages_ok
219       1.5263  vmlinux      vmlinux    skb_gro_receive
167       1.1639  vmlinux      vmlinux    skb_gro_header
149       1.0385  bash         bash       (no symbols)
141       0.9827  vmlinux      vmlinux    skb_copy_datagram_iovec
132       0.9200  vmlinux      vmlinux    rb_buffer_peek
129       0.8991  vmlinux      vmlinux    _raw_spin_unlock
123       0.8573  vmlinux      vmlinux    delay_tsc
Nothing really stands out for me. Here is LRO:
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name   app name   symbol name
4884     33.1164  vmlinux      vmlinux    copy_user_generic_string
721       4.8888  myri10ge.ko  myri10ge   myri10ge_poll
580       3.9327  vmlinux      vmlinux    _raw_spin_lock
409       2.7733  vmlinux      vmlinux    acpi_pm_read
306       2.0749  vmlinux      vmlinux    rb_get_reader_page
293       1.9867  oprofiled    oprofiled  (no symbols)
286       1.9392  myri10ge.ko  myri10ge   myri10ge_get_frag_header
253       1.7155  vmlinux      vmlinux    __lro_proc_segment
250       1.6951  vmlinux      vmlinux    rb_buffer_peek
247       1.6748  vmlinux      vmlinux    ring_buffer_consume
232       1.5731  vmlinux      vmlinux    __free_pages_ok
211       1.4307  myri10ge.ko  myri10ge   myri10ge_alloc_rx_pages
206       1.3968  vmlinux      vmlinux    tg_shares_up
175       1.1866  vmlinux      vmlinux    skb_copy_datagram_iovec
158       1.0713  vmlinux      vmlinux    find_next_bit
146       0.9900  vmlinux      vmlinux    lro_tcp_ip_check
131       0.8883  oprofile.ko  oprofile   op_cpu_buffer_read_entry
127       0.8611  vmlinux      vmlinux    delay_tsc
125       0.8476  bash         bash       (no symbols)
125       0.8476  vmlinux      vmlinux    _raw_spin_unlock
If I can't figure out why LRO is so much faster in some cases, then I
think maybe I'll just put together a patch which keeps LRO and does
GRO only if LRO is disabled. Kind of ugly, but better than losing
15% performance on some machines.
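Roughly, the receive path would then pick a path per frame, something
like this minimal sketch (illustrative only: the slice-state layout
and function names below are simplified stand-ins, not the real
myri10ge structures, and the real driver uses the frag-based LRO
entry points rather than lro_receive_skb()):

/* Hedged sketch of the proposed fallback: keep the inet_lro path
 * while LRO is enabled, and hand frames to the generic GRO engine
 * only when LRO is turned off (e.g. via ethtool). */
#include <linux/netdevice.h>
#include <linux/inet_lro.h>

struct myri10ge_rx_sketch {		/* stand-in for the slice state */
	struct net_device *dev;		/* netdev for this slice */
	struct napi_struct napi;	/* NAPI context, used by GRO */
	struct net_lro_mgr lro_mgr;	/* software-LRO state */
};

static void myri10ge_rx_done_sketch(struct myri10ge_rx_sketch *ss,
				    struct sk_buff *skb)
{
	if (ss->dev->features & NETIF_F_LRO)
		/* software LRO, flushed later with lro_flush_all() */
		lro_receive_skb(&ss->lro_mgr, skb, NULL);
	else
		/* LRO disabled: fall back to generic GRO */
		napi_gro_receive(&ss->napi, skb);
}

At the end of the NAPI poll the LRO branch would still need its usual
lro_flush_all() call, while the GRO branch gets flushed by the stack
when the driver calls napi_complete().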
Drew