Message-ID: <49F87188.9000904@cosmosbay.com>
Date: Wed, 29 Apr 2009 17:26:00 +0200
From: Eric Dumazet <dada1@...mosbay.com>
To: Andrew Gallatin <gallatin@...i.com>
CC: Herbert Xu <herbert@...dor.apana.org.au>,
David Miller <davem@...emloft.net>, brice@...i.com,
sgruszka@...hat.com, netdev@...r.kernel.org
Subject: Re: [PATCH] myri10ge: again fix lro_gen_skb() alignment

Andrew Gallatin wrote:
> Eric Dumazet wrote:
>> Andrew Gallatin wrote:
>>> Andrew Gallatin wrote:
>>>> For variety, I grabbed a different "slow" receiver. This is another
>>>> 2 CPU machine, but a dual-socket single-core Opteron (Tyan S2895):
>>>>
>>>> processor : 0
>>>> vendor_id : AuthenticAMD
>>>> cpu family : 15
>>>> model : 37
>>>> model name : AMD Opteron(tm) Processor 252
>>> <...>
>>>> The sender was an identical machine running an ancient RHEL4 kernel
>>>> (2.6.9-42.ELsmp) and our downloadable (backported) driver.
>>>> (http://www.myri.com/ftp/pub/Myri10GE/myri10ge-linux.1.4.4.tgz)
>>>> I disabled LRO on the sender.
>>>>
>>>> Binding the IRQ to CPU0 and the netserver to CPU1, I see 8.1Gb/s
>>>> with LRO and 8.0Gb/s with GRO.
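
[ For anyone reproducing this, I assume the binding was done with
  something like:

	echo 1 > /proc/irq/<NIC irq>/smp_affinity    # NIC interrupt -> CPU0
	netperf -t TCP_SENDFILE -H <receiver> -T ,1  # remote netserver -> CPU1

  the "cpu bind" tag in the netperf banner below suggests -T was used,
  but the exact invocation is a guess on my part. ]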
>>> With the recent LKML patch that fixes idle CPU time accounting
>>> applied, it is again possible to trust netperf's service demand
>>> (based on %CPU). So here is the raw netperf output for LRO and GRO,
>>> bound as above.
>>>
>>> TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
>>> hail1-m.sw.myri.com (10.0.130.167) port 0 AF_INET : cpu bind
>>> Recv   Send    Send                          Utilization       Service Demand
>>> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
>>> Size   Size    Size     Time     Throughput  local    remote   local   remote
>>> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
>>>
>>> LRO:
>>> 87380  65536   65536    60.00    8279.36     8.10     77.55    0.160   1.535
>>> GRO:
>>> 87380  65536   65536    60.00    8053.19     7.86     85.47    0.160   1.739
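
[ Sanity check on reading the service demand column: this receiver has
  2 CPUs, so 77.55% utilization is about 1.551e6 CPU-us per second,
  and 8279.36 * 10^6 bits/s is about 1.01e6 KB/s (taking KB = 1024
  bytes); 1.551e6 / 1.01e6 ~= 1.53 us/KB, which matches the LRO remote
  service demand above. ]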
>>>
>>> The difference is bigger if you disable TCP timestamps (and thus
>>> shrink the packet headers down so they require fewer cachelines):
>>> LRO:
>>> 87380  65536   65536    60.02    7753.55     8.01     74.06    0.169   1.565
>>> GRO:
>>> 87380  65536   65536    60.02    7535.12     7.27     84.57    0.158   1.839
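
[ Presumably done by setting net.ipv4.tcp_timestamps=0 (e.g. via
  "sysctl -w net.ipv4.tcp_timestamps=0") on both hosts; the exact knob
  used here is my assumption. ]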
>>>
>>>
>>> As you can see, even though the raw bandwidth is very close, the
>>> service demand makes it clear that GRO is more expensive
>>> than LRO. I just wish I understood why.
>>>
>>
>> What do the "vmstat 1" outputs look like on both tests? Any
>> difference in, say, context switches?
>
> Not much difference is apparent from vmstat, except for a lower load
> and a slightly higher IRQ rate with LRO:
>
> LRO:
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>  r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy id wa st
>  1  0      0 676960  19280 209812    0    0     0     0 14817    24  0 73 27  0  0
>  1  0      0 677084  19280 209812    0    0     0     0 14834    20  0 73 27  0  0
>  1  0      0 676916  19280 209812    0    0     0     0 14833    16  0 74 26  0  0
>
> GRO:
>  r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy id wa st
>  1  0      0 678244  18008 209784    0    0     0    24 14288    32  0 84 16  0  0
>  1  0      0 678268  18008 209788    0    0     0     0 14403    22  0 85 15  0  0
>  1  0      0 677956  18008 209788    0    0     0     0 14331    20  0 84 16  0  0
>
> The real difference is visible mainly from mpstat on the CPU handling
> the interrupts, where you can see that softirq time is much higher
> with GRO:
>
> LRO:
> 07:15:16  CPU  %user  %nice   %sys  %iowait  %irq  %soft  %steal  %idle    intr/s
> 07:15:17    0   0.00   0.00   0.00     0.00  0.00  45.00    0.00  55.00  12907.92
> 07:15:18    0   0.00   0.00   1.00     0.00  2.00  43.00    0.00  54.00  12707.92
> 07:15:19    0   0.00   0.00   1.00     0.00  0.00  46.00    0.00  53.00  12825.00
>
> GRO:
> 07:11:59  CPU  %user  %nice   %sys  %iowait  %irq  %soft  %steal  %idle    intr/s
> 07:12:00    0   0.00   0.00   0.00     0.00  0.99  66.34    0.00  32.67  12242.57
> 07:12:01    0   0.00   0.00   0.00     0.00  1.01  66.67    0.00  32.32  12220.00
> 07:12:02    0   0.00   0.00   0.99     0.00  0.99  65.35    0.00  32.67  12336.00
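
[ Per-CPU output like the above comes from something like
  "mpstat -P 0 1", i.e. one-second samples for CPU0; the exact
  invocation used here is a guess. ]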
>
>
> So it looks like something GRO is doing in softirq context is more
> expensive than what LRO is doing.

Sure, probably more cache misses or something...
You could try a longer oprofile session (with at least one million
samples) and:

	opannotate -a vmlinux >/tmp/FILE

Then look at 3 or 4 suspect functions: inet_gro_receive(),
tcp_gro_receive(), skb_gro_receive(), skb_gro_header().
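
If it helps, a complete session on this box would look roughly like
the following; the cache-miss event name is taken from oprofile's AMD
K8 event list, so please check "ophelp" first, as the exact spelling
is from memory:

	opcontrol --vmlinux=/path/to/vmlinux
	opcontrol --event=DATA_CACHE_MISSES:10000
	opcontrol --reset
	opcontrol --start
	# ... run the 60 second netperf test here ...
	opcontrol --stop
	opreport -l vmlinux | head -30      # top kernel symbols
	opannotate -a vmlinux >/tmp/FILE    # annotated assembly
	opcontrol --shutdown

That should give enough samples to see whether the extra softirq time
is spent in the gro functions themselves or on cache misses when
touching the headers.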