Message-ID: <53c61113-dfce-88c6-7711-a308fcf451ad@solarflare.com>
Date: Thu, 15 Nov 2018 18:43:56 +0000
From: Edward Cree <ecree@...arflare.com>
To: Eric Dumazet <eric.dumazet@...il.com>,
<linux-net-drivers@...arflare.com>, <davem@...emloft.net>
CC: <netdev@...r.kernel.org>
Subject: Re: [PATCH v3 net-next 0/4] net: batched receive in GRO path
On 15/11/18 07:22, Eric Dumazet wrote:
> On 11/14/2018 10:07 AM, Edward Cree wrote:
>> Conclusion:
>> * TCP b/w is 16.5% faster for traffic which cannot be coalesced by GRO.
> But only for traffic that actually was a perfect GRO candidate, right?
>
> Now what happens if all the packets you are batching are hitting different TCP sockets?
The batch is already split up by the time it hits TCP sockets; batching
currently only goes as far as ip_sublist_rcv_finish(), which calls
dst_input(skb) in a loop. So as long as the packets are all for the
same dst IP, we should get all of this gain.
If the packets have different dst IP addresses then we split the batch
slightly earlier, in ip_list_rcv_finish(), but that shouldn't make much
difference; I expect we'll still get most of this gain, since there is
a lot of the stack (layer 2 stuff, taps, etc.) that we still traverse
as a batch.
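For reference, the split point looks roughly like this (paraphrasing
ip_sublist_rcv_finish() from net/ipv4/ip_input.c, slightly simplified):

#include <linux/skbuff.h>
#include <net/dst.h>

static void ip_sublist_rcv_finish(struct list_head *head)
{
	struct sk_buff *skb, *next;

	list_for_each_entry_safe(skb, next, head, list) {
		/* Detach the skb from the batch; from dst_input()
		 * onwards (and hence by the time TCP sees it) each
		 * packet is processed on its own.
		 */
		skb_list_del_init(skb);
		dst_input(skb);
	}
}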
> By the time we build a list of 64 packets, the first packets in the list won't be
> in L1 cache any more (32 KB 8-way associative typically), and we will probably have cache thrashing.
Most of the packet isn't touched and thus won't be brought into cache.
Only the headers of each packet (worst case, say 256 bytes) will be
touched during batch processing; for a 64-packet batch that's 16 kB.
And not all at once: by the time we touch the later cachelines of a
packet we'll be done with the earlier ones, except maybe in cases
where GRO decides very late on that it can't coalesce.
And since the alternative is thrashing the I$, I don't think there's
an a priori argument that this will hurt; my tests seem to indicate
that it's OK and that we gain more from better I$ usage than we lose
from worse D$ usage patterns.
If you think there are cases in which the latter will dominate, please
suggest some tests that embody them; I'm happy to keep running
experiments. Also, you could come up with an analogue of patch #2 for
whatever HW you have (it shouldn't be difficult), allowing you to run
your own tests (e.g. if you have bigger/more powerful test rigs than
I have access to ;-)
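For illustration, such a driver-side conversion is roughly the
following shape. Everything with a myhw_ prefix here is invented, and
napi_gro_receive_list() is a stand-in name for the series' list-GRO
entry point, so check the actual patches for the real API:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

static int myhw_napi_poll(struct napi_struct *napi, int budget)
{
	LIST_HEAD(rx_list);
	struct sk_buff *skb;
	int work_done = 0;

	/* Accumulate completed RX packets into a list instead of
	 * handing each one to GRO individually.
	 */
	while (work_done < budget &&
	       (skb = myhw_next_rx_skb(napi)) != NULL) {
		list_add_tail(&skb->list, &rx_list);
		work_done++;
	}

	/* Pass the whole batch into the GRO path in one call. */
	if (!list_empty(&rx_list))
		napi_gro_receive_list(napi, &rx_list);

	if (work_done < budget)
		napi_complete_done(napi, work_done);

	return work_done;
}

The point being that the driver side is just "collect, then hand
over"; all the interesting work stays in the core stack.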
-Ed