Message-ID: <53c61113-dfce-88c6-7711-a308fcf451ad@solarflare.com>
Date: Thu, 15 Nov 2018 18:43:56 +0000
From: Edward Cree <ecree@...arflare.com>
To: Eric Dumazet <eric.dumazet@...il.com>,
<linux-net-drivers@...arflare.com>, <davem@...emloft.net>
CC: <netdev@...r.kernel.org>
Subject: Re: [PATCH v3 net-next 0/4] net: batched receive in GRO path
On 15/11/18 07:22, Eric Dumazet wrote:
> On 11/14/2018 10:07 AM, Edward Cree wrote:
>> Conclusion:
>> * TCP b/w is 16.5% faster for traffic which cannot be coalesced by GRO.
> But only for traffic that actually was a perfect GRO candidate, right?
>
> Now what happens if all the packets you are batching are hitting different TCP sockets?
The batch is already split up by the time it hits TCP sockets; batching
currently only goes as far as ip_sublist_rcv_finish(), which calls
dst_input(skb) in a loop. So as long as the packets are all for the
same dst IP, we should get all of this gain.
If the packets have different dst IP addresses then we split the batch
slightly earlier, in ip_list_rcv_finish(), but that shouldn't make much
difference; I expect we'll still get most of this gain, since there is
a lot of the stack (layer 2 stuff, taps, etc.) that we still traverse
as a batch.
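For reference, the split point looks roughly like this (paraphrasing
ip_sublist_rcv_finish() from net/ipv4/ip_input.c, slightly simplified):

#include <linux/skbuff.h>
#include <net/dst.h>

static void ip_sublist_rcv_finish(struct list_head *head)
{
	struct sk_buff *skb, *next;

	list_for_each_entry_safe(skb, next, head, list) {
		/* Detach the skb from the batch; from dst_input()
		 * onwards (and hence by the time TCP sees it) each
		 * packet is processed on its own.
		 */
		skb_list_del_init(skb);
		dst_input(skb);
	}
}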
> By the time we build a list of 64 packets, the first packets in the list won't be
> in L1 cache any more (32 KB 8-way associative typically), and we will probably have cache thrashing.
Most of the packet isn't touched and thus won't be brought into cache.
Only the headers of each packet (worst case, say 256 bytes) will be
touched during batch processing; for a 64-packet batch that's 16 kB.
And not all at once: by the time we touch the later cachelines of a
packet we'll be done with the earlier ones, except maybe in cases
where GRO decides very late on that it can't coalesce.
And since the alternative is thrashing the I$, I don't think there's
an a priori argument that this will hurt; my tests seem to indicate
that it's OK and that we gain more from better I$ usage than we lose
from worse D$ usage patterns.
If you think there are cases in which the latter will dominate, please
suggest some tests that embody them; I'm happy to keep running
experiments. Also, you could come up with an analogue of patch #2 for
whatever HW you have (it shouldn't be difficult), allowing you to run
your own tests (e.g. if you have bigger/more powerful test rigs than
I have access to ;-)
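For illustration, such a driver-side conversion is roughly the
following shape. Everything with a myhw_ prefix here is invented, and
napi_gro_receive_list() is a stand-in name for the series' list-GRO
entry point, so check the actual patches for the real API:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

static int myhw_napi_poll(struct napi_struct *napi, int budget)
{
	LIST_HEAD(rx_list);
	struct sk_buff *skb;
	int work_done = 0;

	/* Accumulate completed RX packets into a list instead of
	 * handing each one to GRO individually.
	 */
	while (work_done < budget &&
	       (skb = myhw_next_rx_skb(napi)) != NULL) {
		list_add_tail(&skb->list, &rx_list);
		work_done++;
	}

	/* Pass the whole batch into the GRO path in one call. */
	if (!list_empty(&rx_list))
		napi_gro_receive_list(napi, &rx_list);

	if (work_done < budget)
		napi_complete_done(napi, work_done);

	return work_done;
}

The point being that the driver side is just "collect, then hand
over"; all the interesting work stays in the core stack.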
-Ed