Message-ID: <df3cf9f9-d40c-6d44-872b-8064c046bda6@solarflare.com>
Date: Thu, 15 Nov 2018 21:45:37 +0000
From: Edward Cree <ecree@...arflare.com>
To: Eric Dumazet <eric.dumazet@...il.com>,
<linux-net-drivers@...arflare.com>, <davem@...emloft.net>
CC: <netdev@...r.kernel.org>
Subject: Re: [PATCH v3 net-next 0/4] net: batched receive in GRO path
On 15/11/18 20:08, Eric Dumazet wrote:
> On 11/15/2018 10:43 AM, Edward Cree wrote:
>
>> Most of the packet isn't touched and thus won't be brought into cache.
>> Only the headers of each packet (worst-case let's say 256 bytes) will
>> be touched during batch processing, that's 16kB.
> You assume perfect use of the caches, but part of the cache has collisions.
I assume nothing; that's why I'm running lots of tests & benchmarks.
Remember that gains from batching are not only in the I$; the D$ is
also going to be used for things like route lookups and netfilter
progs, and locality for those is improved by batching.
It might be possible to use PMCs to get hard numbers on how I$ and D$
hit & eviction rates change, though I don't know how useful that would be.
> I am alarmed by the complexity added, for example in GRO, considering
> that we also added GRO for UDP.
This series doesn't really add complexity _in_ GRO; it's more a piece
on the outside that calls the GRO machinery slightly differently.
Drivers which just call the existing non-list-based entry points won't
even see any of this code.
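
To illustrate the distinction (a rough sketch only, not code from this
series; my_hw_next_rx_skb() is a made-up stand-in for the driver's RX
handling): a driver can keep feeding packets to GRO one at a time with
napi_gro_receive(), or it can accumulate skbs and hand the whole batch
over via the existing list-based entry point.

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/list.h>

/* Hypothetical driver helper: next completed RX skb, or NULL when done. */
struct sk_buff *my_hw_next_rx_skb(struct napi_struct *napi);

static int my_poll(struct napi_struct *napi, int budget)
{
        struct list_head rx_list;
        struct sk_buff *skb;
        int work_done = 0;

        INIT_LIST_HEAD(&rx_list);

        while (work_done < budget && (skb = my_hw_next_rx_skb(napi))) {
                /* Per-packet path (unchanged by this series):
                 *   napi_gro_receive(napi, skb);
                 * List-based path: accumulate, deliver the batch below.
                 */
                list_add_tail(&skb->list, &rx_list);
                work_done++;
        }

        if (!list_empty(&rx_list))
                netif_receive_skb_list(&rx_list);

        if (work_done < budget)
                napi_complete_done(napi, work_done);

        return work_done;
}
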
> I dunno, can you show us for example if a reassembly workload can benefit
> from all this stuff ?
Sure, I can try a UDP test with payload_size > MTU. (I can't think of a
way to force interleaving of fragments from different packets, though.)
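
For concreteness, something along these lines would do it (a userspace
sketch only; the destination address, port, and 4000-byte payload size
are arbitrary illustrations, not part of any existing test rig):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in dst = {
                .sin_family = AF_INET,
                .sin_port = htons(5001),        /* illustrative port */
        };
        char payload[4000];     /* > 1500-byte MTU, so each datagram fragments */
        int i;

        inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);  /* TEST-NET-1 address */
        memset(payload, 0xa5, sizeof(payload));

        for (i = 0; i < 100000; i++)
                sendto(fd, payload, sizeof(payload), 0,
                       (struct sockaddr *)&dst, sizeof(dst));

        close(fd);
        return 0;
}

With a 1500-byte MTU each datagram arrives as three fragments that the
receiver has to reassemble before the UDP socket ever sees it.
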
> If you present numbers for traffic that GRO handles just fine, it does not
> really make sense, unless your plan maybe is to remove GRO completely ?
That's just the easiest thing to test. It's rather harder to set up
tests that use e.g. IP options that GRO will baulk at. It's also not too
easy to create traffic with the kind of flow interleaving that DDoS
scenarios would present, as that requires something like a many-to-one
rig with a switch, and I don't have enough lab machines for such a test.
I'm not planning to remove GRO. GRO is faster than batched receive.
Batched receive, however, works equally well for all traffic whether it's
GRO-able or not.
Thus both are worth having. This patch series is about using batched
receive for packets that GRO looks at and says "no thanks".
> We have observed at Google a constant increase of cpu cycles spent for TCP_RR
> on latest kernels. The gap is now about 20% with kernels from two years ago,
> and I could not yet find a faulty commit. It seems we add one little overhead after
> another, and every patch author is convinced he is doing the right thing.
>
> With multi-queue NICs, the vast majority of napi->poll() invocations handle only one packet.
> Unfortunately we can not really increase interrupt mitigations (ethtool -c)
> on NIC without sacrificing latencies.
At one point when I was working on the original batching patches, I tried
making them skip batching if poll() hadn't used up the entire NAPI budget
(as a signal that we're not BW-constrained), but it didn't seem to yield
any benefit. However, I could try it again, or try checking the list
length and handling packets singly if it's less than some threshold...?
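
Something like the following, perhaps (a minimal sketch of that
heuristic; it assumes the driver builds up an rx_list as in the earlier
sketch, and BATCH_THRESHOLD / my_rx_flush() are invented names):

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/list.h>

#define BATCH_THRESHOLD 8       /* arbitrary cut-off, purely illustrative */

/* @count: number of skbs on @rx_list, tracked by the caller's poll loop. */
static void my_rx_flush(struct napi_struct *napi, struct list_head *rx_list,
                        int count, int work_done, int budget)
{
        struct sk_buff *skb, *next;

        if (work_done < budget || count < BATCH_THRESHOLD) {
                /* Budget not exhausted, or only a few packets: probably not
                 * BW-constrained, so hand them to GRO one at a time.
                 */
                list_for_each_entry_safe(skb, next, rx_list, list) {
                        skb_list_del_init(skb);
                        napi_gro_receive(napi, skb);
                }
        } else {
                /* A full batch under load: deliver the whole list in one call. */
                netif_receive_skb_list(rx_list);
        }
}
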
If napi->poll() is only handling one packet, surely GRO can't do anything
useful either? (AIUI at the end of the poll the GRO lists get flushed.)
Is it maybe a sign that you're just spreading over too many queues??
-Ed