Message-ID: <adf92243-689e-6013-293f-5464af317594@redhat.com>
Date: Tue, 10 Jan 2023 15:52:48 +0100
From: Jesper Dangaard Brouer <jbrouer@...hat.com>
To: Alexander H Duyck <alexander.duyck@...il.com>,
Jakub Kicinski <kuba@...nel.org>,
Jesper Dangaard Brouer <jbrouer@...hat.com>
Cc: brouer@...hat.com, netdev@...r.kernel.org,
"David S. Miller" <davem@...emloft.net>, edumazet@...gle.com,
pabeni@...hat.com
Subject: Re: [PATCH net-next 2/2] net: kfree_skb_list use kmem_cache_free_bulk
On 09/01/2023 23.10, Alexander H Duyck wrote:
> On Mon, 2023-01-09 at 11:34 -0800, Jakub Kicinski wrote:
>> On Mon, 9 Jan 2023 13:24:54 +0100 Jesper Dangaard Brouer wrote:
>>>> Also the lack of perf numbers is a bit of a red flag.
>>>>
>>>
>>> I have run performance tests, but as I tried to explain in the
>>> cover letter, for the qdisc use-case this code path is only activated
>>> when we have overflow at enqueue. Thus, this doesn't translate directly
>>> into a performance numbers, as TX-qdisc is 100% full caused by hardware
>>> device being backed up, and this patch makes us use less time on freeing
>>> memory.
>>
>> I guess it's quite subjective, so it'd be good to get a third opinion.
>> To me that reads like a premature optimization. Saeed asked for perf
>> numbers, too.
>>
>> Does anyone on the list want to cast the tie-break vote?
>
> I'd say there is some value to be gained by this. Basically it means
> less overhead for dropping packets if we picked a backed up Tx path.
>
Thanks.
I have microbenchmarks[1] of kmem_cache bulking, which I use to assess
the (best-case) expected gain from using the bulk APIs.
The module 'slab_bulk_test01' results at a bulk size of 16 elements:
kmem-in-loop Per elem: 109 cycles(tsc) 30.532 ns (step:16)
kmem-bulk Per elem: 64 cycles(tsc) 17.905 ns (step:16)
Thus, best-case expected gain is: 45 cycles(tsc) 12.627 ns.
- With the usual microbenchmark caveats
- Note that this covers both bulk alloc and free
[1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm
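The best-case gain quoted above is the difference between the two
per-element results. A minimal sketch of that arithmetic, using only the
numbers reported by slab_bulk_test01; the relative reduction and the
implied TSC rate are derived here for illustration and are not stated in
the original results:

```python
# Per-element costs reported by slab_bulk_test01 at bulk size 16.
loop_cycles, loop_ns = 109, 30.532   # kmem-in-loop
bulk_cycles, bulk_ns = 64, 17.905    # kmem-bulk

# Best-case expected gain per element.
gain_cycles = loop_cycles - bulk_cycles
gain_ns = round(loop_ns - bulk_ns, 3)

# Derived for illustration only (not part of the reported results):
# relative time reduction, and the TSC rate implied by cycles / ns.
reduction_pct = 100 * gain_ns / loop_ns
tsc_ghz = loop_cycles / loop_ns

print(f"gain: {gain_cycles} cycles(tsc) {gain_ns} ns")
print(f"reduction: {reduction_pct:.1f}%  implied TSC: {tsc_ghz:.2f} GHz")
```

Since the benchmark covers both alloc and free, a free-only path
should probably expect only part of this gain.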
>>> I have been using pktgen script ./pktgen_bench_xmit_mode_queue_xmit.sh
>>> which can inject packets at the qdisc layer (invoking __dev_queue_xmit).
>>> And then used perf-record to see overhead of SLUB (__slab_free is top#4)
>>> is reduced.
>>
>> Right, pktgen wasting time while still delivering line rate is not of
>> practical importance.
>
I should explain better how I cause the push-back without hitting
10Gbit/s line rate (as we/Linux cannot allocate SKBs fast enough for
this).
I'm testing this on a 10Gbit/s interface (driver ixgbe). The challenge
is that I need to overload the qdisc enqueue layer as that is triggering
the call to kfree_skb_list().
Linux with SKBs and qdisc, injecting with pktgen, is limited to producing
packets at (measured) 2,205,588 pps with a single TX-queue (and, when
scaling up, 1,951,771 pps per queue or 512 ns per pkt). Reminder:
10Gbit/s with 64-byte packets is 14.88 Mpps (or 67.2 ns per pkt).
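The per-packet time budgets above can be reproduced with a short
calculation. This is a sketch under one assumption that is not spelled
out in the text: the 10Gbit/s line-rate figure counts the standard 20
bytes of per-frame wire overhead (7 preamble + 1 start-of-frame
delimiter + 12 inter-frame gap) on top of the 64-byte frame.

```python
LINK_BPS = 10e9        # 10Gbit/s link
OVERHEAD_BYTES = 20    # assumption: preamble 7 + SFD 1 + inter-frame gap 12


def wire_ns_per_pkt(frame_bytes):
    """Time one frame occupies the wire, in nanoseconds."""
    return (frame_bytes + OVERHEAD_BYTES) * 8 / LINK_BPS * 1e9


def ns_per_pkt(pps):
    """Per-packet time budget for a given packet rate."""
    return 1e9 / pps


line_rate_ns = wire_ns_per_pkt(64)
line_rate_pps = 1e9 / line_rate_ns
pktgen_ns = ns_per_pkt(1951771)   # measured per-queue rate when scaling up

print(f"line rate: {line_rate_pps / 1e6:.2f} Mpps, {line_rate_ns:.1f} ns/pkt")
print(f"pktgen:    {pktgen_ns:.0f} ns/pkt")
```

The gap between roughly 512 ns and 67.2 ns per packet is why pktgen
through the qdisc layer cannot saturate the link on its own.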
The trick to trigger the qdisc push-back way earlier is Ethernet
flow-control (which is on by default).
I was a bit surprised to see this, but when using
pktgen_bench_xmit_mode_queue_xmit.sh on my testlab, the remote host was
pushing back a lot, resulting in only 256Kpps actually being sent on the
wire. Monitored with ethtool stats script[2].
[2]
https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> I suspect there are probably more real world use cases out there.
> Although to test it you would probably have to have a congested network
> to really be able to show much of a benefit.
>
> With the pktgen I would be interested in seeing the Qdisc dropped
> numbers for with vs without this patch. I would consider something like
> that comparable to us doing an XDP_DROP test since all we are talking
> about is a synthetic benchmark.
The pktgen script outputs how many packets it has transmitted, but from
the above we know that most of these packets are actually getting
dropped, as only 256Kpps are reaching the wire.
Result line from pktgen script: count 100000000 (60byte,0frags)
- Unpatched kernel: 2396594pps 1150Mb/sec (1150365120bps) errors: 1417469
- Patched kernel : 2479970pps 1190Mb/sec (1190385600bps) errors: 1422753
Difference:
* +83376 pps (2479970-2396594)
* -14 nanosec per packet ((1/2479970-1/2396594)*10^9)
The patched kernel is faster, by around the expected gain from using the
kmem_cache bulking API (slightly more, actually).
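For reference, the comparison can be recomputed from the two pktgen
result lines. The sketch below only uses numbers from those lines; the
relative speedup is derived here for illustration, and the bps check
assumes pktgen reports bps as pps times the 60-byte packet size times 8.

```python
PKT_BYTES = 60             # from the pktgen result line (60byte,0frags)
unpatched_pps = 2396594
patched_pps = 2479970


def ns_per_pkt(pps):
    """Per-packet time for a given packet rate, in nanoseconds."""
    return 1e9 / pps


delta_pps = patched_pps - unpatched_pps
delta_ns = ns_per_pkt(patched_pps) - ns_per_pkt(unpatched_pps)
speedup_pct = 100 * delta_pps / unpatched_pps   # derived, for illustration

# Sanity check against the bps values printed by pktgen.
unpatched_bps = unpatched_pps * PKT_BYTES * 8
patched_bps = patched_pps * PKT_BYTES * 8

print(f"{delta_pps:+d} pps, {delta_ns:+.2f} ns per packet, {speedup_pct:.2f}%")
```

The per-packet saving of about 14 ns sits close to the 12.627 ns
best-case figure from the slab microbenchmark.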
More raw data and notes for this email are available in [3]:
[3]
https://github.com/xdp-project/xdp-project/blob/master/areas/mem/kfree_skb_list01.org
>>
>>>> kfree_skb_list_bulk() ?
>>>
>>> Hmm, IMHO not really worth changing the function name. The
>>> kfree_skb_list() is called in more places, (than qdisc enqueue-overflow
>>> case), which automatically benefits if we keep the function name
>>> kfree_skb_list().
>>
>> To be clear - I was suggesting a simple
>> s/kfree_skb_defer_local/kfree_skb_list_bulk/
>> on the patch, just renaming the static helper.
>>
Okay, I get it now. But I disagree, for the same reason Alex gives below.
>> IMO now that we have multiple freeing optimizations using "defer" for
>> the TCP scheme and "bulk" for your prior slab bulk optimizations would
>> improve clarity.
>
> Rather than defer_local would it maybe make more sense to look at
> naming it something like "kfree_skb_add_bulk"? Basically we are
> building onto the list of buffers to free so I figure something like an
> "add" or "append" would make sense.
>
I agree with Alex that we are building up buffers to be freed *later*,
thus we should somehow reflect that in the naming.
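To illustrate the "add now, free later" pattern behind the naming
discussion, here is a rough user-space sketch. It is not the kernel
code: BulkFreeBatch, free_list and the free_bulk callback are
hypothetical names, and the batch size of 16 is an assumption borrowed
from the bulk size used in the microbenchmark above.

```python
BULK_SIZE = 16  # assumption: same bulk size as the microbenchmark


class BulkFreeBatch:
    """Hypothetical stand-in for an array of objects awaiting a bulk free."""

    def __init__(self, free_bulk):
        self.pending = []
        self.free_bulk = free_bulk   # stand-in for a bulk free call

    def add(self, obj):
        # Only queue the object; the actual free happens later, in bulk.
        self.pending.append(obj)
        if len(self.pending) == BULK_SIZE:
            self.flush()

    def flush(self):
        if self.pending:
            self.free_bulk(self.pending)
            self.pending = []


def free_list(objs, free_bulk):
    """Walk a list of objects, freeing them in batches."""
    batch = BulkFreeBatch(free_bulk)
    for obj in objs:
        batch.add(obj)
    batch.flush()   # free whatever is left in a partial batch


calls = []
free_list(range(40), lambda batch: calls.append(len(batch)))
print(calls)  # [16, 16, 8]
```

The add() step does no freeing by itself unless the batch fills up,
which is the behaviour a name containing "add" or "append" would signal.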
--Jesper