Message-ID: <7b657a5d-0a79-49da-89fa-2d5e06e0cabb@bytedance.com>
Date: Thu, 3 Jul 2025 09:48:36 +0800
From: Zijian Zhang <zijianzhang@...edance.com>
To: Jakub Sitnicki <jakub@...udflare.com>,
 Cong Wang <xiyou.wangcong@...il.com>
Cc: netdev@...r.kernel.org, bpf@...r.kernel.org, john.fastabend@...il.com,
 zhoufeng.zf@...edance.com
Subject: Re: [Patch bpf-next v4 0/4] tcp_bpf: improve ingress redirection
 performance with message corking

On 7/2/25 6:22 PM, Jakub Sitnicki wrote:
> On Mon, Jun 30, 2025 at 06:11 PM -07, Cong Wang wrote:
>> This patchset improves skmsg ingress redirection performance by a)
>> sophisticated batching with kworker; b) skmsg allocation caching with
>> kmem cache.
>>
>> As a result, our patches significantly outperform the vanilla kernel
>> in throughput for almost all packet sizes. The throughput improvement
>> ranges from 3.13% to 160.92%, with smaller packets showing the largest
>> gains.
>>
>> For latency, the patches induce slightly higher latency across most
>> packet sizes compared to the vanilla kernel, which is expected as a
>> natural side effect of batching.
>>
>> Here are the detailed benchmarks:
>>
>> +-------------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
>> | Throughput  | 64     | 128    | 256    | 512    | 1k     | 4k     | 16k    | 32k    | 64k    | 128k   | 256k   |
>> +-------------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
>> | Vanilla     | 0.17±0.02 | 0.36±0.01 | 0.72±0.02 | 1.37±0.05 | 2.60±0.12 | 8.24±0.44 | 22.38±2.02 | 25.49±1.28 | 43.07±1.36 | 66.87±4.14 | 73.70±7.15 |
>> | Patched     | 0.41±0.01 | 0.82±0.02 | 1.62±0.05 | 3.33±0.01 | 6.45±0.02 | 21.50±0.08 | 46.22±0.31 | 50.20±1.12 | 45.39±1.29 | 68.96±1.12 | 78.35±1.49 |
>> | Percentage  | 141.18%   | 127.78%   | 125.00%   | 143.07%   | 148.08%   | 160.92%   | 106.52%    | 97.00%     | 5.38%      | 3.13%      | 6.32%      |
>> +-------------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
> 
> That's a bit easier to read when aligned:
> 
> | Throughput |    64     |    128    |    256    |    512    |    1k     |     4k     |    16k     |    32k     |    64k     |    128k    |    256k    |
> |------------+-----------+-----------+-----------+-----------+-----------+------------+------------+------------+------------+------------+------------|
> |    Vanilla | 0.17±0.02 | 0.36±0.01 | 0.72±0.02 | 1.37±0.05 | 2.60±0.12 | 8.24±0.44  | 22.38±2.02 | 25.49±1.28 | 43.07±1.36 | 66.87±4.14 | 73.70±7.15 |
> |    Patched | 0.41±0.01 | 0.82±0.02 | 1.62±0.05 | 3.33±0.01 | 6.45±0.02 | 21.50±0.08 | 46.22±0.31 | 50.20±1.12 | 45.39±1.29 | 68.96±1.12 | 78.35±1.49 |
> | Percentage |  141.18%  |  127.78%  |  125.00%  |  143.07%  |  148.08%  |  160.92%   |  106.52%   |   97.00%   |   5.38%    |   3.13%    |   6.32%    |
> 

Thanks for the suggestion!

>>
>> +-------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
>> | Latency     | 64         | 128        | 256        | 512        | 1k         | 4k         | 16k        | 32k        | 63k        |
>> +-------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
>> | Vanilla     | 5.80±4.02  | 5.83±3.61  | 5.86±4.10  | 5.91±4.19  | 5.98±4.14  | 6.61±4.47  | 8.60±2.59  | 10.96±5.50 | 15.02±6.78 |
>> | Patched     | 6.18±3.03  | 6.23±4.38  | 6.25±4.44  | 6.13±4.35  | 6.32±4.23  | 6.94±4.61  | 8.90±5.49  | 11.12±6.10 | 14.88±6.55 |
>> | Percentage  | 6.55%      | 6.87%      | 6.66%      | 3.72%      | 5.68%      | 4.99%      | 3.49%      | 1.46%      | -0.93%     |
>> +-------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
> 
> What are throughput and latency units here?
> 
> Which microbenchmark was used?

Let me add some details here:

# Throughput test: iperf 3.18 (cJSON 1.7.15)
# unit is Gbits/sec
iperf3 -4 -s
iperf3 -4 -c $local_host -l $buffer_length

During this process, some metadata is exchanged between server and
client over TCP, which also verifies the data integrity of the patched
code path.
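To sweep all of the message sizes from the throughput table, the client
command can be driven by a small loop. This is only a sketch based on
the commands above: the loopback host and the size list are assumptions,
and it prints the commands as a dry run (pipe to sh to execute against a
running "iperf3 -4 -s" server).

```shell
# Assumption: sweep iperf3 over the message sizes from the table above.
# Printed as a dry run so it can be inspected before execution; a server
# started with "iperf3 -4 -s" must already be running.
local_host=127.0.0.1
for buffer_length in 64 128 256 512 1K 4K 16K 32K 64K 128K 256K; do
    echo "iperf3 -4 -c $local_host -l $buffer_length"
done
```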

# Latency Test: sockperf, version 3.10-31
# unit is us
sockperf server -i $local_host --tcp --daemonize
sockperf ping-pong -i $local_host --tcp --time 10 \
    --sender-affinity 0 --receiver-affinity 1
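For completeness, the Percentage rows in the tables above match
(Patched - Vanilla) / Vanilla * 100; e.g. for the 64-byte throughput
numbers:

```shell
# Relative improvement for the 64-byte throughput row:
# (0.41 - 0.17) / 0.17 * 100
awk 'BEGIN { printf "%.2f%%\n", (0.41 - 0.17) / 0.17 * 100 }'
# prints 141.18%
```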
