Message-ID: <509939c4-2e3e-41a6-888f-cbbf6d4c93cb@bytedance.com>
Date: Thu, 3 Jul 2025 10:17:09 +0800
From: Zijian Zhang <zijianzhang@...edance.com>
To: Jakub Sitnicki <jakub@...udflare.com>,
 Cong Wang <xiyou.wangcong@...il.com>
Cc: netdev@...r.kernel.org, bpf@...r.kernel.org, john.fastabend@...il.com,
 zhoufeng.zf@...edance.com, Amery Hung <amery.hung@...edance.com>,
 Cong Wang <cong.wang@...edance.com>
Subject: Re: [Patch bpf-next v4 4/4] tcp_bpf: improve ingress redirection
 performance with message corking

On 7/2/25 8:17 PM, Jakub Sitnicki wrote:
> On Mon, Jun 30, 2025 at 06:12 PM -07, Cong Wang wrote:
>> From: Zijian Zhang <zijianzhang@...edance.com>
>>
>> The TCP_BPF ingress redirection path currently lacks the message corking
>> mechanism found in standard TCP. This causes the sender to wake up the
>> receiver for every message, even when messages are small, resulting in
>> reduced throughput compared to regular TCP in certain scenarios.
> 
> I'm curious what scenarios are you referring to? Is it send-to-local or
> ingress-to-local? [1]
> 

Thanks for your attention and the detailed review!
We are referring to the "send-to-local" case here.

> If the sender is emitting small messages, that's probably intended -
> that is they likely want to get the message across as soon as possible,
> because they must have disabled the Nagle algo (set TCP_NODELAY) to do
> that.
> 
> Otherwise, you get small segment merging on the sender side by default.
> And if MTU is a limiting factor, you should also be getting batching
> from GRO.
> 
> What I'm getting at is that I don't quite follow why you don't see
> sufficient batching before the sockmap redirect today?
> 

IMHO,

In the "send-to-local" case, both the sender and receiver sockets are
added to the sockmap. Their protocol ops are replaced so that sendmsg()
invokes tcp_bpf_sendmsg() instead of tcp_sendmsg(). On this path, the
whole process consists of building a skmsg and moving it to the receiver
socket's ingress queue immediately. No sk_buff is generated, so we
cannot benefit from the TCP stack's optimizations: small segments are
not merged by default. That is why I am implementing skmsg coalescing
here.

>> This change introduces a kernel worker-based intermediate layer to provide
>> automatic message corking for TCP_BPF. While this adds a slight latency
>> overhead, it significantly improves overall throughput by reducing
>> unnecessary wake-ups and reducing the sock lock contention.
> 
> "Slight" for a +5% increase in latency is an understatement :-)
> 
> IDK about this being always on for every socket. For send-to-local
> [1], sk_msg redirs can be viewed as a form of IPC, where latency
> matters.
> 
> I do understand that you're trying to optimize for bulk-transfer
> workloads, but please consider also request-response workloads.
> 
> [1] https://github.com/jsitnicki/kubecon-2024-sockmap/blob/main/cheatsheet-sockmap-redirect.png
> 

Totally understand that request-response workloads are also very
important.

Here are my thoughts:

I had an idea earlier: when the user sets TCP_NODELAY, we could simply
follow the original code path. However, I'm concerned about one
scenario: if a user sends some messages and then sets TCP_NODELAY to
send another, the messages handed to the kworker may not yet have
reached ingress_msg (e.g. due to delayed kworker scheduling) while the
TCP_NODELAY messages have already arrived. This would disrupt the
ordering, and since this path has no TCP packet formation or
retransmission machinery, once the order is disrupted it can never
recover.

As a result, I propose:

- When the user sets TCP_NODELAY, introduce a wait (I haven't determined
the exact location yet; maybe in tcp_bpf_sendmsg) to ensure all messages
queued for the kworker have been delivered before proceeding, then
follow the original path for transmission.

- When the user switches back from TCP_NODELAY to the default, it's less
of an issue: we can simply follow our current code path.

If the 5% latency regression is not a blocker for this patchset, and the
solution above looks good to you, we will implement it in the next
version.

>> Reviewed-by: Amery Hung <amery.hung@...edance.com>
>> Co-developed-by: Cong Wang <cong.wang@...edance.com>
>> Signed-off-by: Cong Wang <cong.wang@...edance.com>
>> Signed-off-by: Zijian Zhang <zijianzhang@...edance.com>
>> ---

