netdev - Re: [Patch bpf-next v4 4/4] tcp_bpf: improve ingress redirection performance with message corking

Message-ID: <87ecuyn5x2.fsf@cloudflare.com>
Date: Wed, 02 Jul 2025 14:17:13 +0200
From: Jakub Sitnicki <jakub@...udflare.com>
To: Cong Wang <xiyou.wangcong@...il.com>, zijianzhang@...edance.com
Cc: netdev@...r.kernel.org,  bpf@...r.kernel.org,  john.fastabend@...il.com,
  zhoufeng.zf@...edance.com,  Amery Hung <amery.hung@...edance.com>,  Cong
 Wang <cong.wang@...edance.com>
Subject: Re: [Patch bpf-next v4 4/4] tcp_bpf: improve ingress redirection
 performance with message corking

On Mon, Jun 30, 2025 at 06:12 PM -07, Cong Wang wrote:
> From: Zijian Zhang <zijianzhang@...edance.com>
>
> The TCP_BPF ingress redirection path currently lacks the message corking
> mechanism found in standard TCP. This causes the sender to wake up the
> receiver for every message, even when messages are small, resulting in
> reduced throughput compared to regular TCP in certain scenarios.

I'm curious which scenarios you are referring to. Is it send-to-local or
ingress-to-local? [1]

If the sender is emitting small messages, that's probably intended.
That is, they likely want to get the message across as soon as possible,
because they must have disabled the Nagle algorithm (set TCP_NODELAY) to
do that.

Otherwise, you get small segment merging on the sender side by default.
And if MTU is a limiting factor, you should also be getting batching
from GRO.
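
To make the TCP_NODELAY point concrete, here is a minimal userspace
sketch of how a sender opts out of Nagle. The helper names set_nodelay()
and get_nodelay() are made up for illustration; only setsockopt(),
getsockopt() and the TCP_NODELAY option are real API. A fresh TCP socket
has the option cleared, which is the default merging case described
above.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: turn TCP_NODELAY on (1) or off (0) for a TCP
 * socket. With the option on, Nagle is disabled and small writes are
 * sent without waiting to be merged. Returns 0 on success, -1 on error. */
static int set_nodelay(int fd, int on)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}

/* Hypothetical helper: read the option back. Returns 1 if Nagle is
 * disabled, 0 if it is enabled (the default), -1 on error. */
static int get_nodelay(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) < 0)
		return -1;
	return val != 0;
}
```

A sender that never calls something like set_nodelay(fd, 1) keeps Nagle
enabled, so its small writes can already be coalesced before they reach
the sockmap redirect.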

What I'm getting at is that I don't quite follow why you don't see
sufficient batching before the sockmap redirect today.

> This change introduces a kernel worker-based intermediate layer to provide
> automatic message corking for TCP_BPF. While this adds a slight latency
> overhead, it significantly improves overall throughput by reducing
> unnecessary wake-ups and reducing the sock lock contention.

"Slight" for a +5% increase in latency is an understatement :-)

I'm not sure about this being always on for every socket. For
send-to-local [1], sk_msg redirects can be viewed as a form of IPC,
where latency matters.

I do understand that you're trying to optimize for bulk-transfer
workloads, but please consider also request-response workloads.

[1] https://github.com/jsitnicki/kubecon-2024-sockmap/blob/main/cheatsheet-sockmap-redirect.png

> Reviewed-by: Amery Hung <amery.hung@...edance.com>
> Co-developed-by: Cong Wang <cong.wang@...edance.com>
> Signed-off-by: Cong Wang <cong.wang@...edance.com>
> Signed-off-by: Zijian Zhang <zijianzhang@...edance.com>
> ---