netdev - Re: [Patch bpf-next v3 4/4] tcp_bpf: improve ingress redirection performance with message corking

Message-ID: <9e167af1-1265-4427-806e-67eac349cbf3@bytedance.com>
Date: Fri, 30 May 2025 13:37:37 -0700
From: Zijian Zhang <zijianzhang@...edance.com>
To: John Fastabend <john.fastabend@...il.com>,
 Cong Wang <xiyou.wangcong@...il.com>
Cc: netdev@...r.kernel.org, bpf@...r.kernel.org, zhoufeng.zf@...edance.com,
 jakub@...udflare.com, Amery Hung <amery.hung@...edance.com>,
 Cong Wang <cong.wang@...edance.com>
Subject: Re: [Patch bpf-next v3 4/4] tcp_bpf: improve ingress redirection
 performance with message corking

On 5/30/25 1:07 PM, John Fastabend wrote:
> On 2025-05-19 13:36:28, Cong Wang wrote:
>> From: Zijian Zhang <zijianzhang@...edance.com>
>>
>> The TCP_BPF ingress redirection path currently lacks the message corking
>> mechanism found in standard TCP. This causes the sender to wake up the
>> receiver for every message, even when messages are small, resulting in
>> reduced throughput compared to regular TCP in certain scenarios.
>>
>> This change introduces a kernel worker-based intermediate layer to provide
>> automatic message corking for TCP_BPF. While this adds a slight latency
>> overhead, it significantly improves overall throughput by reducing
>> unnecessary wake-ups and reducing the sock lock contention.
>>
>> Reviewed-by: Amery Hung <amery.hung@...edance.com>
>> Co-developed-by: Cong Wang <cong.wang@...edance.com>
>> Signed-off-by: Cong Wang <cong.wang@...edance.com>
>> Signed-off-by: Zijian Zhang <zijianzhang@...edance.com>
>> ---
>>   include/linux/skmsg.h |  19 ++++
>>   net/core/skmsg.c      | 139 ++++++++++++++++++++++++++++-
>>   net/ipv4/tcp_bpf.c    | 197 ++++++++++++++++++++++++++++++++++++++++--
>>   3 files changed, 347 insertions(+), 8 deletions(-)
> 
> [...]
> 
>> +	/* At this point, the data has been handled well. If one of the
>> +	 * following conditions is met, we can notify the peer socket in
>> +	 * the context of this system call immediately.
>> +	 * 1. If the write buffer has been used up;
>> +	 * 2. Or, the message size is larger than TCP_BPF_GSO_SIZE;
>> +	 * 3. Or, the ingress queue was empty;
>> +	 * 4. Or, the tcp socket is set to no_delay.
>> +	 * Otherwise, kick off the backlog work so that we can have some
>> +	 * time to wait for any incoming messages before sending a
>> +	 * notification to the peer socket.
>> +	 */
> 
> 
> OK this series looks like it should work to me. See one small comment
> below. Also from the perf numbers in the cover letter is the latency
> difference reduced/removed if the socket is set to no_delay?
> 

Even with the socket set to no_delay, a minor latency difference remains.
The main reason is that we now have a dynamic skmsg allocation and a
kworker in the middle, so the path is more complex.

>> +	nonagle = tcp_sk(sk)->nonagle;
>> +	if (!sk_stream_memory_free(sk) ||
>> +	    tot_size >= TCP_BPF_GSO_SIZE || ingress_msg_empty ||
>> +	    (!(nonagle & TCP_NAGLE_CORK) && (nonagle & TCP_NAGLE_OFF))) {
>> +		release_sock(sk);
>> +		psock->backlog_work_delayed = false;
>> +		sk_psock_backlog_msg(psock);
>> +		lock_sock(sk);
>> +	} else {
>> +		sk_psock_run_backlog_work(psock, false);
>> +	}
>> +
>> +error:
>> +	sk_psock_put(sk_redir, psock);
>> +	return ret;
>> +}
>> +
>>   static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
>>   				struct sk_msg *msg, int *copied, int flags)
>>   {
>> @@ -442,18 +619,24 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
>>   			cork = true;
>>   			psock->cork = NULL;
>>   		}
>> -		release_sock(sk);
>>   
>> -		origsize = msg->sg.size;
>> -		ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
>> -					    msg, tosend, flags);
>> -		sent = origsize - msg->sg.size;
>> +		if (redir_ingress) {
>> +			ret = tcp_bpf_ingress_backlog(sk, sk_redir, msg, tosend);
>> +		} else {
>> +			release_sock(sk);
>> +
>> +			origsize = msg->sg.size;
>> +			ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
>> +						    msg, tosend, flags);
> 
> nit, we can drop redir ingress at this point from tcp_bpf_sendmsg_redir?
> It no longer handles ingress? A follow up patch would probably be fine.
> 

Indeed, we will do this in a follow-up patch.

>> +			sent = origsize - msg->sg.size;
>> +
>> +			lock_sock(sk);
>> +			sk_mem_uncharge(sk, sent);
>> +		}
>>   
>>   		if (eval == __SK_REDIRECT)
>>   			sock_put(sk_redir);
> 
> Thanks.

Thanks for the review!
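
For readers following the thread, the notify-now-or-defer decision quoted
above can be restated as a standalone userspace predicate. This is only a
sketch under stated assumptions, not the kernel code: struct flush_inputs
and should_notify_now() are hypothetical names, wmem_free stands in for
sk_stream_memory_free(sk), and gso_size stands in for TCP_BPF_GSO_SIZE,
whose value is not shown in this message. The two TCP_NAGLE_* values are
copied from include/net/tcp.h.

```c
#include <stdbool.h>
#include <stddef.h>

/* Same values as include/net/tcp.h, redefined so this builds in userspace. */
#define TCP_NAGLE_OFF  1 /* Nagle disabled (TCP_NODELAY set) */
#define TCP_NAGLE_CORK 2 /* socket is corked (TCP_CORK set) */

/* Hypothetical container for the inputs the quoted condition reads. */
struct flush_inputs {
	bool wmem_free;          /* stand-in for sk_stream_memory_free(sk) */
	size_t tot_size;         /* bytes queued by this call */
	size_t gso_size;         /* stand-in for TCP_BPF_GSO_SIZE (assumed) */
	bool ingress_msg_empty;  /* peer ingress queue was empty */
	unsigned int nonagle;    /* copy of tcp_sk(sk)->nonagle */
};

/*
 * Returns true when the peer should be notified in the context of the
 * current system call, false when the delayed backlog work should be
 * scheduled instead so that later messages can be batched.
 */
static bool should_notify_now(const struct flush_inputs *in)
{
	/* no_delay counts only when the socket is not corked as well. */
	bool nodelay = !(in->nonagle & TCP_NAGLE_CORK) &&
		       (in->nonagle & TCP_NAGLE_OFF);

	return !in->wmem_free ||                  /* 1. write buffer used up */
	       in->tot_size >= in->gso_size ||    /* 2. message large enough */
	       in->ingress_msg_empty ||           /* 3. ingress queue was empty */
	       nodelay;                           /* 4. socket set to no_delay */
}
```

With this predicate, a small message sent on a socket with free write
memory, a non-empty peer ingress queue, and nonagle == 0 is deferred,
while setting TCP_NAGLE_OFF alone makes it notify immediately. Setting
both TCP_NAGLE_OFF and TCP_NAGLE_CORK defers again, because the cork bit
takes precedence in the quoted check.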