[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <9e167af1-1265-4427-806e-67eac349cbf3@bytedance.com>
Date: Fri, 30 May 2025 13:37:37 -0700
From: Zijian Zhang <zijianzhang@...edance.com>
To: John Fastabend <john.fastabend@...il.com>,
Cong Wang <xiyou.wangcong@...il.com>
Cc: netdev@...r.kernel.org, bpf@...r.kernel.org, zhoufeng.zf@...edance.com,
jakub@...udflare.com, Amery Hung <amery.hung@...edance.com>,
Cong Wang <cong.wang@...edance.com>
Subject: Re: [Patch bpf-next v3 4/4] tcp_bpf: improve ingress redirection
performance with message corking
On 5/30/25 1:07 PM, John Fastabend wrote:
> On 2025-05-19 13:36:28, Cong Wang wrote:
>> From: Zijian Zhang <zijianzhang@...edance.com>
>>
>> The TCP_BPF ingress redirection path currently lacks the message corking
>> mechanism found in standard TCP. This causes the sender to wake up the
>> receiver for every message, even when messages are small, resulting in
>> reduced throughput compared to regular TCP in certain scenarios.
>>
>> This change introduces a kernel worker-based intermediate layer to provide
>> automatic message corking for TCP_BPF. While this adds a slight latency
>> overhead, it significantly improves overall throughput by reducing
>> unnecessary wake-ups and reducing the sock lock contention.
>>
>> Reviewed-by: Amery Hung <amery.hung@...edance.com>
>> Co-developed-by: Cong Wang <cong.wang@...edance.com>
>> Signed-off-by: Cong Wang <cong.wang@...edance.com>
>> Signed-off-by: Zijian Zhang <zijianzhang@...edance.com>
>> ---
>> include/linux/skmsg.h | 19 ++++
>> net/core/skmsg.c | 139 ++++++++++++++++++++++++++++-
>> net/ipv4/tcp_bpf.c | 197 ++++++++++++++++++++++++++++++++++++++++--
>> 3 files changed, 347 insertions(+), 8 deletions(-)
>
> [...]
>
>> + /* At this point, the data has been handled well. If one of the
>> + * following conditions is met, we can notify the peer socket in
>> + * the context of this system call immediately.
>> + * 1. If the write buffer has been used up;
>> + * 2. Or, the message size is larger than TCP_BPF_GSO_SIZE;
>> + * 3. Or, the ingress queue was empty;
>> + * 4. Or, the tcp socket is set to no_delay.
>> + * Otherwise, kick off the backlog work so that we can have some
>> + * time to wait for any incoming messages before sending a
>> + * notification to the peer socket.
>> + */
>
>
> OK this series looks like it should work to me. See one small comment
> below. Also from the perf numbers in the cover letter is the latency
> difference reduced/removed if the socket is set to no_delay?
>
Even if the socket is set to no_delay, we still have minor latency diff.
The main reason is that we now have dynamic allocation for skmsg and
kworker in the middle, the path is more complex now.
>> + nonagle = tcp_sk(sk)->nonagle;
>> + if (!sk_stream_memory_free(sk) ||
>> + tot_size >= TCP_BPF_GSO_SIZE || ingress_msg_empty ||
>> + (!(nonagle & TCP_NAGLE_CORK) && (nonagle & TCP_NAGLE_OFF))) {
>> + release_sock(sk);
>> + psock->backlog_work_delayed = false;
>> + sk_psock_backlog_msg(psock);
>> + lock_sock(sk);
>> + } else {
>> + sk_psock_run_backlog_work(psock, false);
>> + }
>> +
>> +error:
>> + sk_psock_put(sk_redir, psock);
>> + return ret;
>> +}
>> +
>> static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
>> struct sk_msg *msg, int *copied, int flags)
>> {
>> @@ -442,18 +619,24 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
>> cork = true;
>> psock->cork = NULL;
>> }
>> - release_sock(sk);
>>
>> - origsize = msg->sg.size;
>> - ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
>> - msg, tosend, flags);
>> - sent = origsize - msg->sg.size;
>> + if (redir_ingress) {
>> + ret = tcp_bpf_ingress_backlog(sk, sk_redir, msg, tosend);
>> + } else {
>> + release_sock(sk);
>> +
>> + origsize = msg->sg.size;
>> + ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
>> + msg, tosend, flags);
>
> nit, we can drop redir ingress at this point from tcp_bpf_sendmsg_redir?
> It no longer handles ingress? A follow up patch would probably be fine.
>
Indeed, we will do this in a follow up patch.
>> + sent = origsize - msg->sg.size;
>> +
>> + lock_sock(sk);
>> + sk_mem_uncharge(sk, sent);
>> + }
>>
>> if (eval == __SK_REDIRECT)
>> sock_put(sk_redir);
>
> Thanks.
Thanks for the review!
Powered by blists - more mailing lists