[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250530200735.hhzeicomnb7mbwdl@gmail.com>
Date: Fri, 30 May 2025 13:07:35 -0700
From: John Fastabend <john.fastabend@...il.com>
To: Cong Wang <xiyou.wangcong@...il.com>
Cc: netdev@...r.kernel.org, bpf@...r.kernel.org, zhoufeng.zf@...edance.com,
jakub@...udflare.com, zijianzhang@...edance.com,
Amery Hung <amery.hung@...edance.com>,
Cong Wang <cong.wang@...edance.com>
Subject: Re: [Patch bpf-next v3 4/4] tcp_bpf: improve ingress redirection
performance with message corking
On 2025-05-19 13:36:28, Cong Wang wrote:
> From: Zijian Zhang <zijianzhang@...edance.com>
>
> The TCP_BPF ingress redirection path currently lacks the message corking
> mechanism found in standard TCP. This causes the sender to wake up the
> receiver for every message, even when messages are small, resulting in
> reduced throughput compared to regular TCP in certain scenarios.
>
> This change introduces a kernel worker-based intermediate layer to provide
> automatic message corking for TCP_BPF. While this adds a slight latency
> overhead, it significantly improves overall throughput by reducing
> unnecessary wake-ups and reducing the sock lock contention.
>
> Reviewed-by: Amery Hung <amery.hung@...edance.com>
> Co-developed-by: Cong Wang <cong.wang@...edance.com>
> Signed-off-by: Cong Wang <cong.wang@...edance.com>
> Signed-off-by: Zijian Zhang <zijianzhang@...edance.com>
> ---
> include/linux/skmsg.h | 19 ++++
> net/core/skmsg.c | 139 ++++++++++++++++++++++++++++-
> net/ipv4/tcp_bpf.c | 197 ++++++++++++++++++++++++++++++++++++++++--
> 3 files changed, 347 insertions(+), 8 deletions(-)
[...]
> + /* At this point, the data has been handled well. If one of the
> + * following conditions is met, we can notify the peer socket in
> + * the context of this system call immediately.
> + * 1. If the write buffer has been used up;
> + * 2. Or, the message size is larger than TCP_BPF_GSO_SIZE;
> + * 3. Or, the ingress queue was empty;
> + * 4. Or, the tcp socket is set to no_delay.
> + * Otherwise, kick off the backlog work so that we can have some
> + * time to wait for any incoming messages before sending a
> + * notification to the peer socket.
> + */
OK this series looks like it should work to me. See one small comment
below. Also from the perf numbers in the cover letter is the latency
difference reduced/removed if the socket is set to no_delay?
> + nonagle = tcp_sk(sk)->nonagle;
> + if (!sk_stream_memory_free(sk) ||
> + tot_size >= TCP_BPF_GSO_SIZE || ingress_msg_empty ||
> + (!(nonagle & TCP_NAGLE_CORK) && (nonagle & TCP_NAGLE_OFF))) {
> + release_sock(sk);
> + psock->backlog_work_delayed = false;
> + sk_psock_backlog_msg(psock);
> + lock_sock(sk);
> + } else {
> + sk_psock_run_backlog_work(psock, false);
> + }
> +
> +error:
> + sk_psock_put(sk_redir, psock);
> + return ret;
> +}
> +
> static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
> struct sk_msg *msg, int *copied, int flags)
> {
> @@ -442,18 +619,24 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
> cork = true;
> psock->cork = NULL;
> }
> - release_sock(sk);
>
> - origsize = msg->sg.size;
> - ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
> - msg, tosend, flags);
> - sent = origsize - msg->sg.size;
> + if (redir_ingress) {
> + ret = tcp_bpf_ingress_backlog(sk, sk_redir, msg, tosend);
> + } else {
> + release_sock(sk);
> +
> + origsize = msg->sg.size;
> + ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
> + msg, tosend, flags);
nit, we can drop redir ingress at this point from tcp_bpf_sendmsg_redir?
It no longer handles ingress? A follow up patch would probably be fine.
> + sent = origsize - msg->sg.size;
> +
> + lock_sock(sk);
> + sk_mem_uncharge(sk, sent);
> + }
>
> if (eval == __SK_REDIRECT)
> sock_put(sk_redir);
Thanks.
Powered by blists - more mailing lists