Message-ID: <dffb3057-40cf-463b-a114-9c9c3770f09c@bytedance.com>
Date: Wed, 12 Mar 2025 11:20:01 -0700
From: Zijian Zhang <zijianzhang@...edance.com>
To: John Fastabend <john.fastabend@...il.com>,
Cong Wang <xiyou.wangcong@...il.com>
Cc: netdev@...r.kernel.org, bpf@...r.kernel.org, jakub@...udflare.com,
zhoufeng.zf@...edance.com, Amery Hung <amery.hung@...edance.com>,
Cong Wang <cong.wang@...edance.com>
Subject: Re: [Patch bpf-next v2 4/4] tcp_bpf: improve ingress redirection
performance with message corking
On 3/11/25 1:54 PM, John Fastabend wrote:
> On 2025-03-06 14:02:05, Cong Wang wrote:
>> From: Zijian Zhang <zijianzhang@...edance.com>
[...]
>> +static int tcp_bpf_ingress_backlog(struct sock *sk, struct sock *sk_redir,
>> + struct sk_msg *msg, u32 apply_bytes)
>> +{
>> + bool ingress_msg_empty = false;
>> + bool apply = apply_bytes;
>> + struct sk_psock *psock;
>> + struct sk_msg *tmp;
>> + u32 tot_size = 0;
>> + int ret = 0;
>> + u8 nonagle;
>> +
>> + psock = sk_psock_get(sk_redir);
>> + if (unlikely(!psock))
>> + return -EPIPE;
>> +
>> + spin_lock(&psock->backlog_msg_lock);
>> + /* If possible, coalesce the curr sk_msg to the last sk_msg from the
>> + * psock->backlog_msg.
>> + */
>> + if (!list_empty(&psock->backlog_msg)) {
>> + struct sk_msg *last;
>> +
>> + last = list_last_entry(&psock->backlog_msg, struct sk_msg, list);
>> + if (last->sk == sk) {
>> + int i = tcp_bpf_coalesce_msg(last, msg, &apply_bytes,
>> + &tot_size);
>> +
>> + if (i == msg->sg.end || (apply && !apply_bytes))
>> + goto out_unlock;
>> + }
>> + }
>> +
>> + /* Otherwise, allocate a new sk_msg and transfer the data from the
>> + * passed in msg to it.
>> + */
>> + tmp = sk_msg_alloc(GFP_ATOMIC);
>> + if (!tmp) {
>> + ret = -ENOMEM;
>> + spin_unlock(&psock->backlog_msg_lock);
>> + goto error;
>> + }
>> +
>> + tmp->sk = sk;
>> + sock_hold(tmp->sk);
>> + tmp->sg.start = msg->sg.start;
>> + tcp_bpf_xfer_msg(tmp, msg, &apply_bytes, &tot_size);
>> +
>> + ingress_msg_empty = list_empty(&psock->ingress_msg);
>> + list_add_tail(&tmp->list, &psock->backlog_msg);
>> +
>> +out_unlock:
>> + spin_unlock(&psock->backlog_msg_lock);
>> + sk_wmem_queued_add(sk, tot_size);
>> +
>> + /* At this point, the data has been handled well. If one of the
>> + * following conditions is met, we can notify the peer socket in
>> + * the context of this system call immediately.
>> + * 1. If the write buffer has been used up;
>> + * 2. Or, the message size is larger than TCP_BPF_GSO_SIZE;
>> + * 3. Or, the ingress queue was empty;
>> + * 4. Or, the tcp socket is set to no_delay.
>> + * Otherwise, kick off the backlog work so that we can have some
>> + * time to wait for any incoming messages before sending a
>> + * notification to the peer socket.
>> + */
>
> I think this could also be used to get bpf_msg_cork_bytes working
> directly in the receive path. This also means we can avoid using
> strparser in the receive path. The strparser case has noticeable
> overhead for us, significant enough that we don't use it.
> Not that we need to do it all in one patch set.
>
Sounds promising!
>> + nonagle = tcp_sk(sk)->nonagle;
>> + if (!sk_stream_memory_free(sk) ||
>> + tot_size >= TCP_BPF_GSO_SIZE || ingress_msg_empty ||
>> + (!(nonagle & TCP_NAGLE_CORK) && (nonagle & TCP_NAGLE_OFF))) {
>> + release_sock(sk);
>> + psock->backlog_work_delayed = false;
>> + sk_psock_backlog_msg(psock);
>> + lock_sock(sk);
>> + } else {
>> + sk_psock_run_backlog_work(psock, false);
>> + }
>> +
>> +error:
>> + sk_psock_put(sk_redir, psock);
>> + return ret;
>> +}
>> +
>> static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
>> struct sk_msg *msg, int *copied, int flags)
>> {
>> @@ -442,18 +619,24 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
>> cork = true;
>> psock->cork = NULL;
>> }
>> - release_sock(sk);
>>
>> - origsize = msg->sg.size;
>> - ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
>> - msg, tosend, flags);
>
> The only sticky bit blocking us from folding this entire
> tcp_bpf_sendmsg_redir logic out is the tls user, right?
>
Right, tls also uses tcp_bpf_sendmsg_redir.
>> - sent = origsize - msg->sg.size;
>> + if (redir_ingress) {
>> + ret = tcp_bpf_ingress_backlog(sk, sk_redir, msg, tosend);
>> + } else {
>> + release_sock(sk);
>> +
>> + origsize = msg->sg.size;
>> + ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
>> + msg, tosend, flags);
>
> Now sendmsg redir is really only for egress here, so we can skip
> handling the ingress case. And the entire existing sk_psock_backlog
> work queue as well, because it's handled by tcp_bpf_ingress_backlog?
>
Agreed, tcp_bpf_sendmsg_redir here is only for egress.
From my understanding, sk_psock_backlog handles the ingress skbs queued
on psock->ingress_skb:
[skb RX -> Redirect -> sk_msg (skb backed-up) RX]
On the other hand, tcp_bpf_ingress_backlog mainly focuses on moving the
corked sk_msgs from the sender socket's "backlog_msg" queue to the
receiver socket's psock->ingress_msg. These sk_msgs are redirected with
__SK_REDIRECT by tcp_bpf_sendmsg; in other words, their sk_msg->skb
should be NULL:
[sk_msg TX -> Redirect -> sk_msg (skb is NULL) RX]
IMHO, the two paths are mostly independent of each other.
>> + sent = origsize - msg->sg.size;
>> +
>> + lock_sock(sk);
>> + sk_mem_uncharge(sk, sent);
>> + }
>
> I like the direction but any blockers to just get this out of TLS as
> well? I'm happy to do it if needed I would prefer not to try and
> support both styles at the same time.
I haven't looked into TLS mainly because I'm not very familiar with it.
If you're interested, it would be great if you could take a look in the
future :)