Message-ID: <dffb3057-40cf-463b-a114-9c9c3770f09c@bytedance.com>
Date: Wed, 12 Mar 2025 11:20:01 -0700
From: Zijian Zhang <zijianzhang@...edance.com>
To: John Fastabend <john.fastabend@...il.com>,
Cong Wang <xiyou.wangcong@...il.com>
Cc: netdev@...r.kernel.org, bpf@...r.kernel.org, jakub@...udflare.com,
zhoufeng.zf@...edance.com, Amery Hung <amery.hung@...edance.com>,
Cong Wang <cong.wang@...edance.com>
Subject: Re: [Patch bpf-next v2 4/4] tcp_bpf: improve ingress redirection
performance with message corking
On 3/11/25 1:54 PM, John Fastabend wrote:
> On 2025-03-06 14:02:05, Cong Wang wrote:
>> From: Zijian Zhang <zijianzhang@...edance.com>
[...]
>> +static int tcp_bpf_ingress_backlog(struct sock *sk, struct sock *sk_redir,
>> + struct sk_msg *msg, u32 apply_bytes)
>> +{
>> + bool ingress_msg_empty = false;
>> + bool apply = apply_bytes;
>> + struct sk_psock *psock;
>> + struct sk_msg *tmp;
>> + u32 tot_size = 0;
>> + int ret = 0;
>> + u8 nonagle;
>> +
>> + psock = sk_psock_get(sk_redir);
>> + if (unlikely(!psock))
>> + return -EPIPE;
>> +
>> + spin_lock(&psock->backlog_msg_lock);
>> + /* If possible, coalesce the curr sk_msg to the last sk_msg from the
>> + * psock->backlog_msg.
>> + */
>> + if (!list_empty(&psock->backlog_msg)) {
>> + struct sk_msg *last;
>> +
>> + last = list_last_entry(&psock->backlog_msg, struct sk_msg, list);
>> + if (last->sk == sk) {
>> + int i = tcp_bpf_coalesce_msg(last, msg, &apply_bytes,
>> + &tot_size);
>> +
>> + if (i == msg->sg.end || (apply && !apply_bytes))
>> + goto out_unlock;
>> + }
>> + }
>> +
>> + /* Otherwise, allocate a new sk_msg and transfer the data from the
>> + * passed in msg to it.
>> + */
>> + tmp = sk_msg_alloc(GFP_ATOMIC);
>> + if (!tmp) {
>> + ret = -ENOMEM;
>> + spin_unlock(&psock->backlog_msg_lock);
>> + goto error;
>> + }
>> +
>> + tmp->sk = sk;
>> + sock_hold(tmp->sk);
>> + tmp->sg.start = msg->sg.start;
>> + tcp_bpf_xfer_msg(tmp, msg, &apply_bytes, &tot_size);
>> +
>> + ingress_msg_empty = list_empty(&psock->ingress_msg);
>> + list_add_tail(&tmp->list, &psock->backlog_msg);
>> +
>> +out_unlock:
>> + spin_unlock(&psock->backlog_msg_lock);
>> + sk_wmem_queued_add(sk, tot_size);
>> +
>> + /* At this point, the data has been handled well. If one of the
>> + * following conditions is met, we can notify the peer socket in
>> + * the context of this system call immediately.
>> + * 1. If the write buffer has been used up;
>> + * 2. Or, the message size is larger than TCP_BPF_GSO_SIZE;
>> + * 3. Or, the ingress queue was empty;
>> + * 4. Or, the tcp socket is set to no_delay.
>> + * Otherwise, kick off the backlog work so that we can have some
>> + * time to wait for any incoming messages before sending a
>> + * notification to the peer socket.
>> + */
>
> I think this could also be used to get bpf_msg_cork_bytes working
> directly in the receive path. This also means we can avoid using
> strparser in the receive path. The strparser case has noticeable
> overhead for us, significant enough that we don't use it.
> Not that we need to do it all in one patch set.
>
Sounds promising!
>> + nonagle = tcp_sk(sk)->nonagle;
>> + if (!sk_stream_memory_free(sk) ||
>> + tot_size >= TCP_BPF_GSO_SIZE || ingress_msg_empty ||
>> + (!(nonagle & TCP_NAGLE_CORK) && (nonagle & TCP_NAGLE_OFF))) {
>> + release_sock(sk);
>> + psock->backlog_work_delayed = false;
>> + sk_psock_backlog_msg(psock);
>> + lock_sock(sk);
>> + } else {
>> + sk_psock_run_backlog_work(psock, false);
>> + }
>> +
>> +error:
>> + sk_psock_put(sk_redir, psock);
>> + return ret;
>> +}
>> +
>> static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
>> struct sk_msg *msg, int *copied, int flags)
>> {
>> @@ -442,18 +619,24 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
>> cork = true;
>> psock->cork = NULL;
>> }
>> - release_sock(sk);
>>
>> - origsize = msg->sg.size;
>> - ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
>> - msg, tosend, flags);
>
> The only sticky bit blocking us from folding this entire
> tcp_bpf_sendmsg_redir logic out is the tls user, right?
>
Right, tls also uses tcp_bpf_sendmsg_redir.
>> - sent = origsize - msg->sg.size;
>> + if (redir_ingress) {
>> + ret = tcp_bpf_ingress_backlog(sk, sk_redir, msg, tosend);
>> + } else {
>> + release_sock(sk);
>> +
>> + origsize = msg->sg.size;
>> + ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
>> + msg, tosend, flags);
>
> Now sendmsg redir is really only for egress here, so we can skip
> handling the ingress case. And the entire existing sk_psock_backlog
> work queue as well, because it's handled by tcp_bpf_ingress_backlog?
>
Agreed, tcp_bpf_sendmsg_redir here is only for egress.
From my understanding, sk_psock_backlog handles the ingress skbs queued
on psock->ingress_skb:
[skb RX -> Redirect -> sk_msg (skb backed-up) RX]
On the other hand, tcp_bpf_ingress_backlog mainly focuses on moving the
corked sk_msgs from the sender socket's "backlog_msg" queue to the
receiver socket's psock->ingress_msg. These sk_msgs are redirected with
__SK_REDIRECT by tcp_bpf_sendmsg; in other words, their sk_msg->skb
should be NULL:
[sk_msg TX -> Redirect -> sk_msg (skb is NULL) RX]
IMHO, the two paths are mostly independent of each other.
>> + sent = origsize - msg->sg.size;
>> +
>> + lock_sock(sk);
>> + sk_mem_uncharge(sk, sent);
>> + }
>
> I like the direction but any blockers to just get this out of TLS as
> well? I'm happy to do it if needed I would prefer not to try and
> support both styles at the same time.
I haven't looked into TLS mainly because I'm not very familiar with it.
If you're interested, it would be great if you could take a look in the
future :)