Date: Thu, 16 May 2024 11:15:43 +0800
From: Feng Zhou <zhoufeng.zf@...edance.com>
To: Jakub Sitnicki <jakub@...udflare.com>
Cc: edumazet@...gle.com, ast@...nel.org, daniel@...earbox.net,
 andrii@...nel.org, martin.lau@...ux.dev, eddyz87@...il.com, song@...nel.org,
 yonghong.song@...ux.dev, john.fastabend@...il.com, kpsingh@...nel.org,
 sdf@...gle.com, haoluo@...gle.com, jolsa@...nel.org, davem@...emloft.net,
 dsahern@...nel.org, kuba@...nel.org, pabeni@...hat.com,
 laoar.shao@...il.com, netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
 bpf@...r.kernel.org, yangzhenze@...edance.com, wangdongdong.6@...edance.com
Subject: Re: Re: [PATCH bpf-next] bpf: tcp: Improve bpf write tcp opt
 performance

On 2024/5/15 17:48, Jakub Sitnicki wrote:
> On Wed, May 15, 2024 at 04:19 PM +08, Feng zhou wrote:
>> From: Feng Zhou <zhoufeng.zf@...edance.com>
>>
>> When the bpf prog writes a tcp option on every packet, testing showed
>> a throughput loss of about 20%. For a packet that wants to write a tcp
>> option, the bpf prog is triggered three times: "tcp_send_mss" is
>> called to calculate mss_cache, "tcp_established_options" to reserve
>> the tcp opt len, and "bpf_skops_write_hdr_opt" to write the tcp opt,
>> but "tcp_send_mss" runs before TSO. Tracking with bpftrace showed that
>> during the stress test "tcp_send_mss" was called about 900k times per
>> second. Since the opt len does not change often, consider caching it
>> as an optimization.
> 
> You could also make your BPF sock_ops program cache the value and return
> the cached value when called for BPF_SOCK_OPS_HDR_OPT_LEN_CB.
> 
> If that is in your opinion prohibitively expensive, then it would be good
> to see a sample program and CPU cycle measurements (bpftool prog profile).
> 
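
Something like the following sockops sketch could cache the length in
socket storage and hand it back on every BPF_SOCK_OPS_HDR_OPT_LEN_CB
(the map name and the fixed 4-byte option are assumptions for
illustration, not code from this thread):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, __u8);		/* cached option length */
} opt_len_cache SEC(".maps");

SEC("sockops")
int cache_hdr_opt_len(struct bpf_sock_ops *skops)
{
	struct bpf_sock *sk = skops->sk;
	__u8 *len;

	if (skops->op != BPF_SOCK_OPS_HDR_OPT_LEN_CB || !sk)
		return 1;

	len = bpf_sk_storage_get(&opt_len_cache, sk, NULL,
				 BPF_SK_STORAGE_GET_F_CREATE);
	if (!len)
		return 1;

	if (!*len)
		*len = 4;	/* computed once, then reused */

	bpf_reserve_hdr_opt(skops, *len, 0);
	return 1;
}

char _license[] SEC("license") = "GPL";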

The overhead I'm referring to isn't the time spent inside the bpf prog
itself. I tested with a bpf prog that does nothing and returns
immediately, and the loss is still 20%. During the stress test, the
per-second call frequencies of "tcp_send_mss" and "__tcp_transmit_skb"
were:

@[
     bpf_skops_hdr_opt_len.isra.46+1
     tcp_established_options+730
     tcp_current_mss+81
     tcp_send_mss+23
     tcp_sendmsg_locked+285
     tcp_sendmsg+58
     sock_sendmsg+48
     sock_write_iter+151
     new_sync_write+296
     vfs_write+165
     ksys_write+89
     do_syscall_64+89
     entry_SYSCALL_64_after_hwframe+68
]: 3671671

@[
     bpf_skops_write_hdr_opt.isra.47+1
     __tcp_transmit_skb+761
     tcp_write_xmit+822
     __tcp_push_pending_frames+52
     tcp_close+813
     inet_release+60
     __sock_release+55
     sock_close+17
     __fput+179
     task_work_run+112
     exit_to_usermode_loop+245
     do_syscall_64+456
     entry_SYSCALL_64_after_hwframe+68
]: 36125

"tcp_send_mss" before TSO, without packet aggregation, and
"__tcp_transmit_skb" after TSO, the gap between the two is
100 times.

>>
>> Signed-off-by: Feng Zhou <zhoufeng.zf@...edance.com>
>> ---
>>   include/linux/tcp.h            |  3 +++
>>   include/uapi/linux/bpf.h       |  8 +++++++-
>>   net/ipv4/tcp_output.c          | 12 +++++++++++-
>>   tools/include/uapi/linux/bpf.h |  8 +++++++-
>>   4 files changed, 28 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
>> index 6a5e08b937b3..74437fcf94a2 100644
>> --- a/include/linux/tcp.h
>> +++ b/include/linux/tcp.h
>> @@ -455,6 +455,9 @@ struct tcp_sock {
>>   					  * to recur itself by calling
>>   					  * bpf_setsockopt(TCP_CONGESTION, "itself").
>>   					  */
>> +	u8	bpf_opt_len;		/* cached tcp opt len for the
>> +					 * BPF_SOCK_OPS_HDR_OPT_LEN_CB fast path
>> +					 */
>>   #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
>>   #else
>>   #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 90706a47f6ff..f2092de1f432 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -6892,8 +6892,14 @@ enum {
>>   	 * options first before the BPF program does.
>>   	 */
>>   	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
>> +	/* Fast path to reserve space in a skb under
>> +	 * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
>> +	 * The opt length rarely changes, so it can be cached in the tcp_sock.
>> +	 * Set BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG to skip the bpf call.
>> +	 */
>> +	BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),
>>   /* Mask of all currently supported cb flags */
>> -	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
>> +	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xFF,
>>   };
>>   
>>   /* List of known BPF sock_ops operators.
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index ea7ad7d99245..0e7480a58012 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -488,12 +488,21 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
>>   {
>>   	struct bpf_sock_ops_kern sock_ops;
>>   	int err;
>> +	struct tcp_sock *th = (struct tcp_sock *)sk;
>>   
>> -	if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
>> +	if (likely(!BPF_SOCK_OPS_TEST_FLAG(th,
>>   					   BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)) ||
>>   	    !*remaining)
>>   		return;
>>   
>> +	if (likely(BPF_SOCK_OPS_TEST_FLAG(th,
>> +					  BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG)) &&
>> +	    th->bpf_opt_len) {
>> +		*remaining -= th->bpf_opt_len;
> 
> What if the *remaining value shrinks from one call to the next?
> 
> The BPF sock_ops program can't react to the change. Feels like there
> should be a safety check to prevent an underflow.
> 

Thanks for the reminder, I'll add a check for that.
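
Roughly like this (a sketch against the hunk above, not the final code):

	if (likely(BPF_SOCK_OPS_TEST_FLAG(th,
					  BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG)) &&
	    th->bpf_opt_len &&
	    th->bpf_opt_len <= *remaining) {
		/* take the fast path only while the cached length still
		 * fits; otherwise fall through to the bpf call below
		 */
		*remaining -= th->bpf_opt_len;
		opts->bpf_opt_len = th->bpf_opt_len;
		return;
	}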

>> +		opts->bpf_opt_len = th->bpf_opt_len;
>> +		return;
>> +	}
>> +
>>   	/* *remaining has already been aligned to 4 bytes, so *remaining >= 4 */
>>   
>>   	/* init sock_ops */
>> @@ -538,6 +547,7 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
>>   	opts->bpf_opt_len = *remaining - sock_ops.remaining_opt_len;
>>   	/* round up to 4 bytes */
>>   	opts->bpf_opt_len = (opts->bpf_opt_len + 3) & ~3;
>> +	th->bpf_opt_len = opts->bpf_opt_len;
>>   
>>   	*remaining -= opts->bpf_opt_len;
>>   }
>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>> index 90706a47f6ff..f2092de1f432 100644
>> --- a/tools/include/uapi/linux/bpf.h
>> +++ b/tools/include/uapi/linux/bpf.h
>> @@ -6892,8 +6892,14 @@ enum {
>>   	 * options first before the BPF program does.
>>   	 */
>>   	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
>> +	/* Fast path to reserve space in a skb under
>> +	 * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
>> +	 * The opt length rarely changes, so it can be cached in the tcp_sock.
>> +	 * Set BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG to skip the bpf call.
>> +	 */
>> +	BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),
> 
> Have you considered a bpf_reserve_hdr_opt() flag instead?
> 
> An example or test coverage to show this API extension in action
> would help.
> 

A bpf_reserve_hdr_opt() flag can't accomplish this. What I want to
optimize is that the bpf prog is not triggered so frequently before
TSO: give users a way to skip the bpf call entirely while the opt len
is unchanged. Then, when writing the option, if the len does change,
clear the flag and update the opt len on the next packet.
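
Roughly, in "bpf_skops_write_hdr_opt" (a sketch of the idea only, not
code in this patch; tp is tcp_sk(sk)):

	/* if the length actually written differs from the cached one,
	 * drop the cache so the next packet takes the slow path again
	 */
	if (tp->bpf_opt_len != opts->bpf_opt_len) {
		tp->bpf_opt_len = 0;
		tp->bpf_sock_ops_cb_flags &=
			~BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG;
	}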

In the next version, I will add test cases.
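
For the selftest, the sock_ops side would opt in roughly like this
(illustration only; the CACHE flag exists only with this patch applied):

SEC("sockops")
int estab_set_flags(struct bpf_sock_ops *skops)
{
	if (skops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB ||
	    skops->op == BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB)
		bpf_sock_ops_cb_flags_set(skops,
					  skops->bpf_sock_ops_cb_flags |
					  BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG |
					  BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG);
	return 1;
}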

>>   /* Mask of all currently supported cb flags */
>> -	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
>> +	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xFF,
>>   };
>>   
>>   /* List of known BPF sock_ops operators.

