Message-ID: <06bb3780-7ba0-4d88-b212-5e5b7a1b92cb@bytedance.com>
Date: Wed, 5 Jun 2024 15:37:01 +0800
From: Feng Zhou <zhoufeng.zf@...edance.com>
To: Jakub Sitnicki <jakub@...udflare.com>
Cc: edumazet@...gle.com, ast@...nel.org, daniel@...earbox.net,
andrii@...nel.org, martin.lau@...ux.dev, eddyz87@...il.com, song@...nel.org,
yonghong.song@...ux.dev, john.fastabend@...il.com, kpsingh@...nel.org,
sdf@...gle.com, haoluo@...gle.com, jolsa@...nel.org, davem@...emloft.net,
dsahern@...nel.org, kuba@...nel.org, pabeni@...hat.com,
laoar.shao@...il.com, netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
bpf@...r.kernel.org, yangzhenze@...edance.com, wangdongdong.6@...edance.com
Subject: Re: Re: [PATCH bpf-next] bpf: tcp: Improve bpf write tcp opt
performance
On 2024/5/31 18:45, Jakub Sitnicki wrote:
> On Fri, May 17, 2024 at 03:27 PM +08, Feng Zhou wrote:
>> On 2024/5/17 01:15, Jakub Sitnicki wrote:
>>> On Thu, May 16, 2024 at 11:15 AM +08, Feng Zhou wrote:
>>>> On 2024/5/15 17:48, Jakub Sitnicki wrote:
>
> [...]
>
>>> If it's not the BPF prog, which you have ruled out, then where are we
>>> burning cycles? Maybe that is something that can be improved.
>>> Also, in terms of quantifying the improvement - it is 20% in terms of
>>> what? Throughput, pps, cycles? And was that a single data point? For
>>> multiple measurements there must be some variance (+/- X pp).
>>> Would be great to see some data to back it up.
>>> [...]
>>>
>>
>> Stress test method:
>>
>> server: sockperf sr --tcp -i x.x.x.x -p 7654 --daemonize
>> client: taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
>>
>> Default mode, no bpf prog:
>>
>> taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
>> sockperf: == version #3.10-23.gited92afb185e6 ==
>> sockperf[CLIENT] send on:
>> [ 0] IP = x.x.x.x PORT = 7654 # TCP
>> sockperf: Warmup stage (sending a few dummy messages)...
>> sockperf: Starting test...
>> sockperf: Test end (interrupted by timer)
>> sockperf: Test ended
>> sockperf: Total of 71520808 messages sent in 30.000 sec
>>
>> sockperf: NOTE: test was performed, using msg-size=1200. For getting maximum
>> throughput consider using --msg-size=1472
>> sockperf: Summary: Message Rate is 2384000 [msg/sec]
>> sockperf: Summary: BandWidth is 2728.271 MBps (21826.172 Mbps)
>>
>> perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
>> perf report
>>
>> 80.88%--sock_sendmsg
>> 79.53%--tcp_sendmsg
>> 42.48%--tcp_sendmsg_locked
>> 16.23%--_copy_from_iter
>> 4.24%--tcp_send_mss
>> 3.25%--tcp_current_mss
>>
>>
>> perf top -C 8
>>
>> 19.13% [kernel] [k] _raw_spin_lock_bh
>> 11.75% [kernel] [k] copy_user_enhanced_fast_string
>> 9.86% [kernel] [k] tcp_sendmsg_locked
>> 4.44% sockperf [.]
>> _Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
>> 4.16% libpthread-2.28.so [.] __libc_sendto
>> 3.85% [kernel] [k] syscall_return_via_sysret
>> 2.70% [kernel] [k] _copy_from_iter
>> 2.48% [kernel] [k] entry_SYSCALL_64
>> 2.33% [kernel] [k] native_queued_spin_lock_slowpath
>> 1.89% [kernel] [k] __virt_addr_valid
>> 1.77% [kernel] [k] __check_object_size
>> 1.75% [kernel] [k] __sys_sendto
>> 1.74% [kernel] [k] entry_SYSCALL_64_after_hwframe
>> 1.42% [kernel] [k] __fget_light
>> 1.28% [kernel] [k] tcp_push
>> 1.01% [kernel] [k] tcp_established_options
>> 0.97% [kernel] [k] tcp_send_mss
>> 0.94% [kernel] [k] syscall_exit_to_user_mode_prepare
>> 0.94% [kernel] [k] tcp_sendmsg
>> 0.86% [kernel] [k] tcp_current_mss
>>
>> Having bpf prog to write tcp opt in all pkts:
>>
>> taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
>> sockperf: == version #3.10-23.gited92afb185e6 ==
>> sockperf[CLIENT] send on:
>> [ 0] IP = x.x.x.x PORT = 7654 # TCP
>> sockperf: Warmup stage (sending a few dummy messages)...
>> sockperf: Starting test...
>> sockperf: Test end (interrupted by timer)
>> sockperf: Test ended
>> sockperf: Total of 60636218 messages sent in 30.000 sec
>>
>> sockperf: NOTE: test was performed, using msg-size=1200. For getting maximum
>> throughput consider using --msg-size=1472
>> sockperf: Summary: Message Rate is 2021185 [msg/sec]
>> sockperf: Summary: BandWidth is 2313.063 MBps (18504.501 Mbps)
>>
>> perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
>> perf report
>>
>> 80.30%--sock_sendmsg
>> 79.02%--tcp_sendmsg
>> 54.14%--tcp_sendmsg_locked
>> 12.82%--_copy_from_iter
>> 12.51%--tcp_send_mss
>> 11.77%--tcp_current_mss
>> 10.10%--tcp_established_options
>> 8.75%--bpf_skops_hdr_opt_len.isra.54
>> 5.71%--__cgroup_bpf_run_filter_sock_ops
>> 3.32%--bpf_prog_e7ccbf819f5be0d0_tcpopt
>> 6.61%--__tcp_push_pending_frames
>> 6.60%--tcp_write_xmit
>> 5.89%--__tcp_transmit_skb
>>
>> perf top -C 8
>>
>> 10.98% [kernel] [k] _raw_spin_lock_bh
>> 9.04% [kernel] [k] copy_user_enhanced_fast_string
>> 7.78% [kernel] [k] tcp_sendmsg_locked
>> 3.91% sockperf [.]
>> _Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
>> 3.46% libpthread-2.28.so [.] __libc_sendto
>> 3.35% [kernel] [k] syscall_return_via_sysret
>> 2.86% [kernel] [k] bpf_skops_hdr_opt_len.isra.54
>> 2.16% [kernel] [k] __htab_map_lookup_elem
>> 2.11% [kernel] [k] _copy_from_iter
>> 2.09% [kernel] [k] entry_SYSCALL_64
>> 1.97% [kernel] [k] __virt_addr_valid
>> 1.95% [kernel] [k] __cgroup_bpf_run_filter_sock_ops
>> 1.95% [kernel] [k] lookup_nulls_elem_raw
>> 1.89% [kernel] [k] __fget_light
>> 1.42% [kernel] [k] __sys_sendto
>> 1.41% [kernel] [k] entry_SYSCALL_64_after_hwframe
>> 1.31% [kernel] [k] native_queued_spin_lock_slowpath
>> 1.22% [kernel] [k] __check_object_size
>> 1.18% [kernel] [k] tcp_established_options
>> 1.04% bpf_prog_e7ccbf819f5be0d0_tcpopt [k] bpf_prog_e7ccbf819f5be0d0_tcpopt
>>
>> Comparing the above test results with one CPU fully loaded, the upper
>> limit of message rate / bandwidth drops by nearly 18-20%. Looking at
>> the CPU occupancy, "tcp_send_mss" increases significantly.
>
> This helps prove the point, but what I actually had in mind is to check
> "perf annotate bpf_skops_hdr_opt_len" and see if there any low hanging
> fruit there which we can optimize.
>
> For instance, when I benchmark it in a VM, I see we're spending cycles
> mostly memset()/rep stos. I have no idea where the cycles are spent in
> your case.
>
>>
How do you run your benchmark? Could you send it to me so I can try it,
or you could try my stress test method above. Have you checked how often
bpf_skops_hdr_opt_len and bpf_skops_write_hdr_opt are called?
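
For reference, counting those calls can be done with a minimal kprobe
prog along these lines (illustrative sketch only; note that
bpf_skops_hdr_opt_len shows up above as an .isra clone, so attaching by
symbol name may not work on every kernel build):

/* count_skops_calls.bpf.c - illustrative sketch only */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 2);
        __type(key, __u32);
        __type(value, __u64);
} calls SEC(".maps");

static __always_inline void bump(__u32 idx)
{
        __u64 *val = bpf_map_lookup_elem(&calls, &idx);

        if (val)
                __sync_fetch_and_add(val, 1);
}

SEC("kprobe/bpf_skops_hdr_opt_len")
int count_hdr_opt_len(struct pt_regs *ctx)
{
        bump(0);        /* index 0: option-length callbacks */
        return 0;
}

SEC("kprobe/bpf_skops_write_hdr_opt")
int count_write_hdr_opt(struct pt_regs *ctx)
{
        bump(1);        /* index 1: option-write callbacks */
        return 0;
}

char _license[] SEC("license") = "GPL";
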
>>>>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>>>>> index 90706a47f6ff..f2092de1f432 100644
>>>>>> --- a/tools/include/uapi/linux/bpf.h
>>>>>> +++ b/tools/include/uapi/linux/bpf.h
>>>>>> @@ -6892,8 +6892,14 @@ enum {
>>>>>> * options first before the BPF program does.
>>>>>> */
>>>>>> BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
>>>>>> + /* Fast path to reserve space in a skb under
>>>>>> + * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
>>>>>> + * The opt length rarely changes, so it can be cached in tcp_sock.
>>>>>> + * Set BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG to skip the bpf call.
>>>>>> + */
>>>>>> + BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),
>>>>> Have you considered a bpf_reserve_hdr_opt() flag instead?
>>>>> An example or test coverage to show this API extension in action
>>>>> would help.
>>>>>
>>>>
>>>> A bpf_reserve_hdr_opt() flag can't accomplish this. I want to
>>>> avoid triggering the bpf prog frequently before TSO, by giving
>>>> users a way to skip the bpf prog while the opt len is unchanged.
>>>> Then, when writing the opt, if the len changes, clear the flag and
>>>> update the opt len in the next packet.
>>> I haven't seen a sample using the API extension that you're proposing,
>>> so I can only guess. But you probably have something like:
>>> SEC("sockops")
>>> int sockops_prog(struct bpf_sock_ops *ctx)
>>> {
>>>         if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
>>>             ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
>>>                 bpf_reserve_hdr_opt(ctx, N, 0);
>>>                 bpf_sock_ops_cb_flags_set(ctx,
>>>                                           ctx->bpf_sock_ops_cb_flags |
>>>                                           MY_NEW_FLAG);
>>>                 return 1;
>>>         }
>>> }
>>
>> Yes, that's what I expected.
>>
>>> I don't understand why you're saying it can't be transformed into:
>>> int sockops_prog(struct bpf_sock_ops *ctx)
>>> {
>>>         if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
>>>             ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
>>>                 bpf_reserve_hdr_opt(ctx, N, MY_NEW_FLAG);
>>>                 return 1;
>>>         }
>>> }
>>
>> "bpf_reserve_hdr_opt (ctx, N, MY_NEW_FLAG);"
>>
>> I'm not sure what passing the flag parameter could do, other than let
>> "bpf_reserve_hdr_opt" return quickly? But that doesn't help, because
>> the cost of triggering the bpf prog at all is what's expensive, it is
>> still on the hot path of sending packets, and TSO has not happened
>> yet.
>>
>>> [...]
>
> This is not what I'm suggesting.
>
> bpf_reserve_hdr_opt() has access to bpf_sock_ops_kern and even the
> sock. You could either signal through bpf_sock_ops_kern to
> bpf_skops_hdr_opt_len() that it should not be called again.
>
> Or even configure the tcp_sock directly from bpf_reserve_hdr_opt()
> because it has access to it via bpf_sock_ops_kern{}.sk.
Oh, I see what you mean. This will achieve the goal.
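
Just to confirm I understand, roughly something like this? (A rough
sketch of the direction only; the flag name BPF_F_CACHE_HDR_OPT_LEN and
the cached-length field in tcp_sock below are made-up names for
illustration, not a real patch.)

/* Hypothetical flag for bpf_reserve_hdr_opt(), illustration only. */
#define BPF_F_CACHE_HDR_OPT_LEN         (1ULL << 0)

/* In the kernel side of bpf_reserve_hdr_opt(), which already has the
 * socket via bpf_sock_ops_kern{}.sk, remember the reserved length when
 * the prog asks for it:
 */
static void cache_hdr_opt_len(struct bpf_sock_ops_kern *bpf_sock,
                              u32 len, u64 flags)
{
        struct tcp_sock *tp = tcp_sk(bpf_sock->sk);

        if (flags & BPF_F_CACHE_HDR_OPT_LEN)
                tp->bpf_cached_hdr_opt_len = len;       /* hypothetical field */
}

/* Then bpf_skops_hdr_opt_len() could consume the cached value and skip
 * the __cgroup_bpf_run_filter_sock_ops() call while it stays valid:
 */
static void bpf_skops_hdr_opt_len_cached(struct sock *sk,
                                         unsigned int *remaining)
{
        struct tcp_sock *tp = tcp_sk(sk);

        if (tp->bpf_cached_hdr_opt_len) {               /* hypothetical field */
                *remaining -= tp->bpf_cached_hdr_opt_len;
                return;                                 /* no bpf prog call */
        }
        /* ... otherwise fall back to running the sockops prog ... */
}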