[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <875xuuxntc.fsf@cloudflare.com>
Date: Fri, 31 May 2024 12:45:03 +0200
From: Jakub Sitnicki <jakub@...udflare.com>
To: Feng Zhou <zhoufeng.zf@...edance.com>
Cc: edumazet@...gle.com, ast@...nel.org, daniel@...earbox.net,
andrii@...nel.org, martin.lau@...ux.dev, eddyz87@...il.com,
song@...nel.org, yonghong.song@...ux.dev, john.fastabend@...il.com,
kpsingh@...nel.org, sdf@...gle.com, haoluo@...gle.com,
jolsa@...nel.org, davem@...emloft.net, dsahern@...nel.org,
kuba@...nel.org, pabeni@...hat.com, laoar.shao@...il.com,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
bpf@...r.kernel.org, yangzhenze@...edance.com,
wangdongdong.6@...edance.com
Subject: Re: [PATCH bpf-next] bpf: tcp: Improve bpf write tcp opt performance
On Fri, May 17, 2024 at 03:27 PM +08, Feng Zhou wrote:
> 在 2024/5/17 01:15, Jakub Sitnicki 写道:
>> On Thu, May 16, 2024 at 11:15 AM +08, Feng Zhou wrote:
>>> 在 2024/5/15 17:48, Jakub Sitnicki 写道:
[...]
>> If it's not the BPF prog, which you have ruled out, then where are we
>> burining cycles? Maybe that is something that can be improved.
>> Also, in terms on quantifying the improvement - it is 20% in terms of
>> what? Throughput, pps, cycles? And was that a single data point? For
>> multiple measurements there must be some variance (+/- X pp).
>> Would be great to see some data to back it up.
>> [...]
>>
>
> Pressure measurement method:
>
> server: sockperf sr --tcp -i x.x.x.x -p 7654 --daemonize
> client: taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
>
> Default mode, no bpf prog:
>
> taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
> sockperf: == version #3.10-23.gited92afb185e6 ==
> sockperf[CLIENT] send on:
> [ 0] IP = x.x.x.x PORT = 7654 # TCP
> sockperf: Warmup stage (sending a few dummy messages)...
> sockperf: Starting test...
> sockperf: Test end (interrupted by timer)
> sockperf: Test ended
> sockperf: Total of 71520808 messages sent in 30.000 sec
>
> sockperf: NOTE: test was performed, using msg-size=1200. For getting maximum
> throughput consider using --msg-size=1472
> sockperf: Summary: Message Rate is 2384000 [msg/sec]
> sockperf: Summary: BandWidth is 2728.271 MBps (21826.172 Mbps)
>
> perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
> perf report
>
> 80.88%--sock_sendmsg
> 79.53%--tcp_sendmsg
> 42.48%--tcp_sendmsg_locked
> 16.23%--_copy_from_iter
> 4.24%--tcp_send_mss
> 3.25%--tcp_current_mss
>
>
> perf top -C 8
>
> 19.13% [kernel] [k] _raw_spin_lock_bh
> 11.75% [kernel] [k] copy_user_enhanced_fast_string
> 9.86% [kernel] [k] tcp_sendmsg_locked
> 4.44% sockperf [.]
> _Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
> 4.16% libpthread-2.28.so [.] __libc_sendto
> 3.85% [kernel] [k] syscall_return_via_sysret
> 2.70% [kernel] [k] _copy_from_iter
> 2.48% [kernel] [k] entry_SYSCALL_64
> 2.33% [kernel] [k] native_queued_spin_lock_slowpath
> 1.89% [kernel] [k] __virt_addr_valid
> 1.77% [kernel] [k] __check_object_size
> 1.75% [kernel] [k] __sys_sendto
> 1.74% [kernel] [k] entry_SYSCALL_64_after_hwframe
> 1.42% [kernel] [k] __fget_light
> 1.28% [kernel] [k] tcp_push
> 1.01% [kernel] [k] tcp_established_options
> 0.97% [kernel] [k] tcp_send_mss
> 0.94% [kernel] [k] syscall_exit_to_user_mode_prepare
> 0.94% [kernel] [k] tcp_sendmsg
> 0.86% [kernel] [k] tcp_current_mss
>
> Having bpf prog to write tcp opt in all pkts:
>
> taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
> sockperf: == version #3.10-23.gited92afb185e6 ==
> sockperf[CLIENT] send on:
> [ 0] IP = x.x.x.x PORT = 7654 # TCP
> sockperf: Warmup stage (sending a few dummy messages)...
> sockperf: Starting test...
> sockperf: Test end (interrupted by timer)
> sockperf: Test ended
> sockperf: Total of 60636218 messages sent in 30.000 sec
>
> sockperf: NOTE: test was performed, using msg-size=1200. For getting maximum
> throughput consider using --msg-size=1472
> sockperf: Summary: Message Rate is 2021185 [msg/sec]
> sockperf: Summary: BandWidth is 2313.063 MBps (18504.501 Mbps)
>
> perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
> perf report
>
> 80.30%--sock_sendmsg
> 79.02%--tcp_sendmsg
> 54.14%--tcp_sendmsg_locked
> 12.82%--_copy_from_iter
> 12.51%--tcp_send_mss
> 11.77%--tcp_current_mss
> 10.10%--tcp_established_options
> 8.75%--bpf_skops_hdr_opt_len.isra.54
> 5.71%--__cgroup_bpf_run_filter_sock_ops
> 3.32%--bpf_prog_e7ccbf819f5be0d0_tcpopt
> 6.61%--__tcp_push_pending_frames
> 6.60%--tcp_write_xmit
> 5.89%--__tcp_transmit_skb
>
> perf top -C 8
>
> 10.98% [kernel] [k] _raw_spin_lock_bh
> 9.04% [kernel] [k] copy_user_enhanced_fast_string
> 7.78% [kernel] [k] tcp_sendmsg_locked
> 3.91% sockperf [.]
> _Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
> 3.46% libpthread-2.28.so [.] __libc_sendto
> 3.35% [kernel] [k] syscall_return_via_sysret
> 2.86% [kernel] [k] bpf_skops_hdr_opt_len.isra54
> 2.16% [kernel] [k] __htab_map_lookup_elem
> 2.11% [kernel] [k] _copy_from_iter
> 2.09% [kernel] [k] entry_SYSCALL_64
> 1.97% [kernel] [k] __virt_addr_valid
> 1.95% [kernel] [k] __cgroup_bpf_run_filter_sock_ops
> 1.95% [kernel] [k] lookup_nulls_elem_raw
> 1.89% [kernel] [k] __fget_light
> 1.42% [kernel] [k] __sys_sendto
> 1.41% [kernel] [k] entry_SYSCALL_64_after_hwframe
> 1.31% [kernel] [k] native_queued_spin_lock_slowpath
> 1.22% [kernel] [k] __check_object_size
> 1.18% [kernel] [k] tcp_established_options
> 1.04% bpf_prog_e7ccbf819f5be0d0_tcpopt [k] bpf_prog_e7ccbf819f5be0d0_tcpopt
>
> Compare the above test results, fill up a CPU, you can find that
> the upper limit of qps or BandWidth has a loss of nearly 18-20%.
> Then CPU occupancy, you can find that "tcp_send_mss" has increased
> significantly.
This helps prove the point, but what I actually had in mind is to check
"perf annotate bpf_skops_hdr_opt_len" and see if there any low hanging
fruit there which we can optimize.
For instance, when I benchmark it in a VM, I see we're spending cycles
mostly memset()/rep stos. I have no idea where the cycles are spent in
your case.
>
>>>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>>>> index 90706a47f6ff..f2092de1f432 100644
>>>>> --- a/tools/include/uapi/linux/bpf.h
>>>>> +++ b/tools/include/uapi/linux/bpf.h
>>>>> @@ -6892,8 +6892,14 @@ enum {
>>>>> * options first before the BPF program does.
>>>>> */
>>>>> BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
>>>>> + /* Fast path to reserve space in a skb under
>>>>> + * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
>>>>> + * opt length doesn't change often, so it can save in the tcp_sock. And
>>>>> + * set BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG to no bpf call.
>>>>> + */
>>>>> + BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),
>>>> Have you considered a bpf_reserve_hdr_opt() flag instead?
>>>> An example or test coverage would to show this API extension in action
>>>> would help.
>>>>
>>>
>>> bpf_reserve_hdr_opt () flag can't finish this. I want to optimize
>>> that bpf prog will not be triggered frequently before TSO. Provide
>>> a way for users to not trigger bpf prog when opt len is unchanged.
>>> Then when writing opt, if len changes, clear the flag, and then
>>> change opt len in the next package.
>> I haven't seen a sample using the API extenstion that you're proposing,
>> so I can only guess. But you probably have something like:
>> SEC("sockops")
>> int sockops_prog(struct bpf_sock_ops *ctx)
>> {
>> if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
>> ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
>> bpf_reserve_hdr_opt(ctx, N, 0);
>> bpf_sock_ops_cb_flags_set(ctx,
>> ctx->bpf_sock_ops_cb_flags |
>> MY_NEW_FLAG);
>> return 1;
>> }
>> }
>
> Yes, that's what I expected.
>
>> I don't understand why you're saying it can't be transformed into:
>> int sockops_prog(struct bpf_sock_ops *ctx)
>> {
>> if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
>> ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
>> bpf_reserve_hdr_opt(ctx, N, MY_NEW_FLAG);
>> return 1;
>> }
>> }
>
> "bpf_reserve_hdr_opt (ctx, N, MY_NEW_FLAG);"
>
> I don't know what I can do to pass the flag parameter, let
> "bpf_reserve_hdr_opt" return quickly? But this is not useful,
> because the loss caused by the triggering of bpf prog is very
> expensive, and it is still on the hotspot function of sending
> packets, and the TSO has not been completed yet.
>
>> [...]
This is not what I'm suggesting.
bpf_reserve_hdr_opt() has access to bpf_sock_ops_kern and even the
sock. You could either signal through bpf_sock_ops_kern to
bpf_skops_hdr_opt_len() that it should not be called again
Or even configure the tcp_sock directly from bpf_reserve_hdr_opt()
because it has access to it via bpf_sock_ops_kern{}.sk.
Powered by blists - more mailing lists