Message-ID: <06bb3780-7ba0-4d88-b212-5e5b7a1b92cb@bytedance.com>
Date: Wed, 5 Jun 2024 15:37:01 +0800
From: Feng Zhou <zhoufeng.zf@...edance.com>
To: Jakub Sitnicki <jakub@...udflare.com>
Cc: edumazet@...gle.com, ast@...nel.org, daniel@...earbox.net,
andrii@...nel.org, martin.lau@...ux.dev, eddyz87@...il.com, song@...nel.org,
yonghong.song@...ux.dev, john.fastabend@...il.com, kpsingh@...nel.org,
sdf@...gle.com, haoluo@...gle.com, jolsa@...nel.org, davem@...emloft.net,
dsahern@...nel.org, kuba@...nel.org, pabeni@...hat.com,
laoar.shao@...il.com, netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
bpf@...r.kernel.org, yangzhenze@...edance.com, wangdongdong.6@...edance.com
Subject: Re: Re: [PATCH bpf-next] bpf: tcp: Improve bpf write tcp opt
performance
On 2024/5/31 18:45, Jakub Sitnicki wrote:
> On Fri, May 17, 2024 at 03:27 PM +08, Feng Zhou wrote:
>> On 2024/5/17 01:15, Jakub Sitnicki wrote:
>>> On Thu, May 16, 2024 at 11:15 AM +08, Feng Zhou wrote:
>>>> On 2024/5/15 17:48, Jakub Sitnicki wrote:
>
> [...]
>
>>> If it's not the BPF prog, which you have ruled out, then where are we
>>> burning cycles? Maybe that is something that can be improved.
>>> Also, in terms of quantifying the improvement - it is 20% in terms of
>>> what? Throughput, pps, cycles? And was that a single data point? For
>>> multiple measurements there must be some variance (+/- X pp).
>>> Would be great to see some data to back it up.
>>> [...]
>>>
>>
>> Stress test method:
>>
>> server: sockperf sr --tcp -i x.x.x.x -p 7654 --daemonize
>> client: taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
>>
>> Default mode, no bpf prog:
>>
>> taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
>> sockperf: == version #3.10-23.gited92afb185e6 ==
>> sockperf[CLIENT] send on:
>> [ 0] IP = x.x.x.x PORT = 7654 # TCP
>> sockperf: Warmup stage (sending a few dummy messages)...
>> sockperf: Starting test...
>> sockperf: Test end (interrupted by timer)
>> sockperf: Test ended
>> sockperf: Total of 71520808 messages sent in 30.000 sec
>>
>> sockperf: NOTE: test was performed, using msg-size=1200. For getting maximum
>> throughput consider using --msg-size=1472
>> sockperf: Summary: Message Rate is 2384000 [msg/sec]
>> sockperf: Summary: BandWidth is 2728.271 MBps (21826.172 Mbps)
>>
>> perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
>> perf report
>>
>> 80.88%--sock_sendmsg
>> 79.53%--tcp_sendmsg
>> 42.48%--tcp_sendmsg_locked
>> 16.23%--_copy_from_iter
>> 4.24%--tcp_send_mss
>> 3.25%--tcp_current_mss
>>
>>
>> perf top -C 8
>>
>> 19.13% [kernel] [k] _raw_spin_lock_bh
>> 11.75% [kernel] [k] copy_user_enhanced_fast_string
>> 9.86% [kernel] [k] tcp_sendmsg_locked
>> 4.44% sockperf [.]
>> _Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
>> 4.16% libpthread-2.28.so [.] __libc_sendto
>> 3.85% [kernel] [k] syscall_return_via_sysret
>> 2.70% [kernel] [k] _copy_from_iter
>> 2.48% [kernel] [k] entry_SYSCALL_64
>> 2.33% [kernel] [k] native_queued_spin_lock_slowpath
>> 1.89% [kernel] [k] __virt_addr_valid
>> 1.77% [kernel] [k] __check_object_size
>> 1.75% [kernel] [k] __sys_sendto
>> 1.74% [kernel] [k] entry_SYSCALL_64_after_hwframe
>> 1.42% [kernel] [k] __fget_light
>> 1.28% [kernel] [k] tcp_push
>> 1.01% [kernel] [k] tcp_established_options
>> 0.97% [kernel] [k] tcp_send_mss
>> 0.94% [kernel] [k] syscall_exit_to_user_mode_prepare
>> 0.94% [kernel] [k] tcp_sendmsg
>> 0.86% [kernel] [k] tcp_current_mss
>>
>> Having bpf prog to write tcp opt in all pkts:
>>
>> taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
>> sockperf: == version #3.10-23.gited92afb185e6 ==
>> sockperf[CLIENT] send on:
>> [ 0] IP = x.x.x.x PORT = 7654 # TCP
>> sockperf: Warmup stage (sending a few dummy messages)...
>> sockperf: Starting test...
>> sockperf: Test end (interrupted by timer)
>> sockperf: Test ended
>> sockperf: Total of 60636218 messages sent in 30.000 sec
>>
>> sockperf: NOTE: test was performed, using msg-size=1200. For getting maximum
>> throughput consider using --msg-size=1472
>> sockperf: Summary: Message Rate is 2021185 [msg/sec]
>> sockperf: Summary: BandWidth is 2313.063 MBps (18504.501 Mbps)
>>
>> perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
>> perf report
>>
>> 80.30%--sock_sendmsg
>> 79.02%--tcp_sendmsg
>> 54.14%--tcp_sendmsg_locked
>> 12.82%--_copy_from_iter
>> 12.51%--tcp_send_mss
>> 11.77%--tcp_current_mss
>> 10.10%--tcp_established_options
>> 8.75%--bpf_skops_hdr_opt_len.isra.54
>> 5.71%--__cgroup_bpf_run_filter_sock_ops
>> 3.32%--bpf_prog_e7ccbf819f5be0d0_tcpopt
>> 6.61%--__tcp_push_pending_frames
>> 6.60%--tcp_write_xmit
>> 5.89%--__tcp_transmit_skb
>>
>> perf top -C 8
>>
>> 10.98% [kernel] [k] _raw_spin_lock_bh
>> 9.04% [kernel] [k] copy_user_enhanced_fast_string
>> 7.78% [kernel] [k] tcp_sendmsg_locked
>> 3.91% sockperf [.]
>> _Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
>> 3.46% libpthread-2.28.so [.] __libc_sendto
>> 3.35% [kernel] [k] syscall_return_via_sysret
>> 2.86% [kernel] [k] bpf_skops_hdr_opt_len.isra.54
>> 2.16% [kernel] [k] __htab_map_lookup_elem
>> 2.11% [kernel] [k] _copy_from_iter
>> 2.09% [kernel] [k] entry_SYSCALL_64
>> 1.97% [kernel] [k] __virt_addr_valid
>> 1.95% [kernel] [k] __cgroup_bpf_run_filter_sock_ops
>> 1.95% [kernel] [k] lookup_nulls_elem_raw
>> 1.89% [kernel] [k] __fget_light
>> 1.42% [kernel] [k] __sys_sendto
>> 1.41% [kernel] [k] entry_SYSCALL_64_after_hwframe
>> 1.31% [kernel] [k] native_queued_spin_lock_slowpath
>> 1.22% [kernel] [k] __check_object_size
>> 1.18% [kernel] [k] tcp_established_options
>> 1.04% bpf_prog_e7ccbf819f5be0d0_tcpopt [k] bpf_prog_e7ccbf819f5be0d0_tcpopt
>>
>> Comparing the above test results with one CPU fully loaded, the upper
>> limit of message rate / bandwidth drops by nearly 18-20%. Looking at
>> the CPU occupancy, "tcp_send_mss" increases significantly.
>
> This helps prove the point, but what I actually had in mind is to check
> "perf annotate bpf_skops_hdr_opt_len" and see if there any low hanging
> fruit there which we can optimize.
>
> For instance, when I benchmark it in a VM, I see we're spending cycles
> mostly memset()/rep stos. I have no idea where the cycles are spent in
> your case.
>
>>
How do you run your benchmark? Could you send it to me so I can try it,
or you could try my stress test method above. Have you checked how often
bpf_skops_hdr_opt_len and bpf_skops_write_hdr_opt are called?
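
For reference, counting those calls can be done with a minimal kprobe
prog along these lines (illustrative sketch only; note that
bpf_skops_hdr_opt_len shows up above as an .isra clone, so attaching by
symbol name may not work on every kernel build):

/* count_skops_calls.bpf.c - illustrative sketch only */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 2);
        __type(key, __u32);
        __type(value, __u64);
} calls SEC(".maps");

static __always_inline void bump(__u32 idx)
{
        __u64 *val = bpf_map_lookup_elem(&calls, &idx);

        if (val)
                __sync_fetch_and_add(val, 1);
}

SEC("kprobe/bpf_skops_hdr_opt_len")
int count_hdr_opt_len(struct pt_regs *ctx)
{
        bump(0);        /* index 0: option-length callbacks */
        return 0;
}

SEC("kprobe/bpf_skops_write_hdr_opt")
int count_write_hdr_opt(struct pt_regs *ctx)
{
        bump(1);        /* index 1: option-write callbacks */
        return 0;
}

char _license[] SEC("license") = "GPL";
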
>>>>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>>>>> index 90706a47f6ff..f2092de1f432 100644
>>>>>> --- a/tools/include/uapi/linux/bpf.h
>>>>>> +++ b/tools/include/uapi/linux/bpf.h
>>>>>> @@ -6892,8 +6892,14 @@ enum {
>>>>>> * options first before the BPF program does.
>>>>>> */
>>>>>> BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
>>>>>> + /* Fast path to reserve space in a skb under
>>>>>> + * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
>>>>>> + * The opt length rarely changes, so it can be cached in tcp_sock.
>>>>>> + * Set BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG to skip the bpf call.
>>>>>> + */
>>>>>> + BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),
>>>>> Have you considered a bpf_reserve_hdr_opt() flag instead?
>>>>> An example or test coverage to show this API extension in action
>>>>> would help.
>>>>>
>>>>
>>>> A bpf_reserve_hdr_opt() flag can't accomplish this. I want to
>>>> avoid triggering the bpf prog frequently before TSO, by giving
>>>> users a way to skip the bpf prog while the opt len is unchanged.
>>>> Then, when writing the opt, if the len changes, clear the flag and
>>>> update the opt len in the next packet.
>>> I haven't seen a sample using the API extension that you're proposing,
>>> so I can only guess. But you probably have something like:
>>> SEC("sockops")
>>> int sockops_prog(struct bpf_sock_ops *ctx)
>>> {
>>>         if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
>>>             ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
>>>                 bpf_reserve_hdr_opt(ctx, N, 0);
>>>                 bpf_sock_ops_cb_flags_set(ctx,
>>>                                           ctx->bpf_sock_ops_cb_flags |
>>>                                           MY_NEW_FLAG);
>>>                 return 1;
>>>         }
>>> }
>>
>> Yes, that's what I expected.
>>
>>> I don't understand why you're saying it can't be transformed into:
>>> int sockops_prog(struct bpf_sock_ops *ctx)
>>> {
>>>         if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
>>>             ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
>>>                 bpf_reserve_hdr_opt(ctx, N, MY_NEW_FLAG);
>>>                 return 1;
>>>         }
>>> }
>>
>> "bpf_reserve_hdr_opt (ctx, N, MY_NEW_FLAG);"
>>
>> I'm not sure what passing the flag parameter could do, other than let
>> "bpf_reserve_hdr_opt" return quickly? But that doesn't help, because
>> the cost of triggering the bpf prog at all is what's expensive, it is
>> still on the hot path of sending packets, and TSO has not happened
>> yet.
>>
>>> [...]
>
> This is not what I'm suggesting.
>
> bpf_reserve_hdr_opt() has access to bpf_sock_ops_kern and even the
> sock. You could either signal through bpf_sock_ops_kern to
> bpf_skops_hdr_opt_len() that it should not be called again.
>
> Or even configure the tcp_sock directly from bpf_reserve_hdr_opt()
> because it has access to it via bpf_sock_ops_kern{}.sk.
Oh, I see what you mean. This will achieve the goal.
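
Just to confirm I understand, roughly something like this? (A rough
sketch of the direction only; the flag name BPF_F_CACHE_HDR_OPT_LEN and
the cached-length field in tcp_sock below are made-up names for
illustration, not a real patch.)

/* Hypothetical flag for bpf_reserve_hdr_opt(), illustration only. */
#define BPF_F_CACHE_HDR_OPT_LEN         (1ULL << 0)

/* In the kernel side of bpf_reserve_hdr_opt(), which already has the
 * socket via bpf_sock_ops_kern{}.sk, remember the reserved length when
 * the prog asks for it:
 */
static void cache_hdr_opt_len(struct bpf_sock_ops_kern *bpf_sock,
                              u32 len, u64 flags)
{
        struct tcp_sock *tp = tcp_sk(bpf_sock->sk);

        if (flags & BPF_F_CACHE_HDR_OPT_LEN)
                tp->bpf_cached_hdr_opt_len = len;       /* hypothetical field */
}

/* Then bpf_skops_hdr_opt_len() could consume the cached value and skip
 * the __cgroup_bpf_run_filter_sock_ops() call while it stays valid:
 */
static void bpf_skops_hdr_opt_len_cached(struct sock *sk,
                                         unsigned int *remaining)
{
        struct tcp_sock *tp = tcp_sk(sk);

        if (tp->bpf_cached_hdr_opt_len) {               /* hypothetical field */
                *remaining -= tp->bpf_cached_hdr_opt_len;
                return;                                 /* no bpf prog call */
        }
        /* ... otherwise fall back to running the sockops prog ... */
}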