Message-ID: <45f4d349-7b08-45d3-9bec-3ab75217f9b6@linux.dev>
Date: Tue, 15 Jul 2025 16:36:57 +0800
From: Menglong Dong <menglong.dong@...ux.dev>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>,
Menglong Dong <menglong8.dong@...il.com>
Cc: Steven Rostedt <rostedt@...dmis.org>, Jiri Olsa <jolsa@...nel.org>,
bpf <bpf@...r.kernel.org>, Menglong Dong <dongml2@...natelecom.cn>,
"H. Peter Anvin" <hpa@...or.com>, Martin KaFai Lau <martin.lau@...ux.dev>,
Eduard Zingerman <eddyz87@...il.com>, Song Liu <song@...nel.org>,
Yonghong Song <yonghong.song@...ux.dev>,
John Fastabend <john.fastabend@...il.com>, KP Singh <kpsingh@...nel.org>,
Stanislav Fomichev <sdf@...ichev.me>, Hao Luo <haoluo@...gle.com>,
LKML <linux-kernel@...r.kernel.org>,
Network Development <netdev@...r.kernel.org>
Subject: Re: [PATCH bpf-next v2 02/18] x86,bpf: add bpf_global_caller for
global trampoline
On 7/15/25 10:25, Alexei Starovoitov wrote:
> On Thu, Jul 3, 2025 at 5:17 AM Menglong Dong <menglong8.dong@...il.com> wrote:
>> +static __always_inline void
>> +do_origin_call(unsigned long *args, unsigned long *ip, int nr_args)
>> +{
>> + /* Following code will be optimized by the compiler, as nr_args
>> + * is a const, and there will be no condition here.
>> + */
>> + if (nr_args == 0) {
>> + asm volatile(
>> + RESTORE_ORIGIN_0 CALL_NOSPEC "\n"
>> + "movq %%rax, %0\n"
>> + : "=m"(args[nr_args]), ASM_CALL_CONSTRAINT
>> + : [args]"r"(args), [thunk_target]"r"(*ip)
>> + :
>> + );
>> + } else if (nr_args == 1) {
>> + asm volatile(
>> + RESTORE_ORIGIN_1 CALL_NOSPEC "\n"
>> + "movq %%rax, %0\n"
>> + : "=m"(args[nr_args]), ASM_CALL_CONSTRAINT
>> + : [args]"r"(args), [thunk_target]"r"(*ip)
>> + : "rdi"
>> + );
>> + } else if (nr_args == 2) {
>> + asm volatile(
>> + RESTORE_ORIGIN_2 CALL_NOSPEC "\n"
>> + "movq %%rax, %0\n"
>> + : "=m"(args[nr_args]), ASM_CALL_CONSTRAINT
>> + : [args]"r"(args), [thunk_target]"r"(*ip)
>> + : "rdi", "rsi"
>> + );
>> + } else if (nr_args == 3) {
>> + asm volatile(
>> + RESTORE_ORIGIN_3 CALL_NOSPEC "\n"
>> + "movq %%rax, %0\n"
>> + : "=m"(args[nr_args]), ASM_CALL_CONSTRAINT
>> + : [args]"r"(args), [thunk_target]"r"(*ip)
>> + : "rdi", "rsi", "rdx"
>> + );
>> + } else if (nr_args == 4) {
>> + asm volatile(
>> + RESTORE_ORIGIN_4 CALL_NOSPEC "\n"
>> + "movq %%rax, %0\n"
>> + : "=m"(args[nr_args]), ASM_CALL_CONSTRAINT
>> + : [args]"r"(args), [thunk_target]"r"(*ip)
>> + : "rdi", "rsi", "rdx", "rcx"
>> + );
>> + } else if (nr_args == 5) {
>> + asm volatile(
>> + RESTORE_ORIGIN_5 CALL_NOSPEC "\n"
>> + "movq %%rax, %0\n"
>> + : "=m"(args[nr_args]), ASM_CALL_CONSTRAINT
>> + : [args]"r"(args), [thunk_target]"r"(*ip)
>> + : "rdi", "rsi", "rdx", "rcx", "r8"
>> + );
>> + } else if (nr_args == 6) {
>> + asm volatile(
>> + RESTORE_ORIGIN_6 CALL_NOSPEC "\n"
>> + "movq %%rax, %0\n"
>> + : "=m"(args[nr_args]), ASM_CALL_CONSTRAINT
>> + : [args]"r"(args), [thunk_target]"r"(*ip)
>> + : "rdi", "rsi", "rdx", "rcx", "r8", "r9"
>> + );
>> + }
>> +}
> What is the performance difference between 0-6 variants?
> I would think save/restore of regs shouldn't be that expensive.
> bpf trampoline saves only what's necessary because it can do
> this micro optimization, but for this one, I think, doing
> _one_ global trampoline that covers all cases will simplify the code
> a lot, but please benchmark the difference to understand
> the trade-off.
According to my benchmark, the *5*-args variant has ~5% overhead compared
with the *0*-args variant. The save/restore of regs is fast, but it still
needs 12 insns, which can produce ~6% overhead. I think the performance is
more important here, and we should keep this logic. Should we?

If you think do_origin_call() is not simple enough, we can restore all 6
regs from the stack directly for the origin call, which won't introduce
too much overhead, and keep the save/restore logic. What do you think?
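
For reference, a minimal sketch of that single-path variant, just to show
what I mean (untested; it reuses RESTORE_ORIGIN_6, CALL_NOSPEC and
ASM_CALL_CONSTRAINT from the patch and the arch headers, and the
do_origin_call_single() name is made up):

static __always_inline void
do_origin_call_single(unsigned long *args, unsigned long *ip, int nr_args)
{
	/* Unconditionally restore all 6 arg regs from the stack, so one
	 * code path covers every nr_args; rax is still stored back into
	 * args[nr_args] as the return value slot.
	 */
	asm volatile(
		RESTORE_ORIGIN_6 CALL_NOSPEC "\n"
		"movq %%rax, %0\n"
		: "=m"(args[nr_args]), ASM_CALL_CONSTRAINT
		: [args]"r"(args), [thunk_target]"r"(*ip)
		: "rdi", "rsi", "rdx", "rcx", "r8", "r9"
	);
}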
>
> The major simplification will be due to skipping nr_args.
> There won't be a need to do btf model and count the args.
> Just do one trampoline for them all.
>
> Also funcs with 7+ arguments need to be thought through
> from the start.
In the current version, the attachment will be rejected if any of the
functions has 7+ arguments.
> I think it's ok trade-off if we allow global trampoline
> to be safe to attach to a function with 7+ args (and
> it will not mess with the stack), but bpf prog can only
> access up to 6 args. The kfuncs to access arg 7 might be
> more complex and slower. It's ok trade off.
That's OK for fentry-multi, but we can't allow fexit-multi and
modify_return-multi to be attached to functions with 7+ args, as we need
to do the origin call, and we can't recover the on-stack arguments for the
origin call for now.

So for fentry-multi we can allow functions with 7+ args to be attached as
long as all the accessed arguments are in regs. I think we need one more
patch to do the "all accessed arguments are in regs" checking, so maybe we
can put it in the next series, as the current series is already a little
complex :/

Anyway, I'll have a try and see if we can add this part in this series :)
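
Something like the following is what I have in mind for that check. It's
only a rough sketch: check_multi_arg_access() is a made-up name, and it
assumes max_ctx_offset is maintained by the verifier for tracing progs the
same way it is for other prog types:

/* Hypothetical helper: allow fentry-multi on a function with 7+ args
 * only when the prog never reads an argument that is passed on the
 * stack. On x86-64 the first 6 args live in regs and each ctx slot is
 * 8 bytes, so any ctx offset beyond 6 * 8 means a stack argument.
 */
static int check_multi_arg_access(const struct bpf_prog *prog)
{
	if (prog->aux->max_ctx_offset > 6 * sizeof(u64))
		return -EOPNOTSUPP;
	return 0;
}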
>
>> +
>> +static __always_inline notrace void
>> +run_tramp_prog(struct kfunc_md_tramp_prog *tramp_prog,
>> + struct bpf_tramp_run_ctx *run_ctx, unsigned long *args)
>> +{
>> + struct bpf_prog *prog;
>> + u64 start_time;
>> +
>> + while (tramp_prog) {
>> + prog = tramp_prog->prog;
>> + run_ctx->bpf_cookie = tramp_prog->cookie;
>> + start_time = bpf_gtramp_enter(prog, run_ctx);
>> +
>> + if (likely(start_time)) {
>> + asm volatile(
>> + CALL_NOSPEC "\n"
>> + : : [thunk_target]"r"(prog->bpf_func), [args]"D"(args)
>> + );
> Why this cannot be "call *(prog->bpf_func)" ?
Do you mean "prog->bpf_func(args, NULL);"? In my previous testing, that
caused bad performance, and I saw others do the indirect call this way.
I just ran the benchmark again, and the performance doesn't seem to be
affected anymore, so I think I can replace it with
"prog->bpf_func(args, NULL);" in the next version.
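
I.e. run_tramp_prog() would then look something like this (untested
sketch, keeping everything else from the patch as-is):

static __always_inline notrace void
run_tramp_prog(struct kfunc_md_tramp_prog *tramp_prog,
	       struct bpf_tramp_run_ctx *run_ctx, unsigned long *args)
{
	struct bpf_prog *prog;
	u64 start_time;

	while (tramp_prog) {
		prog = tramp_prog->prog;
		run_ctx->bpf_cookie = tramp_prog->cookie;
		start_time = bpf_gtramp_enter(prog, run_ctx);

		/* plain C indirect call instead of the CALL_NOSPEC asm;
		 * the compiler already emits a retpoline-safe thunk for
		 * it when the mitigation is enabled.
		 */
		if (likely(start_time))
			prog->bpf_func(args, NULL);

		bpf_gtramp_exit(prog, start_time, run_ctx);
		tramp_prog = tramp_prog->next;
	}
}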
>
>> + }
>> +
>> + bpf_gtramp_exit(prog, start_time, run_ctx);
>> + tramp_prog = tramp_prog->next;
>> + }
>> +}
>> +
>> +static __always_inline notrace int
>> +bpf_global_caller_run(unsigned long *args, unsigned long *ip, int nr_args)
> Pls share top 10 from "perf report" while running the bench.
> I'm curious about what's hot.
> Last time I benchmarked fentry/fexit migrate_disable/enable were
> one the hottest functions. I suspect it's the case here as well.
You are right, the migrate_disable/enable are the hottest functions in
both the bpf trampoline and the global trampoline. Following is the perf
top for fentry-multi:

 36.36%  bpf_prog_2dcccf652aac1793_bench_trigger_fentry_multi  [k] bpf_prog_2dcccf652aac1793_bench_trigger_fentry_multi
 20.54%  [kernel]    [k] migrate_enable
 19.35%  [kernel]    [k] bpf_global_caller_5_run
  6.52%  [kernel]    [k] bpf_global_caller_5
  3.58%  libc.so.6   [.] syscall
  2.88%  [kernel]    [k] entry_SYSCALL_64
  1.50%  [kernel]    [k] memchr_inv
  1.39%  [kernel]    [k] fput
  1.04%  [kernel]    [k] migrate_disable
  0.91%  [kernel]    [k] _copy_to_user
And I also did the testing for fentry:

 54.63%  bpf_prog_2dcccf652aac1793_bench_trigger_fentry  [k] bpf_prog_2dcccf652aac1793_bench_trigger_fentry
 10.43%  [kernel]    [k] migrate_enable
 10.07%  bpf_trampoline_6442517037  [k] bpf_trampoline_6442517037
  8.06%  [kernel]    [k] __bpf_prog_exit_recur
  4.11%  libc.so.6   [.] syscall
  2.15%  [kernel]    [k] entry_SYSCALL_64
  1.48%  [kernel]    [k] memchr_inv
  1.32%  [kernel]    [k] fput
  1.16%  [kernel]    [k] _copy_to_user
  0.73%  [kernel]    [k] bpf_prog_test_run_raw_tp
The migrate_enable/disable are used to do the recursion checking, and I
even wanted to perform the recursion check in the same way as ftrace does
to eliminate this overhead :/
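
To illustrate what I mean by the ftrace way, here is a sketch based on the
existing trace-recursion helpers (not something this series does, and the
gtramp_recursion_*() names are made up; it only covers the recursion-check
part and wouldn't remove whatever else needs migration disabled):

#include <linux/trace_recursion.h>

/* per-cpu recursion bits, as ftrace callbacks use, instead of the
 * migrate_disable/enable pair in the enter/exit path
 */
static __always_inline int gtramp_recursion_enter(unsigned long ip)
{
	/* returns the acquired bit, or a negative value on recursion */
	return ftrace_test_recursion_trylock(ip, 0);
}

static __always_inline void gtramp_recursion_exit(int bit)
{
	ftrace_test_recursion_unlock(bit);
}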
>