Message-ID: <CAADnVQKQXcUxjJ2uYNu1nvhFYt=KhN8QYAiGXrt_YwUsjMFOuA@mail.gmail.com>
Date: Tue, 4 Nov 2025 18:12:06 -0800
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Menglong Dong <menglong.dong@...ux.dev>
Cc: Menglong Dong <menglong8.dong@...il.com>, Alexei Starovoitov <ast@...nel.org>, 
	Daniel Borkmann <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>, 
	Martin KaFai Lau <martin.lau@...ux.dev>, Eduard <eddyz87@...il.com>, Song Liu <song@...nel.org>, 
	Yonghong Song <yonghong.song@...ux.dev>, John Fastabend <john.fastabend@...il.com>, 
	KP Singh <kpsingh@...nel.org>, Stanislav Fomichev <sdf@...ichev.me>, Hao Luo <haoluo@...gle.com>, 
	Jiri Olsa <jolsa@...nel.org>, "David S. Miller" <davem@...emloft.net>, David Ahern <dsahern@...nel.org>, 
	Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, X86 ML <x86@...nel.org>, 
	"H. Peter Anvin" <hpa@...or.com>, jiang.biao@...ux.dev, bpf <bpf@...r.kernel.org>, 
	Network Development <netdev@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline

On Tue, Nov 4, 2025 at 5:30 PM Menglong Dong <menglong.dong@...ux.dev> wrote:
>
> On 2025/11/5 02:56, Alexei Starovoitov wrote:
> > On Tue, Nov 4, 2025 at 2:49 AM Menglong Dong <menglong8.dong@...il.com> wrote:
> > >
> > > In the origin call case, we skip the "rip" directly before we return, which
> > > breaks the RSB, as we have two "call"s but only one "ret".
> >
> > RSB meaning return stack buffer?
> >
> > and by "breaks RSB" you mean it makes the cpu less efficient?
>
> Yeah, I mean it makes the cpu less efficient. The RSB is used
> for branch prediction: it pushes the "rip" onto its hardware
> stack on "call" and pops it from the stack on "ret". In the origin
> call case, there are two "call"s but only one "ret", which breaks
> its balance.

Yes. I'm aware, but your "mov [rbp + 8], rax" screws it up as well,
since the RSB has to be updated/invalidated by this store.
The behavior depends on the microarchitecture, of course.
I think:
add rsp, 8
ret
will only screw up the return prediction, but won't invalidate the RSB.
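
To spell out the imbalance (rough sketch; labels and offsets are
illustrative, not the exact emitted code):

caller:      call func          # RSB push: return-to-caller
func+0:      call trampoline    # RSB push: func+5
trampoline:  ...
             call func+X        # origin call, balanced by its own ret
             ...
             add rsp, 8         # drop func+5 from the software stack
             ret                # predicted target is func+5, actual target
                                # is the caller; the RSB stays off by one
                                # entry from here on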

> Similar things happen in "return_to_handler" in ftrace_64.S,
> which has one "call" but two "ret"s. And it fakes a "call"
> to restore the balance.

This makes more sense to me. Let's try that approach instead
of messing with the return address on stack?
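
For reference, the trick there is roughly this (from memory of
arch/x86/kernel/ftrace_64.S; details differ across kernel versions):
an extra intra-function call pushes a balancing entry onto the RSB,
then that slot on the software stack is overwritten with the real
return target and ret'd through:

        call 1f              # push a balancing entry onto the RSB
1:      mov  %rdi, (%rsp)    # rdi holds the real return address here
        ret                  # this ret consumes the entry just pushed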

> I was wondering why the overhead of fexit is much higher
> than fentry. I added the percpu-ref-get-and-put stuff to the
> fentry, and its performance can still reach 130M/s. However,
> fexit only reaches 76M/s, and the only difference is the origin
> call.
>
> The RSB balancing mitigates it, but there is still a gap. I
> suspect it's still a branch prediction thing.

If you have access to the Intel VTune profiler, check what is
actually happening. It can show microarchitectural details.
I don't think there is an open source alternative.

> > Or you mean call depth accounting that is done in sw ?
> >
> > > Do the RSB balancing with a pseudo "ret": instead of skipping the "rip", we
> > > modify it to the address of a "ret" insn that we generate.
> > >
> > > The performance of "fexit" increases from 76M/s to 84M/s. Before this
> > > optimization, the bench results for fexit are:
> > >
> > > fexit          :   76.494 ± 0.216M/s
> > > fexit          :   76.319 ± 0.097M/s
> > > fexit          :   70.680 ± 0.060M/s
> > > fexit          :   75.509 ± 0.039M/s
> > > fexit          :   76.392 ± 0.049M/s
> > >
> > > After this optimization:
> > >
> > > fexit          :   86.023 ± 0.518M/s
> > > fexit          :   83.388 ± 0.021M/s
> > > fexit          :   85.146 ± 0.058M/s
> > > fexit          :   85.646 ± 0.136M/s
> > > fexit          :   84.040 ± 0.045M/s
> >
> > This is with or without calldepth accounting?
>
> CONFIG_MITIGATION_CALL_DEPTH_TRACKING is enabled, but
> I did the testing with "mitigations=off" on the cmdline, so I guess
> "without"?

Pls benchmark both. It sounds like call_depth_tracking
might be miscounting stuff?

> >
> > > Things become a little more complex, not sure if the benefits are worth it :/
> > >
> > > Signed-off-by: Menglong Dong <dongml2@...natelecom.cn>
> > > ---
> > >  arch/x86/net/bpf_jit_comp.c | 32 +++++++++++++++++++++++++++++---
> > >  1 file changed, 29 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > > index d4c93d9e73e4..a9c2142a84d0 100644
> > > --- a/arch/x86/net/bpf_jit_comp.c
> > > +++ b/arch/x86/net/bpf_jit_comp.c
> > > @@ -3185,6 +3185,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
> > >         struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
> > >         void *orig_call = func_addr;
> > >         u8 **branches = NULL;
> > > +       u8 *rsb_pos;
> > >         u8 *prog;
> > >         bool save_ret;
> > >
> > > @@ -3431,17 +3432,42 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
> > >                 LOAD_TRAMP_TAIL_CALL_CNT_PTR(stack_size);
> > >         }
> > >
> > > +       if (flags & BPF_TRAMP_F_SKIP_FRAME) {
> > > +               u64 ret_addr = (u64)(image + (prog - (u8 *)rw_image));
> > > +
> > > +               rsb_pos = prog;
> > > +               /*
> > > +                * reserve room for loading the return address into rax:
> > > +                *   movabs rax, imm64
> > > +                *
> > > +                * this is used to keep the RSB balanced. In the SKIP_FRAME
> > > +                * case, we do two "call"s but only one "ret", which can
> > > +                * break the RSB.
> > > +                *
> > > +                * Therefore, instead of skipping the "rip", we turn it into
> > > +                * a pseudo return: modify the "rip" on the stack to point at
> > > +                * the second "ret" that we build below.
> > > +                */
> > > +               emit_mov_imm64(&prog, BPF_REG_0, ret_addr >> 32, (u32)ret_addr);
> > > +               /* mov [rbp + 8], rax */
> > > +               EMIT4(0x48, 0x89, 0x45, 0x08);
> > > +       }
> > > +
> > >         /* restore return value of orig_call or fentry prog back into RAX */
> > >         if (save_ret)
> > >                 emit_ldx(&prog, BPF_DW, BPF_REG_0, BPF_REG_FP, -8);
> > >
> > >         emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, -rbx_off);
> > >         EMIT1(0xC9); /* leave */
> > > +       emit_return(&prog, image + (prog - (u8 *)rw_image));
> > >         if (flags & BPF_TRAMP_F_SKIP_FRAME) {
> > > -               /* skip our return address and return to parent */
> > > -               EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */
> > > +               u64 ret_addr = (u64)(image + (prog - (u8 *)rw_image));
> > > +
> > > +               /* fix the return address to the second return address */
> > > +               emit_mov_imm64(&rsb_pos, BPF_REG_0, ret_addr >> 32, (u32)ret_addr);
> >
> > So the first "movabs rax, imm64" is not needed?
> > Why compute ret_addr there and everything?
> > I mean it could have been prog += sizeof(movabs), right?
>
> I tried that before, but the thing is that "sizeof(movabs)" is not
> fixed, according to the definition of emit_mov_imm64():
>
> static void emit_mov_imm64(u8 **pprog, u32 dst_reg,
>                            const u32 imm32_hi, const u32 imm32_lo)
> {
>         u64 imm64 = ((u64)imm32_hi << 32) | (u32)imm32_lo;
>         u8 *prog = *pprog;
>
>         if (is_uimm32(imm64)) {
>                 /*
>                  * For emitting plain u32, where sign bit must not be
>                  * propagated LLVM tends to load imm64 over mov32
>                  * directly, so save couple of bytes by just doing
>                  * 'mov %eax, imm32' instead.
>                  */
>                 emit_mov_imm32(&prog, false, dst_reg, imm32_lo);
>         } else if (is_simm32(imm64)) {
>                 emit_mov_imm32(&prog, true, dst_reg, imm32_lo);
>         } else {
>                 /* movabsq rax, imm64 */
>                 EMIT2(add_1mod(0x48, dst_reg), add_1reg(0xB8, dst_reg));
>                 EMIT(imm32_lo, 4);
>                 EMIT(imm32_hi, 4);
>         }
>
>         *pprog = prog;
> }

This part could be factored out as a separate helper.
Then sizeof(movabsq) will be constant.
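
Something along these lines (untested sketch; emit_mov_imm64_fixed is
a made-up name, reusing the EMIT helpers from the function quoted above):

static void emit_mov_imm64_fixed(u8 **pprog, u32 dst_reg,
                                 const u32 imm32_hi, const u32 imm32_lo)
{
        u8 *prog = *pprog;

        /* always emit the 10-byte movabs form, so the length is
         * constant and the immediate can be patched in place later
         */
        EMIT2(add_1mod(0x48, dst_reg), add_1reg(0xB8, dst_reg));
        EMIT(imm32_lo, 4);
        EMIT(imm32_hi, 4);

        *pprog = prog;
}
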
Note, in the verifier we do:
#if defined(MODULES_VADDR)
                        u64 addr = MODULES_VADDR;
#else
                        u64 addr = VMALLOC_START;
#endif
                        /* jit (e.g. x86_64) may emit fewer instructions
                         * if it learns a u32 imm is the same as a u64 imm.
                         * Set close enough to possible prog address.
                         */
                        insn[0].imm = (u32)addr;
                        insn[1].imm = addr >> 32;

to mitigate this issue.
So you could have done:
emit_mov_imm64(&prog, BPF_REG_0, VMALLOC_START >> 32, 0);

since 'ret_addr' math is incorrect at that point anyway.
But prog += sizeof is imo cleaner.
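
i.e. something like this in the trampoline code (sketch only;
MOVABS_INSN_SIZE is a made-up constant for the 10-byte movabs
encoding, not an existing define, and the fixed-size emitter is the
sketch above):

        rsb_pos = prog;
        prog += MOVABS_INSN_SIZE;   /* reserve a fixed-size slot */
        /* mov [rbp + 8], rax */
        EMIT4(0x48, 0x89, 0x45, 0x08);
        /* ... rest of the epilogue emission ... */
        /* later, once ret_addr is known, patch the reserved slot */
        emit_mov_imm64_fixed(&rsb_pos, BPF_REG_0, ret_addr >> 32, (u32)ret_addr);
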
The whole thing might not be necessary with the extra-call approach.
I suspect it would be faster than this approach.
