Message-ID: <2388519.ElGaqSPkdT@7950hx>
Date: Thu, 06 Nov 2025 10:49:21 +0800
From: Menglong Dong <menglong.dong@...ux.dev>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
 Menglong Dong <menglong8.dong@...il.com>,
 Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>,
 Andrii Nakryiko <andrii@...nel.org>, Martin KaFai Lau <martin.lau@...ux.dev>,
 Eduard <eddyz87@...il.com>, Song Liu <song@...nel.org>,
 Yonghong Song <yonghong.song@...ux.dev>,
 John Fastabend <john.fastabend@...il.com>, KP Singh <kpsingh@...nel.org>,
 Stanislav Fomichev <sdf@...ichev.me>, Hao Luo <haoluo@...gle.com>,
 Jiri Olsa <jolsa@...nel.org>, "David S. Miller" <davem@...emloft.net>,
 David Ahern <dsahern@...nel.org>, Thomas Gleixner <tglx@...utronix.de>,
 Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
 Dave Hansen <dave.hansen@...ux.intel.com>, X86 ML <x86@...nel.org>,
 "H. Peter Anvin" <hpa@...or.com>, jiang.biao@...ux.dev,
 bpf <bpf@...r.kernel.org>, Network Development <netdev@...r.kernel.org>,
 LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline

On 2025/11/6 09:40, Menglong Dong wrote:
> On 2025/11/6 07:31, Alexei Starovoitov wrote:
> > On Tue, Nov 4, 2025 at 11:47 PM Menglong Dong <menglong.dong@...ux.dev> wrote:
> > >
> > > On 2025/11/5 15:13, Menglong Dong wrote:
> > > > On 2025/11/5 10:12, Alexei Starovoitov wrote:
> > > > > On Tue, Nov 4, 2025 at 5:30 PM Menglong Dong <menglong.dong@...ux.dev> wrote:
> > > > > >
> > > > > > On 2025/11/5 02:56, Alexei Starovoitov wrote:
> > > > > > > On Tue, Nov 4, 2025 at 2:49 AM Menglong Dong <menglong8.dong@...il.com> wrote:
> > > > > > > >
> > > > > > > > In the origin call case, we skip the "rip" directly before we return, which
> > > > > > > > breaks the RSB, as we have two "call"s but only one "ret".
> > > > > > >
> > > > > > > RSB meaning return stack buffer?
> > > > > > >
> > > > > > > and by "breaks RSB" you mean it makes the cpu less efficient?
> > > > > >
> > > > > > Yeah, I mean it makes the cpu less efficient. The RSB is used
> > > > > > for branch prediction: it pushes the "rip" onto its hardware
> > > > > > stack on "call" and pops it from the stack on "ret". In the origin
> > > > > > call case, there are two "call"s but only one "ret", which breaks
> > > > > > its balance.
> > > > >
> > > > > Yes. I'm aware, but your "mov [rbp + 8], rax" screws it up as well,
> > > > > since RSB has to be updated/invalidated by this store.
> > > > > The behavior depends on the microarchitecture, of course.
> > > > > I think:
> > > > > add rsp, 8
> > > > > ret
> > > > > will only screw up the return prediction, but won't invalidate RSB.
> > > > >
> > > > > > A similar thing happens in "return_to_handler" in ftrace_64.S,
> > > > > > which has one "call" but two "ret"s, and it pretends a "call"
> > > > > > to keep them balanced.
> > > > >
> > > > > This makes more sense to me. Let's try that approach instead
> > > > > of messing with the return address on stack?
> > > >
> > > > The approach here is similar to "return_to_handler". For ftrace,
> > > > the original stack before the "ret" of the traced function is:
> > > >
> > > >     POS:
> > > >     rip   ---> return_to_handler
> > > >
> > > > The exit of the traced function will jump to return_to_handler.
> > > > In return_to_handler, it queries the real "rip" of the traced function
> > > > and then calls an internal label:
> > > >
> > > >     call .Ldo_rop
> > > >
> > > > And the stack now is:
> > > >
> > > >     POS:
> > > >     rip   ---> the address after "call .Ldo_rop", which is an "int3"
> > > >
> > > > In .Ldo_rop, it modifies the rip to the real rip, so the stack
> > > > becomes:
> > > >
> > > >     POS:
> > > >     rip   ---> real rip
> > > >
> > > > And then it returns. Taking the target function "foo" as an example,
> > > > the logic is:
> > > >
> > > >     call foo -> call ftrace_caller -> return ftrace_caller ->
> > > >     return return_to_handler -> call .Ldo_rop -> return foo
> > > >
> > > > As you can see, the call and return address for ".Ldo_rop" are
> > > > also mismatched, so I think it works here too. Compared with
> > > > a messed-up return address, maybe a missed return prediction
> > > > has less impact?
> > > >
> > > > And the whole logic for us is:
> > > >
> > > >     call foo -> call trampoline -> call origin ->
> > > >     return origin -> return POS -> return foo
> > >
> > > The "return POS" will miss the RSB, but the later return
> > > will hit it.
> > >
> > > The original logic is:
> > >
> > >      call foo -> call trampoline -> call origin ->
> > >      return origin -> return foo
> > >
> > > The "return foo" and all the later return will miss the RBS.
> > >
> > > Hmm......Not sure if I understand it correctly.
> > 
> > Here's another idea...
> > hack tr->func.ftrace_managed = false temporarily
> > and use BPF_MOD_JUMP in bpf_arch_text_poke()
> > when installing a trampoline with fexit progs,
> > and also do:
> > @@ -3437,10 +3437,6 @@ static int __arch_prepare_bpf_trampoline(struct
> > bpf_tramp_image *im, void *rw_im
> > 
> >         emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, -rbx_off);
> >         EMIT1(0xC9); /* leave */
> > -       if (flags & BPF_TRAMP_F_SKIP_FRAME) {
> > -               /* skip our return address and return to parent */
> > -               EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */
> > -       }
> >         emit_return(&prog, image + (prog - (u8 *)rw_image));
> > 
> > Then RSB is perfectly matched without messing up the stack
> > or adding extra calls.
> > If it works and performance is good, the next step is to
> > teach ftrace to emit jmp or call in *_ftrace_direct().

After the modification, the performance of fexit increased from
76M/s to 137M/s, awesome!
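
For reference, a minimal stand-alone sketch of the balanced flow
(hypothetical labels, not the actual JIT output or ftrace code) looks
like this: the attach site jumps to the trampoline instead of calling
it, so every "call" has a matching "ret":

/* rsb_jmp_demo.S -- hypothetical sketch; build with: gcc -o demo rsb_jmp_demo.S */
	.text
	.globl	main
main:
	call	foo		/* call #1: pushes return address, RSB push */
	xorl	%eax, %eax
	ret			/* matches the call from the C runtime */

foo:
	jmp	trampoline	/* attach point: jmp, not call -> no RSB push */

trampoline:
	/* fentry progs would run here */
	call	foo_origin	/* call #2: RSB push */
	/* fexit progs would run here */
	ret			/* pops main's return address: RSB hit */

foo_origin:
	ret			/* pops the trampoline's return address: RSB hit */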

> 
> Good idea. I saw that "return_to_handler" used "JMP_NOSPEC", and
> the jmp was converted to a "fake call" to play nice with IBT in this commit:
> 
> e52fc2cf3f66 ("x86/ibt,ftrace: Make function-graph play nice")
> 
> It's not an indirect branch in our case, but let me do more testing to
> see if there are any unexpected effects if we use "jmp" here.
> 
> Thanks!
> Menglong Dong
> 