Message-ID: <CAADnVQJTOFjXe5=01KfOnBD86YU_Vy1YGezLQum3LnhFHAD+gg@mail.gmail.com>
Date: Tue, 4 Nov 2025 10:56:09 -0800
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Menglong Dong <menglong8.dong@...il.com>
Cc: Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>, 
	Andrii Nakryiko <andrii@...nel.org>, Martin KaFai Lau <martin.lau@...ux.dev>, Eduard <eddyz87@...il.com>, 
	Song Liu <song@...nel.org>, Yonghong Song <yonghong.song@...ux.dev>, 
	John Fastabend <john.fastabend@...il.com>, KP Singh <kpsingh@...nel.org>, 
	Stanislav Fomichev <sdf@...ichev.me>, Hao Luo <haoluo@...gle.com>, Jiri Olsa <jolsa@...nel.org>, 
	"David S. Miller" <davem@...emloft.net>, David Ahern <dsahern@...nel.org>, 
	Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, X86 ML <x86@...nel.org>, 
	"H. Peter Anvin" <hpa@...or.com>, jiang.biao@...ux.dev, 
	Menglong Dong <menglong.dong@...ux.dev>, bpf <bpf@...r.kernel.org>, 
	Network Development <netdev@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline

On Tue, Nov 4, 2025 at 2:49 AM Menglong Dong <menglong8.dong@...il.com> wrote:
>
> In the origin call case, we skip the "rip" directly before we return,
> which breaks the RSB: we have two "call"s but only one "ret".

RSB meaning return stack buffer?

and by "breaks RSB" you mean it makes the cpu less efficient?
Or you mean call depth accounting that is done in sw ?
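
(For reference, my reading of the imbalance being described -- an
illustrative sketch, not the exact emitted code:

    caller:      call func         ; RSB push: ret-to-caller
    func:        call trampoline   ; RSB push: ret-to-func+5
    trampoline:  ...
                 add rsp, 8        ; drop ret-to-func+5 from the stack
                 ret               ; pops ret-to-func+5 from the RSB,
                                   ; but actually jumps to ret-to-caller

After that "ret" the RSB still holds ret-to-caller while its stack slot
is already consumed, so later "ret"s in the caller chain are predicted
one frame off until the RSB realigns.)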

> Balance the RSB with a pseudo "ret". Instead of skipping the "rip", we
> modify it to point at a "ret" insn that we generate.
>
> The performance of "fexit" increases from 76M/s to 84M/s. Before this
> optimization, the bench results for fexit are:
>
> fexit          :   76.494 ± 0.216M/s
> fexit          :   76.319 ± 0.097M/s
> fexit          :   70.680 ± 0.060M/s
> fexit          :   75.509 ± 0.039M/s
> fexit          :   76.392 ± 0.049M/s
>
> After this optimization:
>
> fexit          :   86.023 ± 0.518M/s
> fexit          :   83.388 ± 0.021M/s
> fexit          :   85.146 ± 0.058M/s
> fexit          :   85.646 ± 0.136M/s
> fexit          :   84.040 ± 0.045M/s

This is with or without calldepth accounting?
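
(Assuming these numbers come from the BPF selftests bench tool; the
exact invocation would be worth stating, e.g. something like:

    cd tools/testing/selftests/bpf
    ./bench trig-fexit

and whether call depth tracking was enabled in the kernel config.)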

> Things become a little more complex; I'm not sure the benefit is worth it :/
>
> Signed-off-by: Menglong Dong <dongml2@...natelecom.cn>
> ---
>  arch/x86/net/bpf_jit_comp.c | 32 +++++++++++++++++++++++++++++---
>  1 file changed, 29 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index d4c93d9e73e4..a9c2142a84d0 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -3185,6 +3185,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
>         struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
>         void *orig_call = func_addr;
>         u8 **branches = NULL;
> +       u8 *rsb_pos;
>         u8 *prog;
>         bool save_ret;
>
> @@ -3431,17 +3432,42 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
>                 LOAD_TRAMP_TAIL_CALL_CNT_PTR(stack_size);
>         }
>
> +       if (flags & BPF_TRAMP_F_SKIP_FRAME) {
> +               u64 ret_addr = (u64)(image + (prog - (u8 *)rw_image));
> +
> +               rsb_pos = prog;
> +               /*
> +                * Reserve room to load the return address into rax:
> +                *   movabs rax, imm64
> +                *
> +                * This is used to balance the RSB. In the SKIP_FRAME
> +                * case, we do two "call"s but only one "ret", which
> +                * can break the RSB.
> +                *
> +                * Therefore, instead of skipping the "rip", we turn it
> +                * into a pseudo return: modify the "rip" on the stack
> +                * to point at the second "ret" that we build below.
> +                */
> +               emit_mov_imm64(&prog, BPF_REG_0, ret_addr >> 32, (u32)ret_addr);
> +               /* mov [rbp + 8], rax */
> +               EMIT4(0x48, 0x89, 0x45, 0x08);
> +       }
> +
>         /* restore return value of orig_call or fentry prog back into RAX */
>         if (save_ret)
>                 emit_ldx(&prog, BPF_DW, BPF_REG_0, BPF_REG_FP, -8);
>
>         emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, -rbx_off);
>         EMIT1(0xC9); /* leave */
> +       emit_return(&prog, image + (prog - (u8 *)rw_image));
>         if (flags & BPF_TRAMP_F_SKIP_FRAME) {
> -               /* skip our return address and return to parent */
> -               EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */
> +               u64 ret_addr = (u64)(image + (prog - (u8 *)rw_image));
> +
> +               /* patch the reserved movabs with the second return address */
> +               emit_mov_imm64(&rsb_pos, BPF_REG_0, ret_addr >> 32, (u32)ret_addr);

So the first "movabs rax, imm64" is not needed?
Why compute ret_addr there at all?
I mean, it could have been prog += sizeof(movabs), right?
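
Something like this untested sketch, assuming the patched-in encoding
always has the same fixed length (note emit_mov_imm64() can pick a
shorter form for sign-extendable immediates):

	if (flags & BPF_TRAMP_F_SKIP_FRAME) {
		rsb_pos = prog;
		/* reserve room for "movabs rax, imm64" (10 bytes);
		 * patched once the second "ret" address is known
		 */
		prog += 10;
		/* mov [rbp + 8], rax */
		EMIT4(0x48, 0x89, 0x45, 0x08);
	}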

> +               /* this is the second (real) return */
> +               emit_return(&prog, image + (prog - (u8 *)rw_image));
>         }
> -       emit_return(&prog, image + (prog - (u8 *)rw_image));
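
FWIW, the resulting epilogue as I understand the patch (sketch):

    movabs rax, ret_addr    ; patched in once ret_addr is known
    mov    [rbp + 8], rax   ; rewrite the saved rip (was ret-to-func+5)
    mov    rax, [rbp - 8]   ; restore the original return value
    mov    rbx, [rbp - rbx_off]
    leave
    ret                     ; pops the rewritten rip, jumps to ret_addr
  ret_addr:
    ret                     ; the real return, straight to func's caller

so each "call" is now paired with its own "ret".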
