Message-ID: <5029485.GXAFRqVoOG@7950hx>
Date: Wed, 05 Nov 2025 09:30:16 +0800
From: Menglong Dong <menglong.dong@...ux.dev>
To: Menglong Dong <menglong8.dong@...il.com>,
Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>,
Martin KaFai Lau <martin.lau@...ux.dev>, Eduard <eddyz87@...il.com>,
Song Liu <song@...nel.org>, Yonghong Song <yonghong.song@...ux.dev>,
John Fastabend <john.fastabend@...il.com>, KP Singh <kpsingh@...nel.org>,
Stanislav Fomichev <sdf@...ichev.me>, Hao Luo <haoluo@...gle.com>,
Jiri Olsa <jolsa@...nel.org>, "David S. Miller" <davem@...emloft.net>,
David Ahern <dsahern@...nel.org>, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, X86 ML <x86@...nel.org>,
"H. Peter Anvin" <hpa@...or.com>, jiang.biao@...ux.dev,
bpf <bpf@...r.kernel.org>, Network Development <netdev@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline
On 2025/11/5 02:56, Alexei Starovoitov wrote:
> On Tue, Nov 4, 2025 at 2:49 AM Menglong Dong <menglong8.dong@...il.com> wrote:
> >
> > In origin call case, we skip the "rip" directly before we return, which
> > break the RSB, as we have twice "call", but only once "ret".
>
> RSB meaning return stack buffer?
>
> and by "breaks RSB" you mean it makes the cpu less efficient?
Yeah, I mean it makes the cpu less efficient. The RSB is used
for branch prediction: it pushes the "rip" onto its hardware
stack on "call" and pops it on "ret". In the origin call case,
there are two "call"s but only one "ret", which breaks its
balance.
Something similar happens in "return_to_handler" in ftrace_64.S,
which has one "call" but two "ret"s, and it fakes a "call" to
keep things balanced.
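To make that concrete, here is how I picture the control flow in
both cases (a simplified sketch of my understanding, not the
exact generated code):

/*
 * origin call + SKIP_FRAME today: two "call", one "ret"
 *
 *   caller:      call func            RSB push
 *   func+0:      call trampoline      RSB push
 *   trampoline:  ... call orig, run progs ...
 *                leave
 *                add rsp, 8           drop our return address
 *                ret                  RSB pop -> back to func's caller
 *
 * with this patch: two "call", two "ret"
 *
 *   trampoline:  mov [rbp + 8], rax   rax = address of the 2nd ret
 *                leave
 *                ret                  RSB pop -> lands on the 2nd ret
 *   2nd ret:     ret                  RSB pop -> back to func's caller
 *
 * return_to_handler in ftrace_64.S has the opposite imbalance (one
 * "call", two "ret") and balances it by faking a call: it calls a
 * local label (RSB push), patches the return address on the stack,
 * and then rets.
 */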
I was wondering why the overhead of fexit is so much higher
than fentry. I added the percpu-ref-get-and-put stuff to
fentry, and its performance was still 130M/s. However, fexit
only reached 76M/s, and the only difference is the origin
call.
The RSB balancing mitigates it, but there is still a gap. I
suspect it's still a branch prediction thing.
> Or you mean call depth accounting that is done in sw ?
>
> > Do the RSB balance by pseudo a "ret". Instead of skipping the "rip", we
> > modify it to the address of a "ret" insn that we generate.
> >
> > The performance of "fexit" increases from 76M/s to 84M/s. Before this
> > optimize, the bench resulting of fexit is:
> >
> > fexit : 76.494 ± 0.216M/s
> > fexit : 76.319 ± 0.097M/s
> > fexit : 70.680 ± 0.060M/s
> > fexit : 75.509 ± 0.039M/s
> > fexit : 76.392 ± 0.049M/s
> >
> > After this optimize:
> >
> > fexit : 86.023 ± 0.518M/s
> > fexit : 83.388 ± 0.021M/s
> > fexit : 85.146 ± 0.058M/s
> > fexit : 85.646 ± 0.136M/s
> > fexit : 84.040 ± 0.045M/s
>
> This is with or without calldepth accounting?
CONFIG_MITIGATION_CALL_DEPTH_TRACKING is enabled, but I did
the testing with "mitigations=off" on the command line, so I
guess "without"?
>
> > Things become a little more complex, not sure if the benefits worth it :/
> >
> > Signed-off-by: Menglong Dong <dongml2@...natelecom.cn>
> > ---
> > arch/x86/net/bpf_jit_comp.c | 32 +++++++++++++++++++++++++++++---
> > 1 file changed, 29 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > index d4c93d9e73e4..a9c2142a84d0 100644
> > --- a/arch/x86/net/bpf_jit_comp.c
> > +++ b/arch/x86/net/bpf_jit_comp.c
> > @@ -3185,6 +3185,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
> > struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
> > void *orig_call = func_addr;
> > u8 **branches = NULL;
> > + u8 *rsb_pos;
> > u8 *prog;
> > bool save_ret;
> >
> > @@ -3431,17 +3432,42 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
> > LOAD_TRAMP_TAIL_CALL_CNT_PTR(stack_size);
> > }
> >
> > + if (flags & BPF_TRAMP_F_SKIP_FRAME) {
> > + u64 ret_addr = (u64)(image + (prog - (u8 *)rw_image));
> > +
> > + rsb_pos = prog;
> > + /*
> > + * reserve the room to save the return address to rax:
> > + * movabs rax, imm64
> > + *
> > + * this is used to do the RSB balance. For the SKIP_FRAME
> > + * case, we do the "call" twice, but only have one "ret",
> > + * which can break the RSB.
> > + *
> > + * Therefore, instead of skipping the "rip", we make it as
> > + * a pseudo return: modify the "rip" in the stack to the
> > + * second "ret" address that we build bellow.
> > + */
> > + emit_mov_imm64(&prog, BPF_REG_0, ret_addr >> 32, (u32)ret_addr);
> > + /* mov [rbp + 8], rax */
> > + EMIT4(0x48, 0x89, 0x45, 0x08);
> > + }
> > +
> > /* restore return value of orig_call or fentry prog back into RAX */
> > if (save_ret)
> > emit_ldx(&prog, BPF_DW, BPF_REG_0, BPF_REG_FP, -8);
> >
> > emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, -rbx_off);
> > EMIT1(0xC9); /* leave */
> > + emit_return(&prog, image + (prog - (u8 *)rw_image));
> > if (flags & BPF_TRAMP_F_SKIP_FRAME) {
> > - /* skip our return address and return to parent */
> > - EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */
> > + u64 ret_addr = (u64)(image + (prog - (u8 *)rw_image));
> > +
> > + /* fix the return address to second return address */
> > + emit_mov_imm64(&rsb_pos, BPF_REG_0, ret_addr >> 32, (u32)ret_addr);
>
> So the first "movabs rax, imm64" is not needed ?
> Why compute ret_addr there and everything ?
> I mean it could have been prog += sizeof(movabs), right?
I did it that way before, but the thing is that "sizeof(movabs)"
is not fixed, according to the definition of emit_mov_imm64():
static void emit_mov_imm64(u8 **pprog, u32 dst_reg,
                           const u32 imm32_hi, const u32 imm32_lo)
{
        u64 imm64 = ((u64)imm32_hi << 32) | (u32)imm32_lo;
        u8 *prog = *pprog;

        if (is_uimm32(imm64)) {
                /*
                 * For emitting plain u32, where sign bit must not be
                 * propagated LLVM tends to load imm64 over mov32
                 * directly, so save couple of bytes by just doing
                 * 'mov %eax, imm32' instead.
                 */
                emit_mov_imm32(&prog, false, dst_reg, imm32_lo);
        } else if (is_simm32(imm64)) {
                emit_mov_imm32(&prog, true, dst_reg, imm32_lo);
        } else {
                /* movabsq rax, imm64 */
                EMIT2(add_1mod(0x48, dst_reg), add_1reg(0xB8, dst_reg));
                EMIT(imm32_lo, 4);
                EMIT(imm32_hi, 4);
        }

        *pprog = prog;
}
I used "emit_mov_imm64(&prog, BPF_REG_0, 0, 0)" to take the placeholder,
but I failed, as the insn length is total different with
"emit_mov_imm64(&rsb_pos, BPF_REG_0, ret_addr >> 32, (u32)ret_addr);".
It's a little confusing here, I have not figure out a better way :/
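One idea, though I'm not sure it's any cleaner: a helper that
always emits the full 10-byte movabs, so the placeholder and the
later fix-up are guaranteed to have the same length. A rough,
untested sketch on top of the existing EMIT helpers in
bpf_jit_comp.c (emit_mov_imm64_fixed is just a made-up name):

/* always emit the 10-byte "movabs dst_reg, imm64", never a shorter form */
static void emit_mov_imm64_fixed(u8 **pprog, u32 dst_reg, const u64 imm64)
{
        u8 *prog = *pprog;

        /* REX.W + B8+rd imm64 */
        EMIT2(add_1mod(0x48, dst_reg), add_1reg(0xB8, dst_reg));
        EMIT((u32)imm64, 4);
        EMIT((u32)(imm64 >> 32), 4);

        *pprog = prog;
}

With that, the reservation at rsb_pos could be a plain
"prog += 10" (or Alexei's prog += sizeof(movabs)), and the later
patching would always match the reserved room. I haven't tried
it yet, though.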
Thanks!
Menglong Dong
>
> > + /* this is the second(real) return */
> > + emit_return(&prog, image + (prog - (u8 *)rw_image));
> > }
> > - emit_return(&prog, image + (prog - (u8 *)rw_image));
>