Message-ID: <20250902104627.GM4068168@noisy.programming.kicks-ass.net>
Date: Tue, 2 Sep 2025 12:46:27 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: x86@...nel.org, "H. Peter Anvin" <hpa@...or.com>
Cc: kees@...nel.org, alyssa.milburn@...el.com, scott.d.constable@...el.com,
joao@...rdrivepizza.com, andrew.cooper3@...rix.com,
samitolvanen@...gle.com, nathan@...nel.org,
alexei.starovoitov@...il.com, mhiramat@...nel.org, ojeda@...nel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] x86,ibt: Use UDB instead of 0xEA
On Tue, Sep 02, 2025 at 10:19:15AM +0200, Peter Zijlstra wrote:
> Caller:
>
>   FineIBT                              Paranoid-FineIBT
>
> fineibt_caller:                        fineibt_caller:
>   mov     $0x12345678, %eax            mov     $0x12345678, %eax
>   lea     -0x10(%r11), %r11            cmp     -0x11(%r11), %eax
>   nop5                                 cs lea  -0x10(%r11), %r11
> retpoline:                             retpoline:
>   cs call __x86_indirect_thunk_r11     jne     fineibt_caller+0xd
>                                        call    *%r11
>                                        nop
>
> Notably this is before apply_retpolines(), which will fix up the
> retpoline call -- since all parts with IBT also have eIBRS (let's
> ignore ITS). Typically the retpoline site is rewritten (when still
> intact) into:
>
> call *r11
> nop3
>
> And now I'm going to have to do a patch that makes apply_retpoline()
> do CS padding instead of NOP padding for CALL...

Finding the exact prefix decode penalties for uarchs that have
eIBRS/BHI_NO is not a fun time. I've stuck to the general wisdom that 3
prefixes is mostly good (notably, the instruction at hand has no 0x0f
escape, which is sometimes counted towards the prefix budget -- it can
have a REX prefix, but those are generally not counted against the
budget).
In general Intel P-cores do not have prefix decode penalties, but the
E-cores (or rather the Atom line) generally do. And since this all
runs on hybrid parts, the code must accommodate both.
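
To make the rule concrete, here is a minimal user-space sketch of the
padding decision (illustrative only; the helper name and slot sizes are
made up, this is not the kernel code): stuff at most three CS (0x2e)
prefixes in front of the indirect CALL, otherwise fall back to NOP
padding.

	#include <stdio.h>

	/*
	 * Illustrative only: how many 0x2e (CS segment-override) prefixes
	 * to use when padding an indirect CALL out to a @slot byte
	 * retpoline site. @insn is the length of the bare CALL *%reg
	 * encoding (2 bytes, or 3 with a REX.B prefix). More than three
	 * prefixes risks a decode penalty on the Atom/E-core line, so pad
	 * with NOPs instead.
	 */
	static int cs_prefix_count(int slot, int insn)
	{
		int excess = slot - insn;

		if (excess < 0)
			return -1;			/* does not fit */
		return excess <= 3 ? excess : 0;	/* 0: use NOPs */
	}

	int main(void)
	{
		/* 6-byte "cs call __x86_indirect_thunk_r11" site, 3-byte "call *%r11" */
		printf("r11: %d CS prefixes\n", cs_prefix_count(6, 3));	/* 3 */
		/* same site with a 2-byte "call *%rax": 4 excess bytes -> NOPs */
		printf("rax: %d CS prefixes\n", cs_prefix_count(6, 2));	/* 0 */
		return 0;
	}
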
I hate all this.
---
Subject: x86,retpoline: Optimize patch_retpoline()
From: Peter Zijlstra <peterz@...radead.org>
Date: Tue Sep 2 11:20:35 CEST 2025
Currently the very common retpoline: "CS CALL __x86_indirect_thunk_r11"
is transformed into "CALL *R11; NOP3" for eIBRS/BHI_NO parts.
Similarly, paranoid fineibt has: "CALL *R11; NOP".
Recognise that CS stuffing can avoid the extra NOP. However, due to
prefix decode penalties, make sure to not emit too many CS prefixes.
Notably: "CS CALL __x86_indirect_thunk_rax" must not become "CS CS CS
CS CALL *RAX". Prefix decode penalties are typically many more cycles
than decoding an extra NOP.
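
For illustration, and assuming the usual encodings (CS prefix 0x2e,
"call *%r11" = 41 ff d3, 3-byte NOP = 0f 1f 00), the rewrite amounts to
roughly the following byte patterns; the thunk call's displacement
bytes are shown as zero placeholders:

	/* As emitted by the compiler: CS-prefixed call to the r11 thunk. */
	static const unsigned char thunk_call[6]  = { 0x2e, 0xe8, 0x00, 0x00, 0x00, 0x00 };

	/* Old rewrite for eIBRS/BHI_NO parts: CALL *%r11 plus a 3-byte NOP. */
	static const unsigned char old_rewrite[6] = { 0x41, 0xff, 0xd3,	/* call *%r11  */
						      0x0f, 0x1f, 0x00 };	/* nopl (%rax) */

	/* New rewrite: the excess bytes become CS prefixes instead. */
	static const unsigned char new_rewrite[6] = { 0x2e, 0x2e, 0x2e,	/* cs cs cs    */
						      0x41, 0xff, 0xd3 };	/* call *%r11  */

	/* Paranoid-FineIBT likewise: "call *%r11; nop" becomes "cs call *%r11". */
	static const unsigned char paranoid_old[4] = { 0x41, 0xff, 0xd3, 0x90 };
	static const unsigned char paranoid_new[4] = { 0x2e, 0x41, 0xff, 0xd3 };
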
Additionally, if the retpoline is a tail-call, the "JMP *%\reg" should
be followed by an INT3 for the straight-line-speculation mitigation.
Since emit_indirect() now has a length argument, move this into
emit_indirect() such that other users (paranoid-fineibt) also do this.
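
As a rough sketch of the tail-call case (assuming a 6-byte retpoline
site using %r11; any excess bytes after the INT3 are NOP1-padded by
patch_retpoline()):

	/* jmp *%r11; int3; then single-byte NOP padding */
	static const unsigned char jmp_rewrite[6] = { 0x41, 0xff, 0xe3,	/* jmp *%r11 */
						      0xcc,			/* int3      */
						      0x90, 0x90 };		/* nop; nop  */
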
Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
---
arch/x86/kernel/alternative.c | 40 +++++++++++++++++++++++++---------------
1 file changed, 25 insertions(+), 15 deletions(-)
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -715,18 +715,31 @@ static inline bool is_jcc32(struct insn
 /*
  * CALL/JMP *%\reg
  */
-static int emit_indirect(int op, int reg, u8 *bytes)
+static int emit_indirect(int op, int reg, u8 *bytes, int len)
 {
+	int cs = 0, bp = 0;
 	int i = 0;
 	u8 modrm;
 
+	/*
+	 * Set @len to the excess bytes after writing the instruction.
+	 */
+	len -= 2 + (reg >= 8);
+	WARN_ON_ONCE(len < 0);
+
 	switch (op) {
 	case CALL_INSN_OPCODE:
 		modrm = 0x10; /* Reg = 2; CALL r/m */
+		/*
+		 * Additional NOP is better than prefix decode penalty.
+		 */
+		if (len <= 3)
+			cs = len;
 		break;
 
 	case JMP32_INSN_OPCODE:
 		modrm = 0x20; /* Reg = 4; JMP r/m */
+		bp = !!len;
 		break;
 
 	default:
@@ -734,6 +747,9 @@ static int emit_indirect(int op, int reg
 		return -1;
 	}
 
+	while (cs--)
+		bytes[i++] = 0x2e; /* CS-prefix */
+
 	if (reg >= 8) {
 		bytes[i++] = 0x41; /* REX.B prefix */
 		reg -= 8;
@@ -745,6 +761,9 @@ static int emit_indirect(int op, int reg
 	bytes[i++] = 0xff; /* opcode */
 	bytes[i++] = modrm;
 
+	if (bp)
+		bytes[i++] = 0xcc; /* INT3 */
+
 	return i;
 }
@@ -918,20 +937,11 @@ static int patch_retpoline(void *addr, s
 		return emit_its_trampoline(addr, insn, reg, bytes);
 #endif
 
-	ret = emit_indirect(op, reg, bytes + i);
+	ret = emit_indirect(op, reg, bytes + i, insn->length - i);
 	if (ret < 0)
 		return ret;
 	i += ret;
 
-	/*
-	 * The compiler is supposed to EMIT an INT3 after every unconditional
-	 * JMP instruction due to AMD BTC. However, if the compiler is too old
-	 * or MITIGATION_SLS isn't enabled, we still need an INT3 after
-	 * indirect JMPs even on Intel.
-	 */
-	if (op == JMP32_INSN_OPCODE && i < insn->length)
-		bytes[i++] = INT3_INSN_OPCODE;
-
 	for (; i < insn->length;)
 		bytes[i++] = BYTES_NOP1;
@@ -1418,8 +1428,7 @@ asm( ".pushsection .rodata			\n"
 	"#fineibt_caller_size:				\n"
 	"	jne	fineibt_paranoid_start+0xd	\n"
 	"fineibt_paranoid_ind:				\n"
-	"	call	*%r11				\n"
-	"	nop					\n"
+	"	cs call	*%r11				\n"
 	"fineibt_paranoid_end:				\n"
 	".popsection					\n"
 );
@@ -1721,8 +1730,9 @@ static int cfi_rewrite_callers(s32 *star
 			emit_paranoid_trampoline(addr + fineibt_caller_size,
 						 &insn, 11, bytes + fineibt_caller_size);
 		} else {
-			ret = emit_indirect(op, 11, bytes + fineibt_paranoid_ind);
-			if (WARN_ON_ONCE(ret != 3))
+			int len = fineibt_paranoid_size - fineibt_paranoid_ind;
+			ret = emit_indirect(op, 11, bytes + fineibt_paranoid_ind, len);
+			if (WARN_ON_ONCE(ret != len))
 				continue;
 		}