Message-ID: <20250902104627.GM4068168@noisy.programming.kicks-ass.net>
Date: Tue, 2 Sep 2025 12:46:27 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: x86@...nel.org, "H. Peter Anvin" <hpa@...or.com>
Cc: kees@...nel.org, alyssa.milburn@...el.com, scott.d.constable@...el.com,
joao@...rdrivepizza.com, andrew.cooper3@...rix.com,
samitolvanen@...gle.com, nathan@...nel.org,
alexei.starovoitov@...il.com, mhiramat@...nel.org, ojeda@...nel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] x86,ibt: Use UDB instead of 0xEA
On Tue, Sep 02, 2025 at 10:19:15AM +0200, Peter Zijlstra wrote:
> Caller:
>
>   FineIBT                              Paranoid-FineIBT
>
> fineibt_caller:                        fineibt_caller:
>   mov     $0x12345678, %eax            mov     $0x12345678, %eax
>   lea     -0x10(%r11), %r11            cmp     -0x11(%r11), %eax
>   nop5                                 cs lea  -0x10(%r11), %r11
> retpoline:                             retpoline:
>   cs call __x86_indirect_thunk_r11     jne     fineibt_caller+0xd
>                                        call    *%r11
>                                        nop
>
> Notably this is before apply_retpolines(), which will fix up the
> retpoline call -- since all parts with IBT also have eIBRS (let's
> ignore ITS). Typically the retpoline site is rewritten (when still
> intact) into:
>
> call *r11
> nop3
>
> And now I'm going to have to do a patch that makes apply_retpoline()
> do CS padding instead of NOP padding for CALL...

Finding the exact prefix decode penalties for uarchs that have
eIBRS/BHI_NO is not a fun time. I've stuck to the general wisdom that 3
prefixes is mostly good (notably, the instruction at hand has no 0x0f
escape, which is sometimes counted towards the prefix budget -- it can
have a REX prefix, but those are generally not counted against the
budget).
In general Intel P-cores do not have prefix decode penalties, but the
E-cores (or rather the Atom line) generally do. And since this all
runs on hybrid parts, the code must accommodate both.
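
To make the rule concrete, here is a minimal user-space sketch of the
padding decision (illustrative only; the helper name and slot sizes are
made up, this is not the kernel code): stuff at most three CS (0x2e)
prefixes in front of the indirect CALL, otherwise fall back to NOP
padding.

	#include <stdio.h>

	/*
	 * Illustrative only: how many 0x2e (CS segment-override) prefixes
	 * to use when padding an indirect CALL out to a @slot byte
	 * retpoline site. @insn is the length of the bare CALL *%reg
	 * encoding (2 bytes, or 3 with a REX.B prefix). More than three
	 * prefixes risks a decode penalty on the Atom/E-core line, so pad
	 * with NOPs instead.
	 */
	static int cs_prefix_count(int slot, int insn)
	{
		int excess = slot - insn;

		if (excess < 0)
			return -1;			/* does not fit */
		return excess <= 3 ? excess : 0;	/* 0: use NOPs */
	}

	int main(void)
	{
		/* 6-byte "cs call __x86_indirect_thunk_r11" site, 3-byte "call *%r11" */
		printf("r11: %d CS prefixes\n", cs_prefix_count(6, 3));	/* 3 */
		/* same site with a 2-byte "call *%rax": 4 excess bytes -> NOPs */
		printf("rax: %d CS prefixes\n", cs_prefix_count(6, 2));	/* 0 */
		return 0;
	}
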
I hate all this.
---
Subject: x86,retpoline: Optimize patch_retpoline()
From: Peter Zijlstra <peterz@...radead.org>
Date: Tue Sep 2 11:20:35 CEST 2025
Currently the very common retpoline: "CS CALL __x86_indirect_thunk_r11"
is transformed into "CALL *R11; NOP3" for eIBRS/BHI_NO parts.
Similarly, paranoid fineibt has: "CALL *R11; NOP".
Recognise that CS stuffing can avoid the extra NOP. However, due to
prefix decode penalties, make sure to not emit too many CS prefixes.
Notably: "CS CALL __x86_indirect_thunk_rax" must not become "CS CS CS
CS CALL *RAX". Prefix decode penalties are typically many more cycles
than decoding an extra NOP.
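
For illustration, and assuming the usual encodings (CS prefix 0x2e,
"call *%r11" = 41 ff d3, 3-byte NOP = 0f 1f 00), the rewrite amounts to
roughly the following byte patterns; the thunk call's displacement
bytes are shown as zero placeholders:

	/* As emitted by the compiler: CS-prefixed call to the r11 thunk. */
	static const unsigned char thunk_call[6]  = { 0x2e, 0xe8, 0x00, 0x00, 0x00, 0x00 };

	/* Old rewrite for eIBRS/BHI_NO parts: CALL *%r11 plus a 3-byte NOP. */
	static const unsigned char old_rewrite[6] = { 0x41, 0xff, 0xd3,	/* call *%r11  */
						      0x0f, 0x1f, 0x00 };	/* nopl (%rax) */

	/* New rewrite: the excess bytes become CS prefixes instead. */
	static const unsigned char new_rewrite[6] = { 0x2e, 0x2e, 0x2e,	/* cs cs cs    */
						      0x41, 0xff, 0xd3 };	/* call *%r11  */

	/* Paranoid-FineIBT likewise: "call *%r11; nop" becomes "cs call *%r11". */
	static const unsigned char paranoid_old[4] = { 0x41, 0xff, 0xd3, 0x90 };
	static const unsigned char paranoid_new[4] = { 0x2e, 0x41, 0xff, 0xd3 };
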
Additionally, if the retpoline is a tail-call, the "JMP *%\reg" should
be followed by an INT3 for the straight-line-speculation mitigation.
Since emit_indirect() now has a length argument, move this into
emit_indirect() such that other users (paranoid-fineibt) also do this.
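
As a rough sketch of the tail-call case (assuming a 6-byte retpoline
site using %r11; any excess bytes after the INT3 are NOP1-padded by
patch_retpoline()):

	/* jmp *%r11; int3; then single-byte NOP padding */
	static const unsigned char jmp_rewrite[6] = { 0x41, 0xff, 0xe3,	/* jmp *%r11 */
						      0xcc,			/* int3      */
						      0x90, 0x90 };		/* nop; nop  */
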
Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
---
arch/x86/kernel/alternative.c | 40 +++++++++++++++++++++++++---------------
1 file changed, 25 insertions(+), 15 deletions(-)
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -715,18 +715,31 @@ static inline bool is_jcc32(struct insn
 /*
  * CALL/JMP *%\reg
  */
-static int emit_indirect(int op, int reg, u8 *bytes)
+static int emit_indirect(int op, int reg, u8 *bytes, int len)
 {
+	int cs = 0, bp = 0;
 	int i = 0;
 	u8 modrm;
 
+	/*
+	 * Set @len to the excess bytes after writing the instruction.
+	 */
+	len -= 2 + (reg >= 8);
+	WARN_ON_ONCE(len < 0);
+
 	switch (op) {
 	case CALL_INSN_OPCODE:
 		modrm = 0x10; /* Reg = 2; CALL r/m */
+		/*
+		 * Additional NOP is better than prefix decode penalty.
+		 */
+		if (len <= 3)
+			cs = len;
 		break;
 
 	case JMP32_INSN_OPCODE:
 		modrm = 0x20; /* Reg = 4; JMP r/m */
+		bp = !!len;
 		break;
 
 	default:
@@ -734,6 +747,9 @@ static int emit_indirect(int op, int reg
 		return -1;
 	}
 
+	while (cs--)
+		bytes[i++] = 0x2e; /* CS-prefix */
+
 	if (reg >= 8) {
 		bytes[i++] = 0x41; /* REX.B prefix */
 		reg -= 8;
@@ -745,6 +761,9 @@ static int emit_indirect(int op, int reg
 	bytes[i++] = 0xff; /* opcode */
 	bytes[i++] = modrm;
 
+	if (bp)
+		bytes[i++] = 0xcc; /* INT3 */
+
 	return i;
 }
@@ -918,20 +937,11 @@ static int patch_retpoline(void *addr, s
 		return emit_its_trampoline(addr, insn, reg, bytes);
 #endif
 
-	ret = emit_indirect(op, reg, bytes + i);
+	ret = emit_indirect(op, reg, bytes + i, insn->length - i);
 	if (ret < 0)
 		return ret;
 	i += ret;
 
-	/*
-	 * The compiler is supposed to EMIT an INT3 after every unconditional
-	 * JMP instruction due to AMD BTC. However, if the compiler is too old
-	 * or MITIGATION_SLS isn't enabled, we still need an INT3 after
-	 * indirect JMPs even on Intel.
-	 */
-	if (op == JMP32_INSN_OPCODE && i < insn->length)
-		bytes[i++] = INT3_INSN_OPCODE;
-
 	for (; i < insn->length;)
 		bytes[i++] = BYTES_NOP1;
@@ -1418,8 +1428,7 @@ asm( ".pushsection .rodata			\n"
 	"#fineibt_caller_size:				\n"
 	"	jne	fineibt_paranoid_start+0xd	\n"
 	"fineibt_paranoid_ind:				\n"
-	"	call	*%r11				\n"
-	"	nop					\n"
+	"	cs call	*%r11				\n"
 	"fineibt_paranoid_end:				\n"
 	".popsection					\n"
 );
@@ -1721,8 +1730,9 @@ static int cfi_rewrite_callers(s32 *star
 			emit_paranoid_trampoline(addr + fineibt_caller_size,
 						 &insn, 11, bytes + fineibt_caller_size);
 		} else {
-			ret = emit_indirect(op, 11, bytes + fineibt_paranoid_ind);
-			if (WARN_ON_ONCE(ret != 3))
+			int len = fineibt_paranoid_size - fineibt_paranoid_ind;
+			ret = emit_indirect(op, 11, bytes + fineibt_paranoid_ind, len);
+			if (WARN_ON_ONCE(ret != len))
 				continue;
 		}