Message-ID: <e9ee3c7ffc3ba6feb97247faa40789684e39ffd0.1741778537.git.kai.huang@intel.com>
Date: Thu, 13 Mar 2025 00:34:13 +1300
From: Kai Huang <kai.huang@...el.com>
To: dave.hansen@...el.com,
	bp@...en8.de,
	tglx@...utronix.de,
	peterz@...radead.org,
	mingo@...hat.com,
	kirill.shutemov@...ux.intel.com
Cc: hpa@...or.com,
	x86@...nel.org,
	linux-kernel@...r.kernel.org,
	pbonzini@...hat.com,
	seanjc@...gle.com,
	rick.p.edgecombe@...el.com,
	reinette.chatre@...el.com,
	isaku.yamahata@...el.com,
	dan.j.williams@...el.com,
	thomas.lendacky@....com,
	ashish.kalra@....com,
	dwmw@...zon.co.uk,
	bhe@...hat.com,
	nik.borisov@...e.com,
	sagis@...gle.com,
	Dave Young <dyoung@...hat.com>
Subject: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()

TL;DR:

Do unconditional WBINVD in stop_this_cpu() on bare metal to cover kexec
support for both AMD SME and Intel TDX.  There _was_ an issue that
prevented doing so, but it has since been fixed.

Long version:

AMD SME uses the C-bit to determine whether to encrypt the memory or
not.  For the same physical memory address, dirty cachelines with and
without the C-bit can coexist and the CPU can flush them back to memory
in random order.  To support kexec for SME, the old kernel uses WBINVD
to flush the cache before booting the new kernel so that no stale dirty
cachelines are left over by the old kernel which could otherwise corrupt
the new kernel's memory.

TDX uses 'KeyID' bits in the physical address for memory encryption and
has the same requirement.  To support kexec for TDX, the old kernel
needs to flush cache of TDX private memory.

Currently, the kernel only performs WBINVD in stop_this_cpu() when SME
is supported by hardware.  Perform unconditional WBINVD to support TDX
instead of adding one more vendor-specific check.  Kexec is a slow path,
and the additional WBINVD is acceptable for the sake of simplicity and
maintainability.

Only do WBINVD on bare-metal.  Doing WBINVD in a guest triggers an
unexpected exception (#VE or #VC) for TDX and SEV-ES/SEV-SNP guests, and
the guest may not be able to handle such an exception (e.g., a TDX guest
panics if it sees such a #VE).
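
As a rough illustration of the policy change described above, the old
and new conditions can be modeled as plain C predicates.  This is a
userspace sketch, not the kernel code: struct cpu_state and both
function names are hypothetical, and the fields merely stand in for
c->extended_cpuid_level, cpuid_eax(0x8000001f) and
boot_cpu_has(X86_FEATURE_HYPERVISOR).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the CPU state the check depends on. */
struct cpu_state {
	uint32_t extended_cpuid_level;	/* highest supported extended CPUID leaf */
	uint32_t leaf_8000001f_eax;	/* EAX of leaf 0x8000001f (bit 0: SME) */
	bool hypervisor;		/* true when running as a guest */
};

/* Old policy: WBINVD only when the hardware reports SME. */
static bool wbinvd_old_policy(const struct cpu_state *s)
{
	return s->extended_cpuid_level >= 0x8000001f &&
	       (s->leaf_8000001f_eax & 1u);
}

/* New policy: WBINVD on any bare-metal system, never in a guest. */
static bool wbinvd_new_policy(const struct cpu_state *s)
{
	return !s->hypervisor;
}
```

In this model a bare-metal system without SME is skipped by the old
predicate but flushed by the new one, which is the case TDX needs.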

History of SME and kexec WBINVD:

There _was_ an issue preventing doing unconditional WBINVD but that has
been fixed.

Initial SME kexec support added an unconditional WBINVD in commit

  bba4ed011a52: ("x86/mm, kexec: Allow kexec to be used with SME")

This commit caused various Intel systems to hang or reset.

Without a clear root cause, a later commit

  f23d74f6c66c: ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")

fixed the Intel system hang issues by only doing WBINVD when hardware
supports SME.

A corner case [*] revealed the root cause of the system hang issues and
was fixed by commit

  1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")

See [1][2] for more information.

Further testing of unconditional WBINVD, with the above fix applied, on
the problematic machines (where the issues were originally reported)
confirmed that the issues could no longer be reproduced.

See [3][4] for more information.

Therefore, it is safe to do unconditional WBINVD for bare-metal now.

[*] Commit f23d74f6c66c didn't check whether the CPUID leaf is available.
Reading an unsupported CPUID leaf on Intel returns garbage, resulting in
an unintended WBINVD which caused some issue (followed by the analysis
and the reveal of the final root cause).  The corner case was
independently fixed by commit

  9b040453d444: ("x86/smp: Dont access non-existing CPUID leaf")
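
The corner case can be sketched in isolation as follows.  This is a
simplified userspace model under stated assumptions, not the code from
either commit: the function names and the SME_LEAF macro are
hypothetical, and the leaf value is passed in as a parameter instead of
being read with CPUID.

```c
#include <stdbool.h>
#include <stdint.h>

#define SME_LEAF 0x8000001fu

/* Unguarded check: trusts whatever EAX value the leaf read returned. */
static bool sme_bit_unguarded(uint32_t leaf_eax)
{
	return leaf_eax & 1u;
}

/* Guarded check: consult the leaf only if the CPU reports supporting it. */
static bool sme_bit_guarded(uint32_t max_extended_leaf, uint32_t leaf_eax)
{
	return max_extended_leaf >= SME_LEAF && (leaf_eax & 1u);
}
```

With a garbage value that happens to have bit 0 set, the unguarded form
reports SME while the guarded form does not, as long as the maximum
extended leaf is below 0x8000001f.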

Link: https://lore.kernel.org/lkml/28a494ca-3173-4072-921c-6c5f5b257e79@amd.com/ [1]
Link: https://lore.kernel.org/lkml/24844584-8031-4b58-ba5c-f85ef2f4c718@amd.com/ [2]
Link: https://lore.kernel.org/lkml/20240221092856.GAZdXCWGJL7c9KLewv@fat_crate.local/ [3]
Link: https://lore.kernel.org/lkml/CALu+AoSZkq1kz-xjvHkkuJ3C71d0SM5ibEJurdgmkZqZvNp2dQ@mail.gmail.com/ [4]
Signed-off-by: Kai Huang <kai.huang@...el.com>
Suggested-by: Borislav Petkov <bp@...en8.de>
Cc: Tom Lendacky <thomas.lendacky@....com>
Cc: Dave Young <dyoung@...hat.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@....com>
---
 arch/x86/kernel/process.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 9c75d701011f..8475d9d2d8c4 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -819,18 +819,19 @@ void __noreturn stop_this_cpu(void *dummy)
 	mcheck_cpu_clear(c);
 
 	/*
-	 * Use wbinvd on processors that support SME. This provides support
-	 * for performing a successful kexec when going from SME inactive
-	 * to SME active (or vice-versa). The cache must be cleared so that
-	 * if there are entries with the same physical address, both with and
-	 * without the encryption bit, they don't race each other when flushed
-	 * and potentially end up with the wrong entry being committed to
-	 * memory.
+	 * Use wbinvd to support kexec for both SME (from inactive to active
+	 * or vice-versa) and TDX.  The cache must be cleared so that if there
+	 * are entries with the same physical address, both with and without
+	 * the encryption bit(s), they don't race each other when flushed and
+	 * potentially end up with the wrong entry being committed to memory.
 	 *
-	 * Test the CPUID bit directly because the machine might've cleared
-	 * X86_FEATURE_SME due to cmdline options.
+	 * stop_this_cpu() isn't a fast path, just do unconditional WBINVD for
+	 * bare-metal to cover both SME and TDX.  Do not do WBINVD in a guest
+	 * since performing one will result in an exception (#VE or #VC) for
+	 * TDX or SEV-ES/SEV-SNP guests which the guest may not be able to
+	 * handle (e.g., TDX guest panics if it sees #VE).
 	 */
-	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
 		wbinvd();
 
 	/*
-- 
2.48.1