Message-ID: <cover.1741778537.git.kai.huang@intel.com>
Date: Thu, 13 Mar 2025 00:34:12 +1300
From: Kai Huang <kai.huang@...el.com>
To: dave.hansen@...el.com,
bp@...en8.de,
tglx@...utronix.de,
peterz@...radead.org,
mingo@...hat.com,
kirill.shutemov@...ux.intel.com
Cc: hpa@...or.com,
x86@...nel.org,
linux-kernel@...r.kernel.org,
pbonzini@...hat.com,
seanjc@...gle.com,
rick.p.edgecombe@...el.com,
reinette.chatre@...el.com,
isaku.yamahata@...el.com,
dan.j.williams@...el.com,
thomas.lendacky@....com,
ashish.kalra@....com,
dwmw@...zon.co.uk,
bhe@...hat.com,
nik.borisov@...e.com,
sagis@...gle.com
Subject: [RFC PATCH 0/5] TDX host: kexec/kdump support
Hi Dave,
This series is not ready for your review, but we want to move the
discussion external at this point. Please feel free to ignore it until
we get it a bit more polished.
-----------------------------------------------------------------------
TDX hosts do not currently support kexec. CONFIG_KEXEC_CORE must be
disabled in order to enable CONFIG_INTEL_TDX_HOST. This is not
acceptable to distros, at least Red Hat, since they want both turned on.
There are also other customers who want to use kexec together with TDX.
This series adds TDX host kexec support. With CONFIG_KEXEC_CORE and
CONFIG_INTEL_TDX_HOST both enabled, a TDX enabled kernel can kexec into
a new kernel and the kdump (crash kexec) can work as normal.
One limitation is that if the old kernel has ever enabled TDX, the new
kernel cannot use TDX. Lifting this limitation is future work.
One exception is that kexec/kdump is disabled when the platform has the
TDX "partial write machine check" erratum (and CONFIG_INTEL_TDX_HOST is
turned on). See below for more information.
This was supposed to be a v8, but I tagged this series as RFC because,
after the recent internal review, I feel there's one point regarding the
use of MOVDIR64B to reset TDX private memory that I want feedback on
from the list. Please see section "2) Reset TDX private memory using
MOVDIR64B" below for more information.
v7 -> this RFC:
The major change is, for the sake of keeping code change minimal, I
removed the patches which handle resetting TDX private memory to make
kexec work with the TDX erratum. Instead, add a patch to simply disable
kexec/kdump for such platforms.
v7: https://lore.kernel.org/lkml/cover.1727179214.git.kai.huang@intel.com/
=== More information ===
TDX private memory is memory that is encrypted with private Host Key IDs
(HKID). If the kernel has ever enabled TDX, part of system memory
remains TDX private memory when kexec happens. E.g., the PAMT (Physical
Address Metadata Table) pages used by the TDX module to track each TDX
memory page's state are never freed once the TDX module is initialized.
TDX guests also have guest private memory and secure-EPT pages.
After kexec, the new kernel has no knowledge of which memory pages were
used as TDX private pages and can use all memory as regular memory.
1) Cache flush
Per TDX 1.5 base spec "8.6.1.Platforms not Using ACT: Required Cache
Flush and Initialization by the Host VMM", to support kexec for TDX, the
kernel needs to flush the cache to make sure no dirty cachelines of TDX
private memory are left over for the new kernel (when the TDX module
reports TDX_FEATURES.CLFLUSH_BEFORE_ALLOC as 1 in the global metadata for
the platform). The kernel also needs to make sure there's no more TDX
activity (no SEAMCALL) after the cache flush so that no new dirty
cachelines of TDX private memory are generated.
SME has a similar requirement. SME kexec support uses WBINVD to do the
cache flush. WBINVD is able to flush cachelines associated with any
HKID. Reuse the WBINVD introduced by SME to flush cache for TDX.
Currently the kernel explicitly checks whether the hardware supports SME
and only does WBINVD if true. Instead of adding yet another
TDX-specific check, this series does an unconditional WBINVD on
bare-metal for code simplicity, since kexec is a slow path.
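The policy change described above can be sketched as two predicates.
This is only a user-space model, not the code in the patches: struct
cpu_caps and both function names are hypothetical, and treating
"bare-metal" as "not running under a hypervisor" is an assumption made
for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the CPU feature bits the kernel consults. */
struct cpu_caps {
	bool sme;        /* hardware supports SME */
	bool hypervisor; /* running as a guest, i.e. not bare-metal */
};

/* Policy before this series: flush only when SME is supported. */
static bool need_wbinvd_old(const struct cpu_caps *caps)
{
	assert(caps != NULL);
	return caps->sme;
}

/* Policy in this series: flush unconditionally on bare-metal. */
static bool need_wbinvd_new(const struct cpu_caps *caps)
{
	assert(caps != NULL);
	return !caps->hypervisor;
}
```

In this model, a bare-metal TDX host without SME gets no flush under the
old policy and gets one under the new policy, without any TDX-specific
check being added.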
2) Reset TDX private memory using MOVDIR64B
The TDX spec (the aforementioned section) also suggests the kernel
*should* use MOVDIR64B to clear a TDX private page before the kernel
reuses it as a regular one.
However, in reality the situation can be more flexible. Per TDX 1.5
base spec ("Table 16.2: Non-ACT Platforms Checks on Memory Reads in Ci
Mode" and "Table 16.3: Non-ACT Platforms Checks on Memory Reads in Li
Mode"), a read or write to TDX private memory using a shared KeyID
without integrity check enabled will not poison the memory or cause a
machine check.
Note that on platforms with ACT (Access Control Table), there's no
integrity check involved, so no machine check can happen due to memory
reads/writes using different KeyIDs.
Since it is not trivial to reset TDX private memory, this series assumes
KeyID 0 doesn't have integrity check enabled, and chooses to NOT reset
TDX private memory but leave TDX private memory as-is to the new kernel.
As mentioned above, in practice it is safe to do so.
The worst case is that someday Intel decides to enable integrity check
for KeyID 0 on some new platforms, and the impact is that old kernels
running on those platforms may get a machine check after kexec. But this
change will not happen silently. We can have a patch to reset TDX
private memory for those platforms and backport to stable. In the
meantime, we can enjoy the performance gain until that happens.
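For comparison, the reset that this series chooses to skip would amount
to overwriting each private range one full cacheline at a time. The
following is a rough user-space sketch, not kernel code: store_64b() and
reset_private_range() are hypothetical names, and memcpy() merely stands
in for the MOVDIR64B 64-byte direct store so that the example runs on
any machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHELINE_SIZE 64

/*
 * Stand-in for one MOVDIR64B: copy 64 bytes from src to dst.
 * memcpy() is an assumption for portability; it does not have the
 * direct-store semantics of the real instruction.
 */
static void store_64b(void *dst, const void *src)
{
	memcpy(dst, src, CACHELINE_SIZE);
}

/* Overwrite [base, base + size) with zeros, one cacheline per store. */
static void reset_private_range(void *base, size_t size)
{
	static const uint8_t zeros[CACHELINE_SIZE];
	uint8_t *p = base;
	size_t off;

	assert(base != NULL);
	/* Partial cachelines are not handled in this sketch. */
	assert(size % CACHELINE_SIZE == 0);

	for (off = 0; off < size; off += CACHELINE_SIZE)
		store_64b(p + off, zeros);
}
```

The loop itself is simple. The hard part, which the sketch does not
address, is knowing which ranges are TDX private memory in the first
place.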
3) One limitation
If the kernel has ever enabled TDX, after kexec the new kernel won't be
able to use TDX anymore. This is because when the new kernel tries to
initialize the TDX module it will fail on the first SEAMCALL, since the
module has already been initialized by the old kernel.
More (non-trivial) work will be needed for the new kernel to use TDX,
e.g., one solution is to just reload the TDX module from the location
where BIOS loads the TDX module (/boot/efi/EFI/TDX/). This series
doesn't cover this, but leaves it as future work.
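The limitation can be pictured with a toy model in which the module's
initialized state belongs to the platform rather than to the kernel, so
kexec does not reset it. Everything here is hypothetical (struct
tdx_platform, tdx_module_init() and its 0 / -1 return values); the real
SEAMCALL interface and its error codes are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* State held by the platform; it survives a kexec into a new kernel. */
struct tdx_platform {
	bool module_initialized;
};

/*
 * Hypothetical stand-in for the first initialization SEAMCALL.
 * Returns 0 on success, -1 if the module was already initialized.
 */
static int tdx_module_init(struct tdx_platform *plat)
{
	assert(plat != NULL);

	if (plat->module_initialized)
		return -1;

	plat->module_initialized = true;
	return 0;
}
```

Calling tdx_module_init() twice on the same struct tdx_platform mirrors
the old kernel succeeding and the new kernel failing at its first
attempt.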
4) Kdump support
This series also enables kdump with TDX, but no special handling is
needed for crash kexec (except turning on the Kconfig option):
 - The kdump kernel uses reserved memory from the old kernel as system
   RAM, and the old kernel will never use the reserved memory as TDX
   memory.
 - /proc/vmcore contains TDX private memory pages. It's meaningless to
   read them, but it doesn't do any harm either.
5) TDX "partial write machine check" erratum
On platforms with the TDX erratum, a partial write (a write transaction
of less than a cacheline landing at the memory controller) to TDX
private memory poisons that memory, and a subsequent read triggers a
machine check. On those platforms, the kernel needs to reset TDX private
memory before jumping to the new kernel; otherwise the new kernel may
see unexpected machine checks.
The kernel currently doesn't track which pages are TDX private memory,
so resetting TDX private memory is not trivial. For simplicity, this
series simply disables kexec/kdump for such platforms. This will be
enhanced in the future.
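The resulting rule can be written as a single predicate. This is a
minimal sketch under stated assumptions, not the patch itself: struct
platform and kexec_allowed() are hypothetical names, and the two
booleans stand in for the Kconfig option and the erratum detection the
kernel would actually use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the two conditions that matter here. */
struct platform {
	bool tdx_host_configured;   /* models CONFIG_INTEL_TDX_HOST=y */
	bool partial_write_erratum; /* platform has the TDX erratum */
};

/* kexec/kdump is refused only when both conditions hold. */
static bool kexec_allowed(const struct platform *p)
{
	assert(p != NULL);
	return !(p->tdx_host_configured && p->partial_write_erratum);
}
```

Note that in this model the check does not depend on whether TDX was
ever enabled at runtime, which matches the "for simplicity" choice
described above.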
Kai Huang (5):
  x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
  x86/kexec: Do unconditional WBINVD for bare-metal in relocate_kernel()
  x86/kexec: Disable kexec/kdump on platforms with TDX partial write
    erratum
  x86/virt/tdx: Remove the !KEXEC_CORE dependency
  x86/virt/tdx: Update the kexec section in the TDX documentation
 Documentation/arch/x86/tdx.rst       | 17 +++++++++-------
 arch/x86/Kconfig                     |  1 -
 arch/x86/include/asm/kexec.h         |  2 +-
 arch/x86/kernel/machine_kexec_64.c   | 30 ++++++++++++++++++++--------
 arch/x86/kernel/process.c            | 21 +++++++++----------
 arch/x86/kernel/relocate_kernel_64.S | 15 +++++++++-----
 6 files changed, 54 insertions(+), 32 deletions(-)
--
2.48.1