Message-ID: <f999349e-accb-dcd6-75f4-eb36e0dda79f@amd.com>
Date: Mon, 21 Jul 2025 08:08:51 -0500
From: Tom Lendacky <thomas.lendacky@....com>
To: Kai Huang <kai.huang@...el.com>, dave.hansen@...el.com, bp@...en8.de,
tglx@...utronix.de, peterz@...radead.org, mingo@...hat.com, hpa@...or.com
Cc: x86@...nel.org, kas@...nel.org, rick.p.edgecombe@...el.com,
dwmw@...zon.co.uk, linux-kernel@...r.kernel.org, pbonzini@...hat.com,
seanjc@...gle.com, kvm@...r.kernel.org, reinette.chatre@...el.com,
isaku.yamahata@...el.com, dan.j.williams@...el.com, ashish.kalra@....com,
nik.borisov@...e.com, chao.gao@...el.com, sagis@...gle.com
Subject: Re: [PATCH v4 0/7] TDX host: kexec/kdump support
On 7/17/25 16:46, Kai Huang wrote:
> This series is the latest attempt to support kexec on TDX host following
> Dave's suggestion to use a percpu boolean to control WBINVD during
> kexec.
>
> Hi Boris/Tom,
>
> As requested, I added the first patch to clean up the last two 'unsigned
> int' parameters of relocate_kernel() by consolidating them into one
> 'unsigned int' flags parameter. Patch 2 (patch 1 in v3) has also been
> updated accordingly. Would you help review it? Thanks.
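Here is a minimal userspace sketch of what folding two trailing boolean-like parameters into one flags word looks like. The flag names (RELOC_FLAG_PRESERVE_CONTEXT, RELOC_FLAG_HOST_MEM_ENC) and the helper functions are hypothetical and not taken from the patch. The real relocate_kernel() is assembly in relocate_kernel_64.S and may name and order its bits differently.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical flag bits, for illustration only; the names and bit
 * positions used by the actual patch may differ.
 */
#define RELOC_FLAG_PRESERVE_CONTEXT (1U << 0)
#define RELOC_FLAG_HOST_MEM_ENC     (1U << 1)

/* Each former parameter must map to its own bit. */
static_assert((RELOC_FLAG_PRESERVE_CONTEXT & RELOC_FLAG_HOST_MEM_ENC) == 0,
	      "flag bits must not overlap");

/* Caller side: fold the two former 'unsigned int' arguments into one word. */
static unsigned int make_reloc_flags(bool preserve_context, bool host_mem_enc_active)
{
	unsigned int flags = 0;

	if (preserve_context)
		flags |= RELOC_FLAG_PRESERVE_CONTEXT;
	if (host_mem_enc_active)
		flags |= RELOC_FLAG_HOST_MEM_ENC;
	return flags;
}

/* Callee side: each former parameter becomes a single bit test. */
static bool reloc_preserve_context(unsigned int flags)
{
	return flags & RELOC_FLAG_PRESERVE_CONTEXT;
}

static bool reloc_host_mem_enc(unsigned int flags)
{
	return flags & RELOC_FLAG_HOST_MEM_ENC;
}
```

For example, make_reloc_flags(true, false) sets only the preserve-context bit. With a single flags word, a later change can add another bit without touching the function signature again.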
>
> I tested that both normal kexec and preserve_context kexec work (using
> tools/testing/selftests/kexec/test_kexec_jump.sh), but I don't have an
> SME-capable machine to test on.
>
> Hi Tom, I added your Reviewed-by and Tested-by to patch 2 anyway, since
> I believe the change is trivial and straightforward. But due to the
> cleanup patch, I would appreciate it if you could test the first two
> patches again. Thanks a lot!
Everything is working, thanks!
Tom
>
> v3 -> v4:
> - Rebase to latest tip/master.
> - Add a cleanup patch to consolidate relocate_kernel()'s last two
> function parameters -- Boris.
> - Address comments received -- please see individual patches.
> - Collect tags (Tom, Rick, binbin).
>
> v3: https://lore.kernel.org/kvm/cover.1750934177.git.kai.huang@intel.com/
>
> v2 -> v3 (all trivial changes):
>
> - Rebase on latest tip/master
> - Change to use __always_inline for do_seamcall() in patch 2
> - Update patch 2 (changelog and code comment) to remove the sentence
> which says "not all SEAMCALLs generate dirty cachelines of TDX
> private memory but just treat all of them do." -- Dave.
> - Add Farrah's Tested-by for all TDX patches.
>
> v2 had one informal RFC patch appended to show an optimization that
> moves WBINVD from the kexec phase to an earlier stage in KVM. Paolo
> commented on and acked that patch (thanks!), so v3 made it a formal
> patch (patch 6). Technically it is not strictly needed in this series
> and could be done in the future.
>
> More history info can be found in v2:
>
> https://lore.kernel.org/lkml/cover.1746874095.git.kai.huang@intel.com/
>
> === More information ===
>
> TDX private memory is memory that is encrypted with private Host Key IDs
> (HKIDs). If the kernel has ever enabled TDX, part of system memory
> remains TDX private memory when kexec happens. E.g., the PAMT (Physical
> Address Metadata Table) pages used by the TDX module to track each TDX
> memory page's state are never freed once the TDX module is initialized.
> TDX guests also have guest private memory and secure-EPT pages.
>
> After kexec, the new kernel has no knowledge of which memory pages were
> used as TDX private pages and can use all memory as regular memory.
>
> 1) Cache flush
>
> Per TDX 1.5 base spec "8.6.1. Platforms not Using ACT: Required Cache
> Flush and Initialization by the Host VMM", to support kexec for TDX the
> kernel needs to flush the cache to make sure there are no dirty
> cachelines of TDX private memory left over to the new kernel (when the
> TDX module reports TDX_FEATURES.CLFLUSH_BEFORE_ALLOC as 1 in the global
> metadata for the platform). The kernel also needs to make sure there is
> no more TDX activity (no SEAMCALL) after the cache flush so that no new
> dirty cachelines of TDX private memory are generated.
>
> SME has a similar requirement. SME kexec support uses WBINVD to do the
> cache flush. WBINVD is able to flush cachelines associated with any
> HKID, so this series reuses the WBINVD introduced for SME to flush the
> cache for TDX.
>
> Currently the kernel explicitly checks whether the hardware supports SME
> and only does WBINVD if it does. Instead of adding yet another
> TDX-specific check, this series uses a percpu boolean to indicate
> whether WBINVD is needed on that CPU during kexec.
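A minimal userspace model of the percpu-boolean approach follows. It assumes a fixed CPU count and uses a plain array in place of a real per-CPU variable. The names cache_incoherent, sme_mark_cpu(), seamcall_on() and kexec_stop_this_cpu() are hypothetical, and fake_wbinvd() only counts calls where real code would execute WBINVD. In this model, SME marks a CPU when memory encryption is active, a SEAMCALL marks the CPU that issues it, and the kexec path flushes only marked CPUs.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SIM 4 /* hypothetical CPU count for the model */

/* Stand-in for a per-CPU boolean: one slot per CPU. */
static bool cache_incoherent[NR_CPUS_SIM];

/* Counts how often each CPU "flushed"; real code executes WBINVD instead. */
static int wbinvd_count[NR_CPUS_SIM];

static void fake_wbinvd(int cpu)
{
	wbinvd_count[cpu]++;
}

/* SME: mark the CPU at bring-up if memory encryption is active. */
static void sme_mark_cpu(int cpu, bool sme_active)
{
	if (sme_active)
		cache_incoherent[cpu] = true;
}

/*
 * TDX: a SEAMCALL may leave dirty private cachelines on this CPU,
 * so mark the CPU before issuing the call.
 */
static void seamcall_on(int cpu)
{
	cache_incoherent[cpu] = true;
	/* ... the SEAMCALL itself would be issued here ... */
}

/* kexec stop path on each CPU: flush only if this CPU was marked. */
static void kexec_stop_this_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SIM);
	if (cache_incoherent[cpu])
		fake_wbinvd(cpu);
}
```

With this design, neither the SME check nor a TDX check lives in the kexec path itself; each feature only sets the flag. The ordering requirement from 1) (no SEAMCALL after the flush) still has to be ensured separately, by stopping TDX activity before the flush.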
>
> 2) Reset TDX private memory using MOVDIR64B
>
> The TDX spec (the aforementioned section) also suggests the kernel
> *should* use MOVDIR64B to clear TDX private pages before the kernel
> reuses them as regular memory.
>
> However, in reality the situation can be more flexible. Per TDX 1.5
> base spec ("Table 16.2: Non-ACT Platforms Checks on Memory Reads in Ci
> Mode" and "Table 16.3: Non-ACT Platforms Checks on Memory Reads in Li
> Mode"), reads/writes to TDX private memory using a shared KeyID without
> integrity checking enabled will neither poison the memory nor cause a
> machine check.
>
> Note that on platforms with ACT (Access Control Table), no integrity
> check is involved, so memory reads/writes using different KeyIDs cannot
> cause a machine check.
>
> KeyID 0 (the TME key) doesn't support integrity checking. This series
> therefore chooses NOT to reset TDX private memory but leaves it as-is
> for the new kernel. As explained above, in practice this is safe.
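The safety argument above can be condensed into a small decision function. This sketch encodes only the rules stated in this cover letter: ACT platforms never machine-check on cross-KeyID access, and on non-ACT platforms an access through a KeyID without integrity checking does not poison memory. It deliberately ignores the partial-write erratum covered in 5). Treating integrity-checked KeyIDs on non-ACT platforms as possibly machine-checking is a conservative assumption of the model, and all struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Per the cover letter, KeyID 0 (the TME key) has no integrity check. */
#define KEYID0_HAS_INTEGRITY false
static_assert(!KEYID0_HAS_INTEGRITY, "model assumes KeyID 0 lacks integrity checking");

struct access_ctx {
	bool platform_has_act;    /* ACT platform: no integrity check involved */
	bool keyid_has_integrity; /* KeyID used for the access checks integrity */
};

/*
 * May reading/writing TDX private memory through this KeyID machine-check?
 * Ignores the partial-write erratum (item 5).
 */
static bool cross_keyid_access_may_mce(struct access_ctx ctx)
{
	if (ctx.platform_has_act)
		return false;
	/* Non-ACT: stated safe without integrity; conservatively unsafe otherwise. */
	return ctx.keyid_has_integrity;
}

/* The new kernel touches leftover private pages through KeyID 0. */
static bool kexec_needs_private_mem_reset(bool platform_has_act)
{
	struct access_ctx next_kernel = {
		.platform_has_act = platform_has_act,
		.keyid_has_integrity = KEYID0_HAS_INTEGRITY,
	};

	return cross_keyid_access_may_mce(next_kernel);
}
```

Under these rules kexec_needs_private_mem_reset() returns false on both ACT and non-ACT platforms, which is the conclusion the series relies on.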
>
> 3) One limitation
>
> If the kernel has ever enabled TDX, the new kernel won't be able to use
> TDX after kexec. This is because when the new kernel tries to
> initialize the TDX module, the first SEAMCALL will fail because the
> module has already been initialized by the old kernel.
>
> More (non-trivial) work will be needed for the new kernel to use TDX.
> E.g., one solution is to reload the TDX module from the location where
> BIOS loads it (/boot/efi/EFI/TDX/). This series doesn't cover this and
> leaves it as future work.
>
> 4) Kdump support
>
> This series also enables kdump with TDX, but no special handling is
> needed for crash kexec (except turning on the Kconfig option):
>
> - The kdump kernel uses memory reserved by the old kernel as system RAM,
> and the old kernel never uses that reserved memory as TDX memory.
> - /proc/vmcore contains TDX private memory pages. Reading them is
> meaningless, but it does no harm either.
>
> 5) TDX "partial write machine check" erratum
>
> On platforms with the TDX erratum, a partial write (a write transaction
> of less than a cacheline landing at the memory controller) to TDX
> private memory poisons that memory, and a subsequent read triggers a
> machine check. On those platforms, the kernel needs to reset TDX private
> memory before jumping to the new kernel; otherwise the new kernel may
> see unexpected machine checks.
>
> The kernel currently doesn't track which pages are TDX private memory,
> so resetting TDX private memory is not trivial. For simplicity, this
> series disables kexec/kdump on such platforms. This can be enhanced in
> the future.
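Disabling kexec on affected platforms amounts to an early refusal before any image is prepared. The sketch below is hypothetical: struct platform and kexec_prepare_check() are illustrative names, the erratum flag stands in for whatever CPU bug detection the kernel performs at boot, and returning -EOPNOTSUPP is an assumption about the error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical platform description; the kernel would derive this from
 * CPU bug detection at boot.
 */
struct platform {
	bool tdx_pw_mce_erratum; /* partial write poisons TDX private memory */
};

/*
 * Refuse to prepare any kexec/kdump image on affected platforms instead of
 * trying to locate and reset TDX private pages.
 */
static int kexec_prepare_check(const struct platform *p)
{
	assert(p != NULL);
	if (p->tdx_pw_mce_erratum)
		return -EOPNOTSUPP;
	return 0;
}
```

Placing such a check where the image is loaded, as assumed here, surfaces the failure to whoever runs `kexec -l` or `kexec -p`, rather than at reboot or crash time.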
>
>
>
> Kai Huang (7):
> x86/kexec: Consolidate relocate_kernel() function parameters
> x86/sme: Use percpu boolean to control WBINVD during kexec
> x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
> x86/kexec: Disable kexec/kdump on platforms with TDX partial write
> erratum
> x86/virt/tdx: Remove the !KEXEC_CORE dependency
> x86/virt/tdx: Update the kexec section in the TDX documentation
> KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs
>
> Documentation/arch/x86/tdx.rst | 14 ++++-----
> arch/x86/Kconfig | 1 -
> arch/x86/include/asm/kexec.h | 12 ++++++--
> arch/x86/include/asm/processor.h | 2 ++
> arch/x86/include/asm/tdx.h | 31 +++++++++++++++++++-
> arch/x86/kernel/cpu/amd.c | 17 +++++++++++
> arch/x86/kernel/machine_kexec_64.c | 43 ++++++++++++++++++++++------
> arch/x86/kernel/process.c | 24 +++++++---------
> arch/x86/kernel/relocate_kernel_64.S | 30 +++++++++++--------
> arch/x86/kvm/vmx/tdx.c | 12 ++++++++
> arch/x86/virt/vmx/tdx/tdx.c | 16 +++++++++--
> 11 files changed, 155 insertions(+), 47 deletions(-)
>
>
> base-commit: e180b3a224cb519388c2f61ca7bc1eaf94cec1fb