Message-ID: <8374a887-9bde-c7c0-ace2-0afe22f1f616@amd.com>
Date: Mon, 21 Jul 2025 09:50:47 -0500
From: Tom Lendacky <thomas.lendacky@....com>
To: Kai Huang <kai.huang@...el.com>, dave.hansen@...el.com, bp@...en8.de,
 tglx@...utronix.de, peterz@...radead.org, mingo@...hat.com, hpa@...or.com
Cc: x86@...nel.org, kas@...nel.org, rick.p.edgecombe@...el.com,
 dwmw@...zon.co.uk, linux-kernel@...r.kernel.org, pbonzini@...hat.com,
 seanjc@...gle.com, kvm@...r.kernel.org, reinette.chatre@...el.com,
 isaku.yamahata@...el.com, dan.j.williams@...el.com, ashish.kalra@....com,
 nik.borisov@...e.com, chao.gao@...el.com, sagis@...gle.com
Subject: Re: [PATCH v4 0/7] TDX host: kexec/kdump support

On 7/21/25 08:08, Tom Lendacky wrote:
> On 7/17/25 16:46, Kai Huang wrote:
>> This series is the latest attempt to support kexec on a TDX host,
>> following Dave's suggestion to use a percpu boolean to control WBINVD
>> during kexec.
>>
>> Hi Boris/Tom,
>>
>> As requested, I added a first patch that consolidates the last two
>> 'unsigned int' parameters of relocate_kernel() into a single 'unsigned
>> int' that carries flags.  Patch 2 (patch 1 in v3) is updated
>> accordingly.  Would you help review it?  Thanks.
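
A minimal userspace sketch of the consolidation described above: two
boolean-like 'unsigned int' arguments folded into one flags word.  The
flag names, bit positions and helpers below are hypothetical, chosen for
illustration only; this sketch assumes the two former parameters carried
"preserve context" and "host memory encryption active" indications.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; names and values are assumptions of this sketch. */
#define RELOC_FLAG_PRESERVE_CONTEXT	(1u << 0)
#define RELOC_FLAG_HOST_MEM_ENC		(1u << 1)
#define RELOC_FLAG_MASK	(RELOC_FLAG_PRESERVE_CONTEXT | RELOC_FLAG_HOST_MEM_ENC)

/* Pack the two former parameters into a single flags word. */
static unsigned int reloc_pack_flags(bool preserve_context, bool host_mem_enc)
{
	unsigned int flags = 0;

	if (preserve_context)
		flags |= RELOC_FLAG_PRESERVE_CONTEXT;
	if (host_mem_enc)
		flags |= RELOC_FLAG_HOST_MEM_ENC;
	return flags;
}

/* Decode helpers; each one tests exactly one bit of the flags word. */
static bool reloc_wants_preserve_context(unsigned int flags)
{
	assert((flags & ~RELOC_FLAG_MASK) == 0);	/* reject unknown bits */
	return (flags & RELOC_FLAG_PRESERVE_CONTEXT) != 0;
}

static bool reloc_host_mem_enc_active(unsigned int flags)
{
	assert((flags & ~RELOC_FLAG_MASK) == 0);	/* reject unknown bits */
	return (flags & RELOC_FLAG_HOST_MEM_ENC) != 0;
}
```

With a layout like this, a test run that never requests context
preservation never sets that bit, so code paths that depend on it are not
exercised.
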
>>
>> I tested that both normal kexec and preserve_context kexec work (using
>> tools/testing/selftests/kexec/test_kexec_jump.sh), but I don't have an
>> SME-capable machine to test on.
>>
>> Hi Tom, I added your Reviewed-by and Tested-by to patch 2 anyway,
>> since I believe the change is trivial and straightforward.  But because
>> of the cleanup patch, I would appreciate it if you could test the first
>> two patches again.  Thanks a lot!
> 
> Everything is working, Thanks!

See my comments in patch #1. I didn't test with context preservation, so
that bit was never set. If it had been, I think things would have failed.

Thanks,
Tom

> 
> Tom
> 
>>
>> v3 -> v4:
>>  - Rebase to latest tip/master.
>>  - Add a cleanup patch to consolidate relocate_kernel()'s last two
>>    function parameters -- Boris.
>>  - Address comments received -- please see individual patches.
>>  - Collect tags (Tom, Rick, binbin).
>>
>>  v3: https://lore.kernel.org/kvm/cover.1750934177.git.kai.huang@intel.com/
>>
>> v2 -> v3 (all trivial changes):
>>
>>  - Rebase on latest tip/master
>>    - change to use __always_inline for do_seamcall() in patch 2
>>  - Update patch 2 (changelog and code comment) to remove the sentence
>>    which says "not all SEAMCALLs generate dirty cachelines of TDX
>>    private memory but just treat all of them do."  -- Dave.
>>  - Add Farrah's Tested-by for all TDX patches.
>>
>> v2 had one informal RFC patch appended to show "some optimization"
>> which can move WBINVD from the kexec phase to an earlier stage in KVM.
>> Paolo commented on and Acked that patch (thanks!), so v3 made it a
>> formal patch (patch 6).  Technically it is not strictly needed in this
>> series and could be done in the future.
>>
>> More history info can be found in v2:
>>
>>  https://lore.kernel.org/lkml/cover.1746874095.git.kai.huang@intel.com/
>>
>> === More information ===
>>
>> TDX private memory is memory that is encrypted with private Host Key IDs
>> (HKID).  If the kernel has ever enabled TDX, part of system memory
>> remains TDX private memory when kexec happens.  E.g., the PAMT (Physical
>> Address Metadata Table) pages used by the TDX module to track each TDX
>> memory page's state are never freed once the TDX module is initialized.
>> TDX guests also have guest private memory and secure-EPT pages.
>>
>> After kexec, the new kernel has no knowledge of which memory pages
>> were used as TDX private pages and can use all memory as regular memory.
>>
>> 1) Cache flush
>>
>> Per TDX 1.5 base spec "8.6.1. Platforms not Using ACT: Required Cache
>> Flush and Initialization by the Host VMM", to support kexec for TDX, the
>> kernel needs to flush the cache to make sure no dirty cachelines of TDX
>> private memory are left over for the new kernel (when the TDX module
>> reports TDX_FEATURES.CLFLUSH_BEFORE_ALLOC as 1 in the global metadata for
>> the platform).  The kernel also needs to make sure there is no more TDX
>> activity (no SEAMCALL) after the cache flush, so that no new dirty
>> cachelines of TDX private memory are generated.
>>
>> SME has a similar requirement.  SME kexec support uses WBINVD to do the
>> cache flush.  WBINVD is able to flush cachelines associated with any
>> HKID.  Reuse the WBINVD introduced by SME to flush the cache for TDX.
>>
>> Currently the kernel explicitly checks whether the hardware supports SME
>> and only does WBINVD if true.  Instead of adding yet another TDX
>> specific check, this series uses a percpu boolean to indicate whether
>> WBINVD is needed on that CPU during kexec.
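
The per-CPU boolean idea can be modelled in userspace roughly as follows.
This is only a sketch under stated assumptions: a plain array stands in
for a real percpu variable, WBINVD (a privileged instruction) is replaced
by a counter, and every identifier here is hypothetical rather than taken
from the patches.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4	/* small fixed CPU count, for the sketch only */

/* Stand-in for a percpu variable: one "cache may be incoherent" flag per CPU. */
static bool cache_state_incoherent[NR_CPUS];
static unsigned int wbinvd_count;	/* number of simulated WBINVDs */

/*
 * Mark a CPU as possibly holding dirty cachelines of encrypted memory,
 * e.g. because SME is active or the CPU is about to make a SEAMCALL.
 */
static void mark_cache_incoherent(unsigned int cpu)
{
	assert(cpu < NR_CPUS);
	cache_state_incoherent[cpu] = true;
}

/* Stand-in for executing WBINVD on the given CPU; also clears its flag. */
static void simulated_wbinvd(unsigned int cpu)
{
	assert(cpu < NR_CPUS);
	wbinvd_count++;
	cache_state_incoherent[cpu] = false;
}

/* Kexec-time hook for one CPU: flush only if that CPU's flag is set. */
static void kexec_flush_cpu(unsigned int cpu)
{
	assert(cpu < NR_CPUS);
	if (cache_state_incoherent[cpu])
		simulated_wbinvd(cpu);
}
```

The point of the design is that the kexec path no longer asks "is this an
SME machine?" or "is this a TDX machine?"; each feature sets the flag
itself, and the flush decision is made per CPU from that single flag.
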
>>
>> 2) Reset TDX private memory using MOVDIR64B
>>
>> The TDX spec (the aforementioned section) also suggests the kernel
>> *should* use MOVDIR64B to clear a TDX private page before reusing it
>> as a regular one.
>>
>> However, in reality the situation can be more flexible.  Per TDX 1.5
>> base spec ("Table 16.2: Non-ACT Platforms Checks on Memory Reads in Ci
>> Mode" and "Table 16.3: Non-ACT Platforms Checks on Memory Reads in Li
>> Mode"), a read or write to TDX private memory using a shared KeyID
>> without integrity checking enabled will not poison the memory or cause
>> a machine check.
>>
>> Note that on platforms with ACT (Access Control Table), no integrity
>> check is involved, so no machine check can happen due to memory
>> reads/writes using different KeyIDs.
>>
>> KeyID 0 (TME key) doesn't support integrity checking.  This series
>> chooses NOT to reset TDX private memory but to leave it as-is for the
>> new kernel.  As mentioned above, in practice it is safe to do so.
>>
>> 3) One limitation
>>
>> If the kernel has ever enabled TDX, after kexec the new kernel won't be
>> able to use TDX anymore.  When the new kernel tries to initialize the
>> TDX module, it fails on the first SEAMCALL because the module has
>> already been initialized by the old kernel.
>>
>> More (non-trivial) work will be needed for the new kernel to use TDX,
>> e.g., one solution is to just reload the TDX module from the location
>> where BIOS loads the TDX module (/boot/efi/EFI/TDX/).  This series
>> doesn't cover this, but leaves it as future work.
>>
>> 4) Kdump support
>>
>> This series also enables kdump with TDX, but no special handling is
>> needed for crash kexec (except turning on the Kconfig option):
>>
>>  - kdump kernel uses reserved memory from the old kernel as system RAM,
>>    and the old kernel will never use the reserved memory as TDX memory.
>>  - /proc/vmcore contains TDX private memory pages.  It's meaningless to
>>    read them, but it doesn't do any harm either.
>>
>> 5) TDX "partial write machine check" erratum
>>
>> On platforms with the TDX erratum, a partial write (a write transaction
>> of less than a cacheline landing at the memory controller) to TDX
>> private memory poisons that memory, and a subsequent read triggers a
>> machine check.  On those platforms, the kernel needs to reset TDX
>> private memory before jumping to the new kernel; otherwise the new
>> kernel may see unexpected machine checks.
>>
>> The kernel currently doesn't track which pages are TDX private memory,
>> so resetting TDX private memory is not trivial.  For simplicity, this
>> series simply disables kexec/kdump on such platforms.  This can be
>> enhanced in the future.
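
The "disable kexec/kdump on affected platforms" policy amounts to an early
rejection when a kexec image is loaded.  A rough userspace sketch, assuming
a hypothetical platform descriptor and assuming -EOPNOTSUPP as the error
code (neither is confirmed by the text above):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical platform description; the kernel detects this differently. */
struct platform {
	bool tdx_partial_write_erratum;
};

/*
 * Decide whether loading a kexec/kdump image is allowed.  On platforms
 * with the erratum, TDX private memory would need a reset that is not
 * done here, so the load is refused outright.
 */
static int kexec_load_check(const struct platform *p)
{
	assert(p != NULL);
	if (p->tdx_partial_write_erratum)
		return -EOPNOTSUPP;	/* error code is an assumption */
	return 0;
}
```

Refusing at load time keeps the failure visible to the administrator up
front, instead of surfacing later as a machine check in the new kernel.
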
>>
>>
>>
>> Kai Huang (7):
>>   x86/kexec: Consolidate relocate_kernel() function parameters
>>   x86/sme: Use percpu boolean to control WBINVD during kexec
>>   x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
>>   x86/kexec: Disable kexec/kdump on platforms with TDX partial write
>>     erratum
>>   x86/virt/tdx: Remove the !KEXEC_CORE dependency
>>   x86/virt/tdx: Update the kexec section in the TDX documentation
>>   KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs
>>
>>  Documentation/arch/x86/tdx.rst       | 14 ++++-----
>>  arch/x86/Kconfig                     |  1 -
>>  arch/x86/include/asm/kexec.h         | 12 ++++++--
>>  arch/x86/include/asm/processor.h     |  2 ++
>>  arch/x86/include/asm/tdx.h           | 31 +++++++++++++++++++-
>>  arch/x86/kernel/cpu/amd.c            | 17 +++++++++++
>>  arch/x86/kernel/machine_kexec_64.c   | 43 ++++++++++++++++++++++------
>>  arch/x86/kernel/process.c            | 24 +++++++---------
>>  arch/x86/kernel/relocate_kernel_64.S | 30 +++++++++++--------
>>  arch/x86/kvm/vmx/tdx.c               | 12 ++++++++
>>  arch/x86/virt/vmx/tdx/tdx.c          | 16 +++++++++--
>>  11 files changed, 155 insertions(+), 47 deletions(-)
>>
>>
>> base-commit: e180b3a224cb519388c2f61ca7bc1eaf94cec1fb
