Message-ID: <20250807093950.4395-1-yan.y.zhao@intel.com>
Date: Thu, 7 Aug 2025 17:39:50 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: pbonzini@...hat.com,
seanjc@...gle.com
Cc: linux-kernel@...r.kernel.org,
kvm@...r.kernel.org,
x86@...nel.org,
rick.p.edgecombe@...el.com,
dave.hansen@...el.com,
kas@...nel.org,
tabba@...gle.com,
ackerleytng@...gle.com,
quic_eberman@...cinc.com,
michael.roth@....com,
david@...hat.com,
vannapurve@...gle.com,
vbabka@...e.cz,
thomas.lendacky@....com,
pgonda@...gle.com,
zhiquan1.li@...el.com,
fan.du@...el.com,
jun.miao@...el.com,
ira.weiny@...el.com,
isaku.yamahata@...el.com,
xiaoyao.li@...el.com,
binbin.wu@...ux.intel.com,
chao.p.peng@...el.com,
yan.y.zhao@...el.com
Subject: [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory
Hi,
This is the RFC v2 series to support huge pages in TDX, currently
focusing on 2MB huge pages only (code available at [2]).
The main goal of RFC v2 is to get the TDX side of the design ready for
feedback.
Tip folks, there are some SEAMCALL wrapper changes in this series, but we
still need some discussion on the KVM side to figure out what it needs.
Please feel free to ignore them for now.
4 opens in RFC v1
====
Open 1: How to pass guest's ACCEPT level info.
The TDX guest can affect KVM's mapping level of a fault due to the
current implementation of the guest ACCEPT operation in the TDX
module. If KVM maps at a higher level than the guest's ACCEPT
level, repeated faults are generated, prompting KVM to split the
huge mapping to match the ACCEPT level. To prevent these repeated
faults and subsequent splitting (at least for Linux TDs), it's
efficient to pass the guest's ACCEPT level info to the KVM MMU core
to restrict the mapping level of a fault.
Prior to RFC v1, this info was conveyed by modifying each fault's
error code [10][11].
In RFC v1, the info was passed through the hook
private_max_mapping_level, which checks a per-vCPU
violation_request_level inside TDX [12].
In RFC v2, Sean and Rick suggested a global way [17] (vs a
fault-by-fault, per-vCPU way) to specify the info. This involves
setting the GUEST_INHIBIT bit of disallow_lpage in a slot's
lpage_info for all levels above the specified guest ACCEPT level,
indicating that the guest wants to prevent a specific GFN from being
mapped at those higher levels (patches 13 and 14). This approach helps
prevent potential subtle bugs that could arise from different
ACCEPT levels specified by different vCPUs.
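For illustration, below is a minimal sketch of such a setter, modeled
on KVM's existing KVM_LPAGE_MIXED_FLAG handling. The bit position and
the body are assumptions for illustration, not the exact patch
contents:

  /*
   * Illustrative only: a GUEST_INHIBIT flag in disallow_lpage,
   * analogous to KVM's existing KVM_LPAGE_MIXED_FLAG.  The bit
   * position is an assumption.
   */
  #define KVM_LPAGE_GUEST_INHIBIT_FLAG	BIT(30)

  /*
   * Prevent @gfn from being mapped at @level.  Called under the write
   * mmu_lock for each level above the guest's ACCEPT level, so the
   * KVM MMU core never maps the GFN above that level.
   */
  static void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot,
  					 gfn_t gfn, int level)
  {
  	struct kvm_lpage_info *linfo = lpage_info_slot(gfn, slot, level);

  	linfo->disallow_lpage |= KVM_LPAGE_GUEST_INHIBIT_FLAG;
  }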
Open 2: Do we need to support huge pages on non-Linux TDs and splitting
in the fault path.
The current TDX module requires tdh_mem_range_block() to be invoked
before each tdh_mem_page_demote(). This leads to complexity in
handling BUSY errors under a shared mmu_lock: if a BUSY error is
returned from tdh_mem_page_demote(), TDX must call
tdh_mem_range_unblock() (which may itself return a BUSY error) before
passing the error to the KVM MMU core.
For Linux TDs, guest ACCEPT operations usually occur before KVM
populates the mappings, except when KVM pre-faults certain
mappings. Therefore, to avoid splitting in the fault path for Linux
TDs, RFC v1 required prefetch faults to be mapped at 4KB and KVM's
mapping level to honor the guest's ACCEPT level.
For non-Linux TDs, where guest ACCEPT operations may occur after
KVM has populated mappings, KVM's mapping level has no way to honor
the not-yet-specified guest's ACCEPT level. If the guest later
accepts at a level smaller than KVM's mapping level, splitting in
the fault path may be necessary. In RFC v1, splitting in the fault
path was ignored, with the expectation that the guest would never
accept at a lower level than KVM's mapping level. This expectation
was later proven incorrect. Forcing KVM's mappings to 4KB before the
guest's ACCEPT level is available would disable huge pages for
non-Linux TDs, as the TD's later ACCEPT operations at a higher level
would return the error TDX_PAGE_SIZE_MISMATCH to the guest.
In RFC v2, the fault path splitting issue is addressed by having
TDX's EPT violation handler:
(1) explicitly split any previous huge mapping under the write mmu_lock,
(2) set the GUEST_INHIBIT bit in a slot's lpage_info above the
specified guest ACCEPT level,
before invoking the KVM MMU core's kvm_mmu_page_fault().
This mechanism allows supporting huge pages for non-Linux TDs and also
removes RFC v1's 4KB restriction on pre-fault mappings for Linux TDs.
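In pseudo-C, the RFC v2 flow is roughly the following. This is a
sketch only: vmexit_carries_accept_level() and the exact signature of
kvm_split_cross_boundary_leafs() are illustrative:

  /* Sketch of the RFC v2 EPT violation flow, not the exact patch code. */
  if (vmexit_carries_accept_level(vcpu, &accept_level)) {
  	write_lock(&kvm->mmu_lock);
  	/* (1) Split any existing huge mapping above the ACCEPT level. */
  	kvm_split_cross_boundary_leafs(kvm, slot, gfn, accept_level);
  	/* (2) Inhibit future mappings above the ACCEPT level. */
  	for (level = accept_level + 1; level <= KVM_MAX_HUGEPAGE_LEVEL; level++)
  		hugepage_set_guest_inhibit(slot, gfn, level);
  	write_unlock(&kvm->mmu_lock);
  }
  return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);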
It's a TODO to support splitting in the fault path once a new TDX
module supporting blockless tdh_mem_page_demote() is available. After
that, step (1) would be unnecessary, and step (2) could be performed
under a shared mmu_lock to enhance performance.
Open 3: How to handle the TDX_INTERRUPTED_RESTARTABLE error from
tdh_mem_page_demote().
Currently, the TDX module returns TDX_INTERRUPTED_RESTARTABLE when
it receives interrupts (including NMIs) during the SEAMCALL
tdh_mem_page_demote(). However, a new TDX module is being planned
that will disable this interrupt checking (including for NMIs)
during tdh_mem_page_demote() for basic TDX (with or without DPAMT).
As a result, the loop that endlessly retries on
TDX_INTERRUPTED_RESTARTABLE has been removed in RFC v2.
The retry logic remains in the code tree [2] for testing purposes
with the current TDX module.
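For reference, the retry kept in [2] is essentially a loop of the
following shape (a sketch; tdh_mem_page_demote()'s signature here is
an assumption, as the wrapper is introduced by this series):

  /*
   * Sketch of the retry removed from RFC v2: with the current TDX
   * module, an interrupted tdh_mem_page_demote() is simply reissued
   * until it completes.
   */
  do {
  	err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
  				  &entry, &level_state);
  } while (err == TDX_INTERRUPTED_RESTARTABLE);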
Open 4: How to handle unmap failures.
To enable guest_memfd to support in-place conversion between shared
and private memory [5], TDX must not hold a refcount on the private
pages allocated from guest_memfd (see details in patch 6).
Without TDX holding an extra folio refcount for memory that is still
mapped in the S-EPT, guest_memfd may free a page while it is still
mapped in the S-EPT, since the current KVM unmap operation does not
return errors to its callers. (For TDX, the error could arise from
failures in the SEAMCALLs tdh_mem_page_remove(),
tdh_phymem_page_wbinvd_hkid(), or tdh_phymem_page_reclaim().)
Several approaches have been explored (refer to the links in patch
6). Given the complexity, and given that S-EPT zapping failures are
currently only possible due to bugs in KVM or the TDX module, which
are very rare in a production kernel, this RFC v2 adopts a
straightforward approach: simply avoid holding the page reference
count in TDX and generate a KVM_BUG_ON() on S-EPT unmap failure,
without propagating the error back to guest_memfd or notifying
guest_memfd through any out-of-band method. To be robust against
bugs, users can enable panic_on_warn as usual.
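In code, this policy boils down to patterns like the following in the
S-EPT removal path (a simplified sketch of the pattern already used in
arch/x86/kvm/vmx/tdx.c):

  /*
   * Sketch: an S-EPT unmap failure indicates a KVM or TDX module bug,
   * so mark the VM as bugged via KVM_BUG_ON() rather than propagating
   * the error back to guest_memfd.
   */
  err = tdh_mem_page_remove(&kvm_tdx->td, gpa, tdx_level, &entry,
  			    &level_state);
  if (KVM_BUG_ON(err, kvm)) {
  	pr_tdx_error_2(TDH_MEM_PAGE_REMOVE, err, entry, level_state);
  	return -EIO;
  }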
Main changes from RFC v1
====
- Switched from the 2MB THP series [4] to the HugeTLB-based in-place
conversion series v2 [5] as the huge page allocator. Patch 17 is the
updated implementation in guest_memfd for punch hole and
private-to-shared conversion.
- Rebased onto the DPAMT series [7] and pulled patches from [7.1] for
DPAMT-related changes in TDX huge pages (patches 19-23).
- Addressed the 4 opens in RFC v1.
- Updated tdh_mem_page_aug(), tdh_phymem_page_wbinvd_hkid(),
tdh_phymem_page_reclaim(), introduced tdx_clear_folio() to support
"folios" rather than "pages".
- Renamed kvm_split_boundary_leafs() to kvm_split_cross_boundary_leafs()
and made it leverage the existing tdp_mmu_split_huge_pages_root().
The API is now available to both direct and mirror roots and is also
usable under a shared mmu_lock.
- To address the locking issues present in the gmem series [4][5], this
RFC v2 is now based on the RFC patch "KVM: TDX: Decouple TDX init mem
region from kvm_gmem_populate()" [9].
(Will rebase onto Sean's proposed fixes ([13]-[16]) later, when the
formal patches are available. This is not expected to require more
than minimal changes to the implementation of the TDX huge page
series.)
Design
====
1. guest_memfd
Although the TDX huge pages series uses guest_memfd as the huge page
allocator, it does not depend on the implementation specifics of
guest_memfd. Therefore, aside from the guest_memfd-specific code in
patch 17, which calls kvm_split_cross_boundary_leafs() to split private
mappings for hole punching and private-to-shared conversion, the TDX
huge pages series can function with either 2MB THP guest_memfd [4] (by
omitting patch 17) or HugeTLB-based in-place conversion guest_memfd [5].
2. Page size during TD build time
Page size is forced to 4KB in the last patch (patch 23; see the
sketch below), because
- tdh_mem_page_add() only adds private pages at 4KB.
- There is no need to allow merging or splitting during TD build time.
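A minimal sketch of the enforcement, assuming it lands in TDX's
private_max_mapping_level hook (names may differ from the actual
patch):

  static int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
  {
  	/*
  	 * tdh_mem_page_add() can only map private pages at 4KB, so
  	 * force 4KB until the TD is RUNNABLE; afterwards allow up to
  	 * 2MB.
  	 */
  	if (to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE)
  		return PG_LEVEL_4K;

  	return PG_LEVEL_2M;
  }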
3. Page size during TD runtime
kvm_tdp_mmu_page_fault() can create mappings up to the 2MB level, as
determined by TDX's private_max_mapping_level hook. This function can be
triggered by host pre-faults, guest ACCEPT operations, or guest GPA
accesses.
KVM's mapping size can be influenced by the guest's ACCEPT size. When
faults that carry the guest's ACCEPT level occur, the guest ACCEPT level
info is passed to the KVM MMU core by setting the GUEST_INHIBIT bit in a
slot's lpage_info above the guest ACCEPT level. Consequently, the KVM
MMU core will map up to the guest's ACCEPT level.
Currently, the GUEST_INHIBIT bit is set in a one-off manner under a
write mmu_lock. Once set, it will not be unset, and the GFN will not be
allowed to be mapped at higher levels, even if the mappings are zapped
and re-accepted by the guest at higher levels later. This approach
simplifies the code and has been tested to have a minor impact on the
huge page map count with a typical Linux TD. Future optimizations may
include setting the GUEST_INHIBIT bit under a shared mmu_lock or
unsetting the GUEST_INHIBIT bit upon zapping.
4. Page splitting (page demotion)
Page splitting can occur due to (1) private-to-shared conversion,
(2) hole punching in guest_memfd, or (3) a guest's ACCEPT operation at
a lower level than KVM's created mapping.
With the current TDX module, splitting huge mappings in S-EPT requires
executing tdh_mem_range_block(), tdh_mem_track(), kicking off vCPUs,
and tdh_mem_page_demote() in sequence. If DPAMT is involved,
tdh_phymem_pamt_add() is also required (see bullet 5).
Currently, only scenario (3) may trigger page splitting under a shared
mmu_lock. However, to avoid the complexity of handling BUSY errors when
tdh_mem_page_demote() depends on tdh_mem_range_block(), TDX page
splitting under a shared mmu_lock is not supported in this series
(patch 11 directly returns -EOPNOTSUPP). Instead, such splitting is
avoided by triggering the split under an exclusive mmu_lock on a fault
caused by (3) and setting the GUEST_INHIBIT bit in a slot's lpage_info
above the guest ACCEPT level. Supporting splitting in the fault path is
a TODO for when a new TDX module that supports blockless
tdh_mem_page_demote() becomes available.
With an exclusive kvm->mmu_lock, TDX_OPERAND_BUSY is handled similarly
to removing a private page, i.e., by kicking off all vCPUs and retrying,
which should succeed on the second attempt.
TDX_INTERRUPTED_RESTARTABLE will not be returned from
tdh_mem_page_demote() for basic TDX in the new TDX module that is
currently in planning. Therefore, this error is not handled in this RFC
v2.
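Putting this together, below is a sketch of the split path under the
exclusive mmu_lock. SEAMCALL wrapper signatures are simplified,
tdh_mem_page_demote()'s signature is an assumption, and the helper
name is illustrative:

  static int tdx_split_2m_mapping(struct kvm *kvm, gfn_t gfn,
  				  struct page *sept_page)
  {
  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
  	gpa_t gpa = gfn_to_gpa(gfn);
  	u64 entry, level_state, err;

  	/* 1. Block the 2MB range to stop new S-EPT translations. */
  	err = tdh_mem_range_block(&kvm_tdx->td, gpa, TDX_PS_2M, &entry,
  				  &level_state);
  	if (unlikely(tdx_operand_busy(err))) {
  		/* Kick vCPUs out of the TD; the retry should succeed. */
  		kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
  		err = tdh_mem_range_block(&kvm_tdx->td, gpa, TDX_PS_2M,
  					  &entry, &level_state);
  	}
  	if (KVM_BUG_ON(err, kvm))
  		return -EIO;

  	/* 2. Bump the TD's TLB epoch and flush stale translations. */
  	err = tdh_mem_track(&kvm_tdx->td);
  	if (KVM_BUG_ON(err, kvm))
  		return -EIO;
  	kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);

  	/* 3. Demote: replace the 2MB leaf with a 4KB-leaf page table. */
  	err = tdh_mem_page_demote(&kvm_tdx->td, gpa, TDX_PS_2M, sept_page,
  				  &entry, &level_state);
  	if (KVM_BUG_ON(err, kvm))
  		return -EIO;

  	return 0;
  }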
5. Dynamic PAMT
Currently, DPAMT's involvement with TDX huge pages is limited to page
splitting.
This part of the design depends on the outcome of the design
discussions for the DPAMT series [7].
The call stack based on the current design is as follows:
kvm_split_cross_boundary_leafs
  kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs
    tdp_mmu_split_cross_boundary_leafs
      tdp_mmu_split_huge_pages_root
        1.1 tdp_mmu_alloc_sp_for_split
        1.2 topup_mirror_caches <-------------------------------
        2.  tdp_mmu_split_huge_page                            |
              tdp_mmu_link_sp                                  |
                tdp_mmu_iter_set_spte                          |
                  tdp_mmu_set_spte                             |
                    split_external_spt                         |
                      kvm_x86_call(split_external_spt)         |
                        tdx_sept_split_private_spt             |
                          3.1 BLOCK, TRACK                     |
                          3.2 PAMT_ADD for the adding page   --|
                              table page                       |
                          3.3 alloc pamt pages for the 2MB   --
                              guest page and DEMOTE
1.2 allocates PAMT pages in the per-VM KVM MMU memory cache for both
the page table page to be added and the 2MB guest page to be demoted
(patch 21).
3.2 and 3.3 retrieve the PAMT pages from the per-VM KVM MMU memory
cache and perform tdh_phymem_pamt_add() and tdh_mem_page_demote().
Holding the exclusive mmu_lock after 1.2 and re-checking the count of
free pages ensures that 3.2 and 3.3 can safely acquire enough pages in
atomic context.
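Below is a sketch of the pre-allocation half (1.2), assuming a
dedicated per-VM kvm_mmu_memory_cache for PAMT pages; the cache field
and tdx_dpamt_entry_pages() follow the DPAMT series' direction and may
differ in its final form:

  /*
   * Sketch of 1.2: top up a per-VM cache with PAMT pages before the
   * split path, which runs in atomic context under mmu_lock and thus
   * cannot allocate.  3.2/3.3 later pull pages from this cache via
   * kvm_mmu_memory_cache_alloc(), which cannot fail after a
   * successful topup.
   */
  static int tdx_topup_pamt_cache(struct kvm *kvm, int nr_splits)
  {
  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
  	/*
  	 * Per split: PAMT pages for the new S-EPT page table page plus
  	 * PAMT pages for the 2MB guest page being demoted.
  	 */
  	int min = nr_splits * tdx_dpamt_entry_pages() * 2;

  	return kvm_mmu_topup_memory_cache(&kvm_tdx->pamt_page_cache, min);
  }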
6. Page merging (page promotion)
Promotion is disallowed (in patch 7), because
- The current TDX module requires all 4KB leafs to be either all PENDING
or all ACCEPTED before a successful promotion to 2MB. This requirement
prevents successful page merging after partially converting a 2MB
range from private to shared and then back to private, which is the
primary scenario necessitating page promotion.
- tdh_mem_page_promote() depends on tdh_mem_range_block() in the current
TDX module. Consequently, handling BUSY errors is complex, as page
merging typically occurs in the fault path under a shared mmu_lock.
Patches Layout
====
- Patches 01-05: Update SEAMCALL wrappers and the page clearing helper.
- Patch 06: Drop holding gmem page refcount in TDX to facilitate
guest_memfd in-place conversion [5].
- Patch 07: Disallow page merging for TDX.
- Patches 08-11: Enable KVM MMU core to propagate splitting requests to TDX
and provide the corresponding implementation in TDX.
- Patch 12: Introduce a higher level API to split pages.
- Patches 13-14: Split and inhibit huge pages in EPT violation handler.
- Patches 15-17: Split huge pages on page conversion/punch hole.
- Patches 18-22: Dynamic PAMT related changes for TDX huge pages.
- Patch 23: Turn on 2MB huge pages.
Testing
====
The kernel code is available at [2].
The code base includes
- kvm-x86-next-2025.07.21 (with shutdown optimization [8])
- guest_memfd code [6]
- DPAMT v2 [7]
- TDX selftests from [6].
TDX huge pages can be tested with the KVM selftest tdx_vm_huge_page in [2]
or with the matching QEMU [3].
The 2MB mapping count can be checked via "/sys/kernel/debug/kvm/pages_2m".
Huge pages have been verified to work with DPAMT and the shutdown
optimization on TDX module version TDX_1.5.20.00.887, though further
testing is needed (e.g., the mmu stress test).
Thanks
Yan
[1] RFC v1: https://lore.kernel.org/all/20250424030033.32635-1-yan.y.zhao@intel.com
[2] Kernel code: https://github.com/intel/tdx/tree/huge_page_v2
[3] QEMU code: https://github.com/intel-staging/qemu-tdx/tree/gmem-tdx-hugepage-v2
[4] 2M THP: https://lore.kernel.org/all/20241212063635.712877-1-michael.roth@amd.com
[5] 1G: https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com
[6] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
[7] DPAMT: https://lore.kernel.org/all/20250609191340.2051741-1-kirill.shutemov@linux.intel.com
[7.1] git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge
[8] shutdown: https://lore.kernel.org/all/20250718181541.98146-1-seanjc@google.com
[9] decouple: https://lore.kernel.org/all/20250703062641.3247-1-yan.y.zhao@intel.com
[10] https://lore.kernel.org/all/4d61104bff388a081ff8f6ae4ac71e05a13e53c3.1708933624.git.isaku.yamahata@intel.com/
[11] https://lore.kernel.org/all/3d2a6bfb033ee1b51f7b875360bd295376c32b54.1708933624.git.isaku.yamahata@intel.com/
[12] https://lore.kernel.org/all/20250424030713.403-1-yan.y.zhao@intel.com
[13] https://lore.kernel.org/all/aG_pLUlHdYIZ2luh@google.com/
[14] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@google.com/
[15] https://lore.kernel.org/lkml/20250613005400.3694904-2-michael.roth@amd.com
[16] https://lore.kernel.org/all/20250729225455.670324-15-seanjc@google.com
[17] https://lore.kernel.org/all/a6ffe23fb97e64109f512fa43e9f6405236ed40a.camel@intel.com
Edgecombe, Rick P (1):
KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror
root
Isaku Yamahata (1):
KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table
splitting
Kirill A. Shutemov (5):
x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC
is set
KVM: TDX: Pass down pfn to split_external_spt()
KVM: TDX: Handle Dynamic PAMT in tdh_mem_page_demote()
KVM: TDX: Preallocate PAMT pages to be used in split path
KVM: TDX: Handle Dynamic PAMT on page split
Xiaoyao Li (1):
x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
Yan Zhao (15):
x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge
pages
KVM: TDX: Introduce tdx_clear_folio() to clear huge pages
x86/tdx: Enhance tdh_phymem_page_reclaim() to support huge pages
KVM: TDX: Do not hold page refcount on private guest pages
KVM: x86/tdp_mmu: Add split_external_spt hook called during write
mmu_lock
KVM: TDX: Enable huge page splitting under write kvm->mmu_lock
KVM: x86: Reject splitting huge pages under shared mmu_lock for mirror
root
KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
KVM: x86: Introduce hugepage_set_guest_inhibit()
KVM: TDX: Split and inhibit huge mappings if a VMExit carries level
info
KVM: Change the return type of gfn_handler_t() from bool to int
KVM: x86: Split cross-boundary mirror leafs for
KVM_SET_MEMORY_ATTRIBUTES
KVM: guest_memfd: Split for punch hole and private-to-shared
conversion
KVM: TDX: Turn on PG_LEVEL_2M after TD is RUNNABLE
arch/arm64/kvm/mmu.c | 8 +-
arch/loongarch/kvm/mmu.c | 8 +-
arch/mips/kvm/mmu.c | 6 +-
arch/powerpc/kvm/book3s.c | 4 +-
arch/powerpc/kvm/e500_mmu_host.c | 8 +-
arch/riscv/kvm/mmu.c | 12 +-
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 7 +
arch/x86/include/asm/tdx.h | 18 ++-
arch/x86/kvm/mmu.h | 3 +
arch/x86/kvm/mmu/mmu.c | 85 ++++++++---
arch/x86/kvm/mmu/tdp_mmu.c | 208 ++++++++++++++++++++++---
arch/x86/kvm/mmu/tdp_mmu.h | 3 +
arch/x86/kvm/vmx/tdx.c | 238 ++++++++++++++++++++++-------
arch/x86/kvm/vmx/tdx_arch.h | 3 +
arch/x86/virt/vmx/tdx/tdx.c | 102 ++++++++++---
arch/x86/virt/vmx/tdx/tdx.h | 1 +
include/linux/kvm_host.h | 14 +-
virt/kvm/guest_memfd.c | 142 ++++++++++-------
virt/kvm/kvm_main.c | 37 +++--
20 files changed, 691 insertions(+), 217 deletions(-)
--
2.43.2