Message-ID: <20250609191340.2051741-1-kirill.shutemov@linux.intel.com>
Date: Mon, 9 Jun 2025 22:13:28 +0300
From: "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
To: pbonzini@...hat.com,
seanjc@...gle.com,
dave.hansen@...ux.intel.com
Cc: rick.p.edgecombe@...el.com,
isaku.yamahata@...el.com,
kai.huang@...el.com,
yan.y.zhao@...el.com,
chao.gao@...el.com,
tglx@...utronix.de,
mingo@...hat.com,
bp@...en8.de,
kvm@...r.kernel.org,
x86@...nel.org,
linux-coco@...ts.linux.dev,
linux-kernel@...r.kernel.org,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
Subject: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
This patchset enables Dynamic PAMT in TDX. Please review.
Previously, we expected it to be upstreamed after huge page support, but
huge pages require support on the guest_memfd side, which might take time
to reach upstream. Dynamic PAMT has no such dependencies.
The patchset can be found here:
git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt
==========================================================================
The Physical Address Metadata Table (PAMT) holds TDX metadata for
physical memory and must be allocated by the kernel during TDX module
initialization.
The exact size of the required PAMT memory is determined by the TDX
module and may vary between TDX module versions, but currently it is
approximately 0.4% of the system memory. This is a significant
commitment, especially if it is not known upfront whether the machine
will run any TDX guests.
The Dynamic PAMT feature reduces static PAMT allocations. PAMT_1G and
PAMT_2M levels are still allocated on TDX module initialization, but the
PAMT_4K level is allocated dynamically, reducing static allocations to
approximately 0.004% of the system memory.
PAMT memory is dynamically allocated as pages gain TDX protections.
It is reclaimed when TDX protections have been removed from all
pages in a contiguous area.
Dynamic PAMT support in TDX module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dynamic PAMT is a TDX feature that allows the VMM to allocate PAMT_4K as
needed. PAMT_1G and PAMT_2M are still allocated statically at the time of
TDX module initialization. At the init stage, allocation of PAMT_4K is
replaced with PAMT_PAGE_BITMAP, which currently requires one bit of memory
per 4K page.
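To put the "one bit per 4K" figure in perspective, here is a rough
sketch of the arithmetic. The helper name pamt_page_bitmap_bytes() is
made up for illustration and is not part of the patchset; the real size
is reported by the TDX module and may differ.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE_4K 4096ULL

/*
 * Hypothetical helper: bytes needed for a bitmap with one bit per 4K
 * page of memory, rounded up to a whole byte.
 */
static uint64_t pamt_page_bitmap_bytes(uint64_t mem_bytes)
{
	uint64_t pages = (mem_bytes + SKETCH_PAGE_SIZE_4K - 1) / SKETCH_PAGE_SIZE_4K;

	return (pages + 7) / 8;
}
```

For 1 TiB of memory this gives 32 MiB, or about 0.003% of memory, which
is in line with the roughly 0.004% static footprint quoted above.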
The VMM is responsible for allocating and freeing PAMT_4K. There are two
new SEAMCALLs for this: TDH.PHYMEM.PAMT.ADD and TDH.PHYMEM.PAMT.REMOVE.
They add/remove PAMT memory in the form of a page pair. There is no
requirement for these pages to be contiguous.
A page pair supplied via TDH.PHYMEM.PAMT.ADD covers the specified 2M
region. It allows any 4K page in the region to be used by the TDX module.
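Since a page pair covers a whole 2M region, two 4K pages share PAMT
memory exactly when they fall into the same 2M-aligned range. A small
sketch of that mapping, with hypothetical helper names that do not come
from the patchset:

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SIZE_2M (1ULL << 21)

/* Base address of the 2M region that contains the given HPA. */
static uint64_t pamt_region_base(uint64_t hpa)
{
	return hpa & ~(SKETCH_SIZE_2M - 1);
}

/* True if both HPAs would be covered by the same PAMT page pair. */
static bool pamt_same_region(uint64_t a, uint64_t b)
{
	return pamt_region_base(a) == pamt_region_base(b);
}
```

With this, 0x200000 and 0x3ff000 land in the same region, while 0x1ff000
and 0x200000 do not.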
With Dynamic PAMT, a number of SEAMCALLs can now fail due to missing PAMT
memory (TDX_MISSING_PAMT_PAGE_PAIR):
- TDH.MNG.CREATE
- TDH.MNG.ADDCX
- TDH.VP.ADDCX
- TDH.VP.CREATE
- TDH.MEM.PAGE.ADD
- TDH.MEM.PAGE.AUG
- TDH.MEM.PAGE.DEMOTE
- TDH.MEM.PAGE.RELOCATE
Basically, if you supply memory to a TD, this memory has to be backed by
PAMT memory.
Once no TD uses the 2M range, the PAMT page pair can be reclaimed with
TDH.PHYMEM.PAMT.REMOVE.
The TDX module tracks PAMT memory usage and can give the VMM a hint that
PAMT memory can be removed. Such a hint is provided by all SEAMCALLs that
remove memory from a TD:
- TDH.MEM.SEPT.REMOVE
- TDH.MEM.PAGE.REMOVE
- TDH.MEM.PAGE.PROMOTE
- TDH.MEM.PAGE.RELOCATE
- TDH.PHYMEM.PAGE.RECLAIM
With Dynamic PAMT, TDH.MEM.PAGE.DEMOTE takes a PAMT page pair as
additional input to populate PAMT_4K on split. TDH.MEM.PAGE.PROMOTE
returns the PAMT page pair that is no longer needed.
PAMT memory is a global resource and is not tied to a specific TD. The TDX
module maintains PAMT memory in a radix tree addressed by physical
address. Each entry in the tree can be locked with a shared or exclusive
lock. Any modification of the tree requires an exclusive lock.
Any SEAMCALL that takes an explicit HPA as an argument will walk the tree,
taking a shared lock on entries. This is required to make sure that the
page pointed to by the HPA is of a compatible type for the usage.
TDCALLs don't take PAMT locks as none of them take an HPA as an argument.
Dynamic PAMT enabling in kernel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The kernel maintains a refcount for every 2M region, with two helpers:
tdx_pamt_get() and tdx_pamt_put().
The refcount represents the number of users of the PAMT memory in the
region. The kernel calls TDH.PHYMEM.PAMT.ADD on the 0->1 transition and
TDH.PHYMEM.PAMT.REMOVE on the 1->0 transition.
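The refcounting rule can be modelled in a few lines of userspace C. This
is only a toy, single-threaded sketch: model_pamt_get(),
model_pamt_put() and the stub_*() functions are invented names, the
SEAMCALLs are replaced by counters, and the locking and error handling
that the real tdx_pamt_get()/tdx_pamt_put() need are left out.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_REGIONS 8	/* toy address space: 8 x 2M regions */

static unsigned int model_refcount[MODEL_NR_REGIONS];
static unsigned int nr_pamt_add;	/* stubbed TDH.PHYMEM.PAMT.ADD calls */
static unsigned int nr_pamt_remove;	/* stubbed TDH.PHYMEM.PAMT.REMOVE calls */

/* Stand-ins for the SEAMCALLs: they only count invocations. */
static void stub_pamt_add(uint64_t region)
{
	(void)region;
	nr_pamt_add++;
}

static void stub_pamt_remove(uint64_t region)
{
	(void)region;
	nr_pamt_remove++;
}

/* Take a reference on the 2M region containing hpa; add PAMT on 0->1. */
static void model_pamt_get(uint64_t hpa)
{
	uint64_t idx = hpa >> 21;

	assert(idx < MODEL_NR_REGIONS);
	if (model_refcount[idx]++ == 0)
		stub_pamt_add(idx);
}

/* Drop a reference on the 2M region; remove PAMT on 1->0. */
static void model_pamt_put(uint64_t hpa)
{
	uint64_t idx = hpa >> 21;

	assert(idx < MODEL_NR_REGIONS);
	assert(model_refcount[idx] > 0);
	if (--model_refcount[idx] == 0)
		stub_pamt_remove(idx);
}
```

Two gets on pages in the same 2M region trigger a single add, and the
remove only happens once the last of them is put.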
The function tdx_alloc_page() allocates a new page and ensures that it is
backed by PAMT memory. Pages allocated in this manner are ready to be used
for a TD. The function tdx_free_page() frees the page and releases the
PAMT memory for the 2M region if it is no longer needed.
PAMT memory gets allocated as part of TD init, VCPU init, on populating
the SEPT tree and on adding guest memory (both during TD build and via AUG
on accept). Splitting a 2M page into 4K pages also requires PAMT memory.
PAMT memory is removed on reclaim of control pages and guest memory.
Populating PAMT memory on fault and on split is tricky, as the kernel
cannot allocate memory from the context where it is needed. These code
paths use pre-allocated PAMT memory pools.
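The general shape of such a pool, sketched in userspace C: it is topped
up from a context where allocation is allowed and consumed from a
context where it is not. The struct and function names are hypothetical,
and calloc() stands in for a page allocation; the patchset itself uses a
per-VCPU pool inside KVM.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_POOL_CAP 4	/* arbitrary capacity for the sketch */

struct sketch_pamt_pool {
	void *pages[SKETCH_POOL_CAP];
	size_t nr;
};

/*
 * Called where allocation is allowed: fill the pool up to 'want' pages.
 * Returns false if 'want' exceeds the capacity or an allocation fails.
 */
static bool sketch_pool_topup(struct sketch_pamt_pool *pool, size_t want)
{
	if (want > SKETCH_POOL_CAP)
		return false;

	while (pool->nr < want) {
		void *page = calloc(1, 4096);	/* stand-in for a page */

		if (!page)
			return false;
		pool->pages[pool->nr++] = page;
	}
	return true;
}

/* Called where allocation is not allowed: only hands out pooled pages. */
static void *sketch_pool_take(struct sketch_pamt_pool *pool)
{
	return pool->nr ? pool->pages[--pool->nr] : NULL;
}
```

The caller tops up the pool with enough pages for the worst case (for
example one page pair) before entering the fault path, so that
sketch_pool_take() never has to allocate.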
Previous attempt at Dynamic PAMT enabling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The initial attempt at kernel enabling was quite different. It was built
around lazy PAMT allocation: only trying to add a PAMT page pair if a
SEAMCALL fails due to a missing PAMT and reclaiming it based on hints
provided by the TDX module.
The motivation was to avoid duplicating the PAMT memory refcounting that
the TDX module does on the kernel side.
This approach is inherently more racy as there is no serialization of
PAMT memory add/remove against SEAMCALLs that add/remove memory for a TD.
Such serialization would require global locking, which is not feasible.
This approach worked, but at some point it became clear that it could not
be robust as long as the kernel avoids TDX_OPERAND_BUSY loops.
TDX_OPERAND_BUSY will occur as a result of the races mentioned above.
This approach was abandoned in favor of explicit refcounting.
v2:
- Drop phys_prepare/cleanup. Use kvm_get_running_vcpu() to reach per-VCPU PAMT
memory pool from TDX code instead.
- Move code that allocates/frees PAMT out of KVM;
- Allocate refcounts per-memblock, not per-TDMR;
- Fix free_pamt_metadata() for machines without Dynamic PAMT;
- Fix refcounting in tdx_pamt_put() error path;
- Export functions where they are used;
- Consolidate TDX error handling code;
- Add documentation for Dynamic PAMT;
- Mark /proc/meminfo patch [NOT-FOR-UPSTREAM];
Kirill A. Shutemov (12):
x86/tdx: Consolidate TDX error handling
x86/virt/tdx: Allocate page bitmap for Dynamic PAMT
x86/virt/tdx: Allocate reference counters for PAMT memory
x86/virt/tdx: Add tdx_alloc/free_page() helpers
KVM: TDX: Allocate PAMT memory in __tdx_td_init()
KVM: TDX: Allocate PAMT memory in tdx_td_vcpu_init()
KVM: TDX: Preallocate PAMT pages to be used in page fault path
KVM: TDX: Handle PAMT allocation in fault path
KVM: TDX: Reclaim PAMT memory
[NOT-FOR-UPSTREAM] x86/virt/tdx: Account PAMT memory and print it in
/proc/meminfo
x86/virt/tdx: Enable Dynamic PAMT
Documentation/x86: Add documentation for TDX's Dynamic PAMT
Documentation/arch/x86/tdx.rst | 108 ++++++
arch/x86/coco/tdx/tdx.c | 6 +-
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/include/asm/set_memory.h | 3 +
arch/x86/include/asm/tdx.h | 40 ++-
arch/x86/include/asm/tdx_errno.h | 96 +++++
arch/x86/include/asm/tdx_global_metadata.h | 1 +
arch/x86/kvm/mmu/mmu.c | 7 +
arch/x86/kvm/vmx/tdx.c | 102 ++++--
arch/x86/kvm/vmx/tdx.h | 1 -
arch/x86/kvm/vmx/tdx_errno.h | 40 ---
arch/x86/mm/Makefile | 2 +
arch/x86/mm/meminfo.c | 11 +
arch/x86/mm/pat/set_memory.c | 2 +-
arch/x86/virt/vmx/tdx/tdx.c | 380 +++++++++++++++++++-
arch/x86/virt/vmx/tdx/tdx.h | 5 +-
arch/x86/virt/vmx/tdx/tdx_global_metadata.c | 3 +
virt/kvm/kvm_main.c | 1 +
18 files changed, 702 insertions(+), 108 deletions(-)
create mode 100644 arch/x86/include/asm/tdx_errno.h
delete mode 100644 arch/x86/kvm/vmx/tdx_errno.h
create mode 100644 arch/x86/mm/meminfo.c
--
2.47.2