Message-ID: <20250918232224.2202592-1-rick.p.edgecombe@intel.com>
Date: Thu, 18 Sep 2025 16:22:08 -0700
From: Rick Edgecombe <rick.p.edgecombe@...el.com>
To: kas@...nel.org,
bp@...en8.de,
chao.gao@...el.com,
dave.hansen@...ux.intel.com,
isaku.yamahata@...el.com,
kai.huang@...el.com,
kvm@...r.kernel.org,
linux-coco@...ts.linux.dev,
linux-kernel@...r.kernel.org,
mingo@...hat.com,
pbonzini@...hat.com,
seanjc@...gle.com,
tglx@...utronix.de,
x86@...nel.org,
yan.y.zhao@...el.com,
vannapurve@...gle.com
Cc: rick.p.edgecombe@...el.com
Subject: [PATCH v3 00/16] TDX: Enable Dynamic PAMT
Hi,
This is the 3rd revision of Dynamic PAMT, a new feature that reduces the
memory use of TDX.
On v2 (as well as in PUCK) there was some discussion of the
refcount/locking design tradeoffs for Dynamic PAMT. In v3, I’ve basically
gone through and tried to make the details around this more reviewable.
The basic solution is the same as v2, with the changes more about moving
code around or splitting implementations/optimizations. I’m hoping with
this v3 we can close on whether that approach is good enough or not.
I think the patch quality is in ok shape, but it still needs some review.
Maintainers, please feel free to let us work through lower level code
issues in this v3, but I would appreciate engagement on the overall design.
Another remaining open is performance testing, beyond the numbers below on
contention of the global lock.
Lastly, Yan raised some last minute doubts internally about TDX module
locking contention. I’m not sure there is a problem, but we can come to an
agreement as part of the review.
PAMT Background
===============
The TDX module needs to keep data about each physical page it uses. It
requires the kernel to give it memory to use for this purpose, called
PAMT. Internally it wants space for metadata for each page *and* each page
size. That is, if a page is mapped at 2MB in a TD, the tracking does not
live in the allocation used for 4KB mappings of the same physical memory.
The TDX module is designed to use a separate allocation for each page size.
So each memory region that the TDX module could use (aka TDMRs) has three
of these PAMT allocations. They are all allocated during the global TDX
initialization, regardless of whether the memory is actually used for a
TD. This uses up approximately 0.4% of system memory.
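As a rough illustration (a hedged sketch, not the kernel's actual
tdmr_get_pamt_sz() code), the fixed PAMT allocation for one page size
level of a TDMR scales with the number of pages of that size it covers:

    /*
     * Rough sketch of how the fixed PAMT allocation for one page size
     * level of a TDMR scales.  Illustrative only: the real entry sizes
     * come from TDX module metadata and the actual helper lives in
     * arch/x86/virt/vmx/tdx/tdx.c.
     */
    static unsigned long pamt_size_for_level(unsigned long tdmr_size,
                                             unsigned long pgsize,
                                             u16 pamt_entry_size)
    {
            /* One metadata entry per page of this size, page aligned. */
            return PAGE_ALIGN((tdmr_size / pgsize) * pamt_entry_size);
    }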
Dynamic PAMT (DPAMT)
====================
Fortunately, only using physical memory for areas of an address space that
are actually in use is a familiar problem in system engineering, with a
well trodden solution: page tables. It would be great if TDX could do
something like that for PAMT. This is basically the idea for Dynamic PAMT.
However, there are some design aspects that could be surprising for anyone
expecting “PAMT, but page tables”. The following describes these
differences.
DPAMT Levels
------------
Dynamic PAMT focuses on the page size level that has the biggest PAMT
allocation - 4KB page size. Since the 2MB and 1GB levels are smaller
allocations, they are left as fixed arrays, as in normal PAMT. But the
4KB page size is not fully dynamic either: the TDX module still
requires a physically contiguous allocation for tracking each 4KB page in a
TDMR. This space just shrinks significantly, to currently 1 bit per page.
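For scale, a hedged sketch of that 4KB-level allocation (not code from the
series), sized at 1 bit per 4KB page covered by the TDMR:

    /*
     * Hedged sketch of the 4KB-level allocation size under Dynamic PAMT:
     * currently 1 bit per 4KB page covered by the TDMR, still physically
     * contiguous.  Not the series' actual code.
     */
    static unsigned long dpamt_4k_bitmap_size(unsigned long tdmr_size)
    {
            unsigned long nr_4k_pages = tdmr_size >> PAGE_SHIFT;

            return PAGE_ALIGN(BITS_TO_BYTES(nr_4k_pages));
    }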
Page Sizes
----------
Like normal PAMT, Dynamic PAMT wants to provide a way for the TDX module
to have separate PAMT tracking for different page sizes. But unlike normal
PAMT, it does not seamlessly switch between the 2MB and 4KB page sizes
without VMM action. It wants the PAMT mapped at the same level that the
underlying TDX memory is using. In practice this means the VMM needs to
update the PAMT depending on whether secure EPT pages are to be mapped at
2MB or 4KB.
While demote/promote have internal handling for these updates (and support
passing or returning the PAMT pages involved), PAGE.ADD/AUG don’t. Instead
two new SEAMCALLs are provided for the VMM to configure the PAMT to the
intended page size (i.e. 4KB if the page will be mapped at 4KB):
TDH.PHYMEM.PAMT.ADD/REMOVE.
While some operations on TD memory can internally handle the PAMT
adjustments, the opposite is not true. That is, changes in PAMT don't
automatically change private S-EPT page sizes. Instead, an attempt to
remove 4KB size PAMT pages will fail if any pages in the covered range are
in use.
Concurrency
-----------
For every 2MB physical range there could be many 4KB pages used by TDX
(obviously). But each 2MB range only needs one set of PAMT pages added. So
on the first use of the 2MB region the DPAMT needs to be installed, and
once none of the pages in that range are in use, it needs to be freed.
The TDX module actually does track how many pages are using each 2MB range
and gives a hint when the refcount drops to zero. But the refcount is only
incremented when use of the 2MB region actually starts. Like:
  1. TDH.PHYMEM.PAMT.ADD (backing for 2MB range X, for page Y)
  2. TDH.MEM.PAGE.AUG (page Y)          <- Increments refcount for X.
  3. TDH.MEM.PAGE.REMOVE (page Y)       <- Decrements refcount for X,
                                           gives hint in return value.
  4. TDH.PHYMEM.PAMT.REMOVE (range X)   <- Removes backing, checking that
                                           each 4KB page in X is free.
The internal refcount is tricky to use because of the window of time
between TDH.PHYMEM.PAMT.ADD and TDH.MEM.PAGE.AUG. The PAMT.ADD adds the
backing, but doesn’t tell the TDX module a VMM intends to map it. Consider
a range X that includes pages Y and Z, for an implementation that tries
to use these hints:
  CPU 0                                  CPU 1
  TDH.PHYMEM.PAMT.ADD X
    (returns already mapped)
                                         TDH.MEM.PAGE.REMOVE (Y)
                                           (returns refcount 0 hint)
                                         TDH.PHYMEM.PAMT.REMOVE (X)
  TDH.MEM.PAGE.AUG Z (fail)
So the TDX module’s DPAMT refcounts don't track what the VMM intends to do
with the page, only what it has already done. This leaves a window that
needs to be covered.
TDX module locking
------------------
Inside the TDX module there are various locks. The TDX module does not
wait when it encounters contention; instead it returns a BUSY error code.
This leaves the VMM with the option to either loop around the SEAMCALL, or
return an error to the calling code. In some cases in Linux there is no
option to return an error from the operation making the SEAMCALL. To
avoid a potentially unbounded number of retries, Linux opts to kick all
the threads associated with the TD out of the TDX module and retry. This
retry operation is fairly easy to do from within KVM.
Since PAMT is a global resource, this means that this lock contention
could be from any TD. For normal PAMT, the exclusive locks are only
taken at the 4KB page size granularity. In practice, this means any page
that is not shared between TDs won’t have to worry about contention.
However, for DPAMT this changes. The TDH.PHYMEM.PAMT.ADD/REMOVE calls take
a PAMT lock at 2MB granularity. If two calls try to operate on the same
2MB region at the same time, one will get the BUSY error code.
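A hedged sketch of the kick-and-retry pattern described above; the helper
names here are hypothetical stand-ins, not KVM's actual functions:

    /*
     * Sketch of the "kick everything out and retry" approach to BUSY.
     * tdx_operand_busy() and tdx_kick_vcpus_out() are hypothetical
     * stand-ins for whatever the real implementation uses.
     */
    static u64 tdx_seamcall_no_busy(struct kvm *kvm,
                                    u64 (*fn)(void *arg), void *arg)
    {
            u64 err = fn(arg);

            if (!tdx_operand_busy(err))
                    return err;

            /*
             * Force all vCPUs of this TD out of the TDX module so the
             * contended lock is released, then retry once.
             */
            tdx_kick_vcpus_out(kvm);
            return fn(arg);
    }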
Linux Challenges
================
So for dynamic PAMT, Linux needs to:
1. Allocate a different fixed sized allocation for each TDMR, for the 4KB
page size (the 1 bit per page bitmap instead of the normal larger
allocation)
2. Allocate DPAMT for control structures.
3. Allocate DPAMT 4KB page backings for all TD private memory (which is
currently only 4KB) and S-EPT page tables.
1 is easy: just query the new size when DPAMT is in use and use that
instead of the regular 4KB PAMT size. The bitmap allocation is even passed
to the TDX module in the same field. If you take away the TDX docs' naming
around the bitmap, it's just a buffer that changes size.
For 2 and 3, there is a lot to consider. For 2, it is relatively easy as
long as we want to install PAMT on demand since these pages come straight
from the page allocator and not guestmemfd.
For 3, upstream we currently only have 4KB pages, which means we could
ignore a lot of the specifics about matching page sizes - there is only
one. However, TDX huge pages support is also in progress, so we should
avoid a design that would need to be redone immediately.
Installing 4KB DPAMT backings
-----------------------------
Some TDX pages are used for TD control structures. These need to have new
code to install 4KB DPAMT backings. But the main problem is how to do this
for private memory.
Linux needs to add the DPAMT backing for private memory before the page gets
mapped in the TD. Doing so inside the KVM MMU call paths adds
complications around the BUSY error code, as described above. It would be
tempting to install DPAMT pages from a guestmemfd callback. This could
happen outside the KVM MMU locks.
But there are three complications with this. One is that in the case of 2MB
pages, the guest can control the page size. This means that even if 4KB
DPAMT backings are installed automatically, KVM would have to handle edge
cases of PAMT adjustments at runtime anyway. For example, memslot deletion and
re-adding would trigger a zap of huge pages that are later remapped at
4KB. This could *maybe* be worked around by some variant of this technique
[0].
Another wrinkle is that Vishal@...gle has expressed a strong interest in
saving PAMT memory at runtime in the case of 2MB TD private memory. He
wants to support a use case where most TD memory is mapped at 2MB, so he
wants to avoid the overhead of a worst case allocation that assumes all
memory will be mapped at 4KB.
Finally, pre-installing DPAMT pages before the fault doesn’t help with
mapping DPAMT pages for the external (S-EPT) page tables that are
allocated for the fault. So some fault time logic is needed. We could
pre-install DPAMT backing for the external page table cache, which would
happen outside of the MMU lock. This would free us from having to update
DPAMT inside MMU lock. But it would not free KVM from having to do
anything around DPAMT during a fault.
These three non-show-stopping issues tilt things towards using a fault time
DPAMT installation approach for private memory.
Avoiding duplicate attempts to add/remove DPAMT
-----------------------------------------------
As covered above, there isn’t a refcount in the TDX module that we can
use. Using the hints returned by TDH.MEM.PAGE.AUG/REMOVE was tried in v1,
and Kirill was unable to both overcome the races and make nice with new
failure scenarios around DPAMT locks. So in v2 refcounts were allocated on
the kernel side for each 2MB range covered by a TDMR (a range the TDX
module might use). This adds a small memory overhead of 0.0002%, which
is negligible compared to the savings of Dynamic PAMT (shrinking PAMT from
0.4% to 0.004% of system memory). The major
downside is code complexity. These allocations are still large and involve
managing vmalloc space. The v2 solution reserves a vmalloc space to cover
the entire physical address space, and only maps pages for any ranges that
are covered by a TDMR.
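Conceptually, this boils down to a per-2MB-range counter that can be found
from a physical address. A minimal sketch, with illustrative names and
layout rather than the series' exact code:

    /*
     * Minimal sketch of the kernel-side per-2MB refcounts.  In the v2
     * design the backing array lives in a reserved vmalloc area sized
     * for the whole physical address space, with only the TDMR-covered
     * ranges actually populated with pages.
     */
    static atomic_t *pamt_refcounts;

    static atomic_t *tdx_pamt_refcount(unsigned long hpa)
    {
            /* One counter per 2MB of physical address space. */
            return &pamt_refcounts[hpa >> PMD_SHIFT];
    }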
Avoiding contention
-------------------
As covered above, Linux needs to deal with getting BUSY error codes here.
The allocation/mapping paths for all these pages can already handle
failure, but the trickier case is the removal paths. As previously sparred
with during the base series, the TDX module expects to be able to fail these
calls, but the kernel does not expect them to fail. Further, the refcounts
cannot act as a race-free lock on their own. So some synchronization is needed before
actually removing the DPAMT backings.
V2 of the series includes a global lock to be used around actual
installation/removal of the DPAMT backing, combined with opportunistic
checking outside the lock to avoid taking it most of the time. In testing,
booting 10 16GB TDs, the lock only hit contention 1136 times, with 4ms
waiting. This is very small for an operation that took 60s of wall time.
So despite being an (ugly) global lock, the actual impact was small. It
will probably further be reduced in the case of huge pages, where most of
the time 4KB DPAMT installation will not be necessary.
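A hedged sketch of that "check outside the lock, lock only for the
transitions" pattern, building on the per-2MB refcount sketch above.
Function names are illustrative, and tdx_pamt_add() stands in for the
TDH.PHYMEM.PAMT.ADD path:

    /*
     * Illustrative sketch: the global lock is only taken for the
     * 0 -> 1 refcount transition where PAMT pages actually need to be
     * installed; everyone else takes the fast path.  Not the series'
     * exact code.
     */
    static DEFINE_SPINLOCK(pamt_lock);

    static int tdx_pamt_get(unsigned long hpa, struct list_head *prealloc)
    {
            atomic_t *cnt = tdx_pamt_refcount(hpa);
            int ret = 0;

            /* Fast path: the 2MB range already has 4KB DPAMT backing. */
            if (atomic_inc_not_zero(cnt))
                    return 0;

            spin_lock(&pamt_lock);
            if (atomic_read(cnt) == 0)
                    ret = tdx_pamt_add(hpa, prealloc);   /* TDH.PHYMEM.PAMT.ADD */
            if (!ret)
                    atomic_inc(cnt);
            spin_unlock(&pamt_lock);

            return ret;
    }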
Updates in v3
=============
Besides incorporating the feedback and general cleanups, the major design
change was around how DPAMT backing pages are allocated in the fault path.
Allocating DPAMT pages in v2
----------------------------
In v2, there was a specific page cache added in the generic x86 KVM MMU
for DPAMT pages. This was needed because much of the fault happens inside
a spinlock. By the time the fault handler knows whether it needs to
install DPAMT backing, it can no longer allocate pages. This is a common
pattern in the KVM MMU, and so pages are pre-allocated before taking the
MMU spinlock. This way they can be used later if needed.
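For reference, the generic KVM MMU pattern being described looks roughly
like this (simplified and illustrative; the cache field name is an
assumption, not the DPAMT code itself):

    /*
     * Simplified illustration of KVM's usual pre-allocation pattern:
     * top up a cache while sleeping is still allowed, then consume
     * from it under the MMU lock.
     */
    static int example_fault_prep(struct kvm_vcpu *vcpu)
    {
            void *sp;
            int r;

            /* May allocate/sleep, so it must happen before the lock. */
            r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache, 1);
            if (r)
                    return r;

            write_lock(&vcpu->kvm->mmu_lock);
            /* No allocation allowed here; take a page set aside above. */
            sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
            /* ... install sp into the page tables ... */
            write_unlock(&vcpu->kvm->mmu_lock);

            return 0;
    }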
But KVM's page cache infrastructure is not easy to pass into the arch/x86
code and inside the global spin lock where the pages would be consumed. So
instead v2 passed a pointer to a function inside KVM that the arch/x86 code
could call to allocate pages from the KVM page cache component. Since the
control structures need DPAMT backing installed outside of the fault, the
arch/x86 code also had logic to allocate pages directly from the page
allocator. Further, there were various resulting intermediate lists that
had to be marshaled through the DPAMT allocation paths.
This was all a bit complicated to me.
Updates in v3
-------------
V3 redid the areas described above to try to simplify things.
The KVM MMU already has knowledge that TDX needs special memory allocated
for S-EPT. From the fault handler's perspective, this could be seen as just
more memory of the same type. So v3 just turns the external page table
allocation into an x86 op, and provides another op to allocate from it. This
is done as refactoring. Then when dynamic PAMT is added the extra pages
can just be added from within the x86 op in TDX code.
To remove the function pointer callback scheme, the external page
tables are switched to a dirt simple linked list based page cache. This is
somewhat reinventing the wheel, but KVM's kvm_mmu_memory_cache operations
are not easy to expose to the core kernel, and TDX doesn't need much
of the fanciness of initial values and things like that. Building it out
of the kernel's linked list is enough code reuse.
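A sketch of what such a linked-list page cache could look like, with
illustrative names rather than necessarily the series' helpers:

    /*
     * Sketch of a dirt simple page cache built on struct list_head,
     * chaining pre-allocated pages through page->lru.
     */
    struct tdx_page_cache {
            struct list_head pages;
            int nr_pages;
    };

    static int tdx_page_cache_topup(struct tdx_page_cache *cache, int min)
    {
            while (cache->nr_pages < min) {
                    struct page *page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);

                    if (!page)
                            return -ENOMEM;
                    list_add(&page->lru, &cache->pages);
                    cache->nr_pages++;
            }
            return 0;
    }

    static struct page *tdx_page_cache_take(struct tdx_page_cache *cache)
    {
            struct page *page;

            if (list_empty(&cache->pages))
                    return NULL;
            page = list_first_entry(&cache->pages, struct page, lru);
            list_del(&page->lru);
            cache->nr_pages--;
            return page;
    }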
Today the TDX module needs 2 pages per 2MB region for the 4KB size dynamic
PAMT backing. It would be tempting to just pass two pages in, but the TDX
module exposes the number of Dynamic PAMT pages it needs as a metadata
value, so the size is technically variable. To handle this, the design
just passes the simple TDX page cache list into the calls that might need
to allocate dynamic PAMT.
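So the top-up amount comes from that metadata value rather than a
hard-coded 2. A hedged sketch building on the page cache above, where
tdx_dpamt_entry_pages() is a hypothetical accessor for the exported value:

    /*
     * Size the cache from the TDX module's metadata rather than
     * hard-coding 2.  tdx_dpamt_entry_pages() is a hypothetical
     * accessor for that metadata value.
     */
    static int tdx_page_cache_topup_for_pamt(struct tdx_page_cache *cache)
    {
            return tdx_page_cache_topup(cache, tdx_dpamt_entry_pages());
    }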
Considerations for v4
=====================
This solution seems workable. It isolates Dynamic PAMT to TDX code, and
doesn't introduce any extra constraints on generic x86 KVM code. But the
refcounts and global lock on the arch/x86 side of TDX are still ugly.
There has been some internal discussion about pursuing various schemes to
avoid this. But before a potential redesign, I wanted to share the current
version. Both to get feedback on the updates, and so we can consider how
“good enough” the current design is.
Testing and branch
==================
Testing is a bit light currently: just TDX selftests and a simple TDX Linux
guest boot. The branch is here:
https://github.com/intel/tdx/commits/dpamt_v3/
Based on kvm_x86/next (603c090664d3)
[0] https://lore.kernel.org/kvm/20250807094423.4644-1-yan.y.zhao@intel.com/
Kirill A. Shutemov (13):
x86/tdx: Move all TDX error defines into <asm/shared/tdx_errno.h>
x86/tdx: Add helpers to check return status codes
x86/virt/tdx: Allocate page bitmap for Dynamic PAMT
x86/virt/tdx: Allocate reference counters for PAMT memory
x86/virt/tdx: Improve PAMT refcounters allocation for sparse memory
x86/virt/tdx: Add tdx_alloc/free_page() helpers
x86/virt/tdx: Optimize tdx_alloc/free_page() helpers
KVM: TDX: Allocate PAMT memory for TD control structures
KVM: TDX: Allocate PAMT memory for vCPU control structures
KVM: TDX: Handle PAMT allocation in fault path
KVM: TDX: Reclaim PAMT memory
x86/virt/tdx: Enable Dynamic PAMT
Documentation/x86: Add documentation for TDX's Dynamic PAMT
Rick Edgecombe (3):
x86/virt/tdx: Simplify tdmr_get_pamt_sz()
KVM: TDX: Add x86 ops for external spt cache
x86/virt/tdx: Add helpers to allow for pre-allocating pages
Documentation/arch/x86/tdx.rst | 21 +
arch/x86/coco/tdx/tdx.c | 6 +-
arch/x86/include/asm/kvm-x86-ops.h | 2 +
arch/x86/include/asm/kvm_host.h | 11 +-
arch/x86/include/asm/shared/tdx.h | 1 +
arch/x86/include/asm/shared/tdx_errno.h | 109 +++++
arch/x86/include/asm/tdx.h | 76 ++-
arch/x86/include/asm/tdx_global_metadata.h | 1 +
arch/x86/kvm/mmu/mmu.c | 4 +-
arch/x86/kvm/mmu/mmu_internal.h | 2 +-
arch/x86/kvm/vmx/tdx.c | 157 ++++--
arch/x86/kvm/vmx/tdx.h | 3 +-
arch/x86/kvm/vmx/tdx_errno.h | 40 --
arch/x86/virt/vmx/tdx/tdx.c | 505 +++++++++++++++++---
arch/x86/virt/vmx/tdx/tdx.h | 5 +-
arch/x86/virt/vmx/tdx/tdx_global_metadata.c | 3 +
16 files changed, 766 insertions(+), 180 deletions(-)
create mode 100644 arch/x86/include/asm/shared/tdx_errno.h
delete mode 100644 arch/x86/kvm/vmx/tdx_errno.h
--
2.51.0