Message-ID: <pla54zy4z27df57uxmzuog26mddiezbwsyrurnjxivdkg5dibx@574tcxdgjru2>
Date: Thu, 15 May 2025 17:22:00 +0300
From: "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: pbonzini@...hat.com, rick.p.edgecombe@...el.com, 
	isaku.yamahata@...el.com, kai.huang@...el.com, yan.y.zhao@...el.com, tglx@...utronix.de, 
	mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com, kvm@...r.kernel.org, 
	x86@...nel.org, linux-coco@...ts.linux.dev, linux-kernel@...r.kernel.org
Subject: Re: [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT

On Wed, May 14, 2025 at 06:41:10AM -0700, Sean Christopherson wrote:
> On Fri, May 02, 2025, Kirill A. Shutemov wrote:
> > This RFC patchset enables Dynamic PAMT in TDX. It is not intended to be
> > applied, but rather to receive early feedback on the feature design and
> > enabling.
> 
> In that case, please describe the design, and specifically *why* you chose this
> particular design, along with the constraints and rules of dynamic PAMTs that
> led to that decision.  It would also be very helpful to know what options you
> considered and discarded, so that others don't waste time coming up with solutions
> that you already rejected.

Dynamic PAMT support in TDX module
==================================

Dynamic PAMT is a TDX feature that allows the VMM to allocate PAMT_4K as
needed. PAMT_1G and PAMT_2M are still allocated statically at the time of
TDX module initialization. At the init stage, allocation of PAMT_4K is
replaced with PAMT_PAGE_BITMAP, which currently requires one bit of memory
per 4K page.
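
To put the one-bit-per-4K figure in perspective, here is a minimal sketch
of the arithmetic. The helper name pamt_bitmap_bytes() is made up for
illustration and is not part of the patchset or the TDX module ABI.

```c
#include <stdint.h>

#define SZ_4K 4096ULL

/*
 * Hypothetical helper: bytes of PAMT_PAGE_BITMAP needed to cover
 * mem_bytes of physical memory, assuming exactly one bit per 4K page.
 */
static uint64_t pamt_bitmap_bytes(uint64_t mem_bytes)
{
	/* Round up to whole 4K pages. */
	uint64_t pages = (mem_bytes + SZ_4K - 1) / SZ_4K;

	/* One bit per page, rounded up to whole bytes. */
	return (pages + 7) / 8;
}
```

Under that assumption, 1 TiB of memory (2^28 pages of 4K) needs 2^25 bytes,
i.e. 32 MiB of bitmap.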

The VMM is responsible for allocating and freeing PAMT_4K. There's a pair
of new SEAMCALLs for it: TDH.PHYMEM.PAMT.ADD and TDH.PHYMEM.PAMT.REMOVE.
They add/remove PAMT memory in the form of a page pair. There's no
requirement for these pages to be contiguous.

A page pair supplied via TDH.PHYMEM.PAMT.ADD covers the specified 2M
region. It makes any 4K page from the region usable by the TDX module.
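
The region granularity can be sketched as below. The helper names are
hypothetical, and the per-page overhead figure assumes that each page of
the pair is itself 4K, which is not stated explicitly above.

```c
#include <stdint.h>

#define SZ_4K 4096ULL
#define SZ_2M (2ULL << 20)

/* Base of the 2M region that a page pair covering hpa would back. */
static uint64_t pamt_region_base(uint64_t hpa)
{
	return hpa & ~(SZ_2M - 1);
}

/* Number of 4K pages made usable by one page pair. */
static uint64_t pamt_pages_per_pair(void)
{
	return SZ_2M / SZ_4K;
}

/* PAMT bytes per covered 4K page, ASSUMING each page of the pair is 4K. */
static uint64_t pamt_bytes_per_4k(void)
{
	return 2 * SZ_4K / pamt_pages_per_pair();
}
```

With these assumptions one pair covers 512 pages, which works out to 16
bytes of PAMT memory per covered 4K page.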

With Dynamic PAMT, a number of SEAMCALLs can now fail due to missing PAMT
memory (TDX_MISSING_PAMT_PAGE_PAIR):

 - TDH.MNG.CREATE
 - TDH.MNG.ADDCX 
 - TDH.VP.ADDCX
 - TDH.VP.CREATE
 - TDH.MEM.PAGE.ADD
 - TDH.MEM.PAGE.AUG 
 - TDH.MEM.PAGE.DEMOTE
 - TDH.MEM.PAGE.RELOCATE

Basically, if you supply memory to a TD, this memory has to be backed by
PAMT memory.

Once no TD uses the 2M range, the PAMT page pair can be reclaimed with
TDH.PHYMEM.PAMT.REMOVE.

The TDX module tracks PAMT memory usage and can give the VMM a hint that
PAMT memory can be removed. Such a hint is provided by all SEAMCALLs that
remove memory from a TD:

 - TDH.MEM.SEPT.REMOVE
 - TDH.MEM.PAGE.REMOVE
 - TDH.MEM.PAGE.PROMOTE
 - TDH.MEM.PAGE.RELOCATE
 - TDH.PHYMEM.PAGE.RECLAIM

With Dynamic PAMT, TDH.MEM.PAGE.DEMOTE takes a PAMT page pair as an
additional input to populate PAMT_4K on split. TDH.MEM.PAGE.PROMOTE returns
the PAMT page pair that is no longer needed.

PAMT memory is a global resource and is not tied to a specific TD. The TDX
module maintains PAMT memory in a radix tree addressed by physical address.
Each entry in the tree can be locked with a shared or an exclusive lock.
Any modification of the tree requires an exclusive lock.

Any SEAMCALL that takes an explicit HPA as an argument will walk the tree,
taking a shared lock on entries. This is required to make sure that the
page pointed to by the HPA is of a compatible type for the usage.

TDCALLs don't take PAMT locks as none of them take an HPA as an argument.
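
The shared/exclusive rule for tree entries can be modelled as below. This
is a single-threaded toy, not the TDX module's implementation. In
particular it ASSUMES try-lock semantics, where a conflicting request
fails immediately instead of waiting. All names are hypothetical.

```c
#include <stdbool.h>

/* Lock state of one tree entry: -1 exclusive, 0 free, >0 shared holders. */
struct entry_lock {
	int state;
};

/* Readers (tree walks by HPA) may share an entry with other readers. */
static bool lock_shared(struct entry_lock *l)
{
	if (l->state < 0)
		return false;	/* exclusive holder present: report busy */
	l->state++;
	return true;
}

/* Modifying the tree needs the entry with no other holder at all. */
static bool lock_exclusive(struct entry_lock *l)
{
	if (l->state != 0)
		return false;	/* any holder present: report busy */
	l->state = -1;
	return true;
}

static void unlock_shared(struct entry_lock *l)
{
	l->state--;
}

static void unlock_exclusive(struct entry_lock *l)
{
	l->state = 0;
}
```

In this model a PAMT add/remove (exclusive) and a SEAMCALL walking the same
entry (shared) cannot both succeed at the same time; one of them is turned
away.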

Dynamic PAMT enabling in kernel
===============================

The kernel maintains a refcount for every 2M region, with two helpers:
tdx_pamt_get() and tdx_pamt_put().

The refcount represents the number of users of the PAMT memory in the
region. The kernel calls TDH.PHYMEM.PAMT.ADD on the 0->1 transition and
TDH.PHYMEM.PAMT.REMOVE on the 1->0 transition.
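
A rough user-space model of the refcounting rule follows. It is not the
code from the patchset: model_pamt_get()/model_pamt_put() and the mock_*
functions are hypothetical stand-ins, and the model leaves out locking,
atomics and SEAMCALL failure handling, all of which a real implementation
needs.

```c
#include <stdint.h>

#define SZ_2M      (2ULL << 20)
#define NR_REGIONS 8		/* tiny model: 16M of "physical" memory */

static unsigned int pamt_refcount[NR_REGIONS];
static unsigned int nr_add, nr_remove;	/* count mock SEAMCALLs */

/* Mock stand-ins for TDH.PHYMEM.PAMT.ADD and TDH.PHYMEM.PAMT.REMOVE. */
static void mock_pamt_add(uint64_t base)
{
	(void)base;
	nr_add++;
}

static void mock_pamt_remove(uint64_t base)
{
	(void)base;
	nr_remove++;
}

/* Take a reference on the 2M region containing hpa; add PAMT on 0->1. */
static void model_pamt_get(uint64_t hpa)
{
	uint64_t idx = hpa / SZ_2M;

	if (pamt_refcount[idx]++ == 0)
		mock_pamt_add(idx * SZ_2M);
}

/* Drop a reference; remove PAMT on 1->0. Caller must hold a reference. */
static void model_pamt_put(uint64_t hpa)
{
	uint64_t idx = hpa / SZ_2M;

	if (--pamt_refcount[idx] == 0)
		mock_pamt_remove(idx * SZ_2M);
}
```

Two users of pages in the same 2M region trigger a single ADD, and the
REMOVE is issued only after the last of them is gone.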

PAMT memory gets allocated as part of TD init, VCPU init, on populating
the SEPT tree and on adding guest memory (both during TD build and via AUG
on accept).

PAMT memory is removed on reclaim of control pages and guest memory.

Populating PAMT memory on fault is tricky as we cannot allocate memory
from the context where it is needed. I introduced a pair of kvm_x86_ops to
allocate PAMT memory from a per-VCPU pool from a context where the VCPU is
still around and to free it on failure. This flow will likely be reworked
in the next versions.
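
The general shape of such a pool (fill it where allocation is allowed,
consume it where it is not, release leftovers) can be sketched in user
space as below. This is only an illustration of the pattern. The struct
and function names are hypothetical and do not correspond to the
kvm_x86_ops in the patchset, and malloc() stands in for a page allocator.

```c
#include <stddef.h>
#include <stdlib.h>

#define POOL_MAX 4

/* Hypothetical per-VCPU stash of pages set aside for PAMT use. */
struct pamt_pool {
	void *pages[POOL_MAX];
	size_t nr;
};

/* Allocation-friendly context: fill the pool up to `want` pages. */
static int pool_topup(struct pamt_pool *p, size_t want)
{
	if (want > POOL_MAX)
		return -1;

	while (p->nr < want) {
		void *page = malloc(4096);

		if (!page)
			return -1;
		p->pages[p->nr++] = page;
	}
	return 0;
}

/* Fault path: never allocates, only hands out what was set aside. */
static void *pool_take(struct pamt_pool *p)
{
	return p->nr ? p->pages[--p->nr] : NULL;
}

/* Give back whatever was not consumed, e.g. after a failure. */
static void pool_free(struct pamt_pool *p)
{
	while (p->nr)
		free(p->pages[--p->nr]);
}
```

A caller would top the pool up with two pages (one page pair) before
entering the fault path, take both there, and free any leftovers if the
operation fails.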

Previous attempt on Dynamic PAMT enabling
=========================================

My initial kernel enabling attempt was quite different. I wanted to make
PAMT allocation lazy: only try to add a PAMT page pair if a SEAMCALL fails
due to missing PAMT, and reclaim it based on the hint provided by the TDX
module.
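
The add-on-failure half of that lazy scheme looks roughly like this. It is
a single-threaded mock with one 2M region, so it cannot show any
concurrency problems. Every name here is hypothetical, and the mock status
values only mirror the TDX_MISSING_PAMT_PAGE_PAIR error named earlier.

```c
#include <stdbool.h>

enum mock_status {
	MOCK_SUCCESS,
	MOCK_MISSING_PAMT_PAGE_PAIR,
};

static bool region_has_pamt;		/* one 2M region in this model */
static unsigned int nr_pamt_add;	/* count mock PAMT.ADD calls */

/* Mock of a SEAMCALL that needs PAMT backing, e.g. a page AUG. */
static enum mock_status mock_page_aug(void)
{
	return region_has_pamt ? MOCK_SUCCESS : MOCK_MISSING_PAMT_PAGE_PAIR;
}

/* Mock of TDH.PHYMEM.PAMT.ADD for the single region. */
static void mock_pamt_add(void)
{
	region_has_pamt = true;
	nr_pamt_add++;
}

/* Lazy flow: try first, add the page pair only if it was missing. */
static enum mock_status lazy_page_aug(void)
{
	enum mock_status st = mock_page_aug();

	if (st == MOCK_MISSING_PAMT_PAGE_PAIR) {
		mock_pamt_add();
		st = mock_page_aug();	/* single retry */
	}
	return st;
}
```

In the mock the retry always succeeds. On a real system nothing stops
another CPU from removing the page pair between the add and the retry,
because no refcount is held on the kernel side.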

The motivation was to avoid duplicating on the kernel side the PAMT memory
refcounting that the TDX module already does.

This approach is inherently more racy as we don't serialize PAMT memory
add/remove against SEAMCALLs that add/remove memory for a TD. Such
serialization would require global locking, which is a no-go.

I made this approach work, but at some point I realized that it cannot be
robust as long as we want to avoid TDX_OPERAND_BUSY loops.
TDX_OPERAND_BUSY will pop up as a result of the races I mentioned above.

I gave up on this approach and went with the current one which uses
explicit refcounting.


Brain dumped.

Let me know if anything is unclear.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
