Message-ID: <sjhioktjzegjmyuaisde7ui7lsrhnolx6yjmikhhwlxxfba5bh@ss6igliiimas>
Date: Mon, 11 Aug 2025 07:31:09 +0100
From: "kas@...nel.org" <kas@...nel.org>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
Cc: "pbonzini@...hat.com" <pbonzini@...hat.com>,
"seanjc@...gle.com" <seanjc@...gle.com>, "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
"Gao, Chao" <chao.gao@...el.com>, "bp@...en8.de" <bp@...en8.de>,
"Huang, Kai" <kai.huang@...el.com>, "x86@...nel.org" <x86@...nel.org>,
"mingo@...hat.com" <mingo@...hat.com>, "Zhao, Yan Y" <yan.y.zhao@...el.com>,
"tglx@...utronix.de" <tglx@...utronix.de>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"linux-coco@...ts.linux.dev" <linux-coco@...ts.linux.dev>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
On Fri, Aug 08, 2025 at 11:18:40PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2025-06-09 at 22:13 +0300, Kirill A. Shutemov wrote:
> > Dynamic PAMT enabling in kernel
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > The kernel maintains a refcount for every 2M region, managed with two
> > helpers: tdx_pamt_get() and tdx_pamt_put().
> >
> > The refcount represents the number of users of the PAMT memory in the
> > region. The kernel calls TDH.PHYMEM.PAMT.ADD on the 0->1 transition and
> > TDH.PHYMEM.PAMT.REMOVE on the 1->0 transition.
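> >
> > A minimal sketch of the idea in C (pamt_refcount() and the
> > tdh_phymem_pamt_*() wrappers are illustrative names, not the exact
> > code in the series; the real ADD call also passes the pre-allocated
> > PAMT page pair to donate):
> >
> > static atomic_t *pamt_refcount(unsigned long hpa); /* per-2M lookup */
> > static DEFINE_SPINLOCK(pamt_lock);
> >
> > static int tdx_pamt_get(struct page *page)
> > {
> > 	unsigned long hpa = page_to_phys(page) & PMD_MASK;
> > 	atomic_t *cnt = pamt_refcount(hpa);
> > 	int ret = 0;
> >
> > 	/* Fast path: the 2M region already has PAMT backing */
> > 	if (atomic_inc_not_zero(cnt))
> > 		return 0;
> >
> > 	spin_lock(&pamt_lock);
> > 	/* 0 -> 1: back the 2M region with PAMT memory */
> > 	if (atomic_read(cnt) == 0)
> > 		ret = tdh_phymem_pamt_add(hpa);
> > 	if (!ret)
> > 		atomic_inc(cnt);
> > 	spin_unlock(&pamt_lock);
> >
> > 	return ret;
> > }
> >
> > static void tdx_pamt_put(struct page *page)
> > {
> > 	unsigned long hpa = page_to_phys(page) & PMD_MASK;
> >
> > 	spin_lock(&pamt_lock);
> > 	/* 1 -> 0: no users left, reclaim the PAMT memory */
> > 	if (atomic_dec_and_test(pamt_refcount(hpa)))
> > 		tdh_phymem_pamt_remove(hpa);
> > 	spin_unlock(&pamt_lock);
> > }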
> >
> > The function tdx_alloc_page() allocates a new page and ensures that it is
> > backed by PAMT memory. Pages allocated in this manner are ready to be used
> > for a TD. The function tdx_free_page() frees the page and releases the
> > PAMT memory for the 2M region if it is no longer needed.
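> >
> > Roughly, for illustration (simplified; error handling in the actual
> > patches is more involved):
> >
> > struct page *tdx_alloc_page(void)
> > {
> > 	struct page *page = alloc_page(GFP_KERNEL);
> >
> > 	if (!page)
> > 		return NULL;
> >
> > 	/* Make sure the 2M region around the page has PAMT backing */
> > 	if (tdx_pamt_get(page)) {
> > 		__free_page(page);
> > 		return NULL;
> > 	}
> >
> > 	return page;
> > }
> >
> > void tdx_free_page(struct page *page)
> > {
> > 	tdx_pamt_put(page);	/* may drop the last ref and remove PAMT */
> > 	__free_page(page);
> > }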
> >
> > PAMT memory gets allocated as part of TD init, VCPU init, on populating
> > the SEPT tree, and on adding guest memory (both during TD build and via
> > AUG on accept). Splitting a 2M page into 4K pages also requires PAMT
> > memory.
> >
> > PAMT memory is removed on reclaim of control pages and guest memory.
> >
> > Populating PAMT memory on fault and on split is tricky, as the kernel
> > cannot allocate memory in the context where it is needed. These code
> > paths use pre-allocated PAMT memory pools, sketched below.
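> >
> > This is similar in spirit to KVM's kvm_mmu_memory_cache: top up a pool
> > in a sleepable context, then consume from it where allocation is not
> > allowed. A rough sketch (the pool structure and names here are
> > illustrative only):
> >
> > struct pamt_pool {
> > 	struct list_head pages;
> > 	int nr;
> > };
> >
> > /* Called before taking the MMU lock, where GFP_KERNEL is allowed */
> > int tdx_pamt_pool_topup(struct pamt_pool *pool, int min)
> > {
> > 	while (pool->nr < min) {
> > 		struct page *page = alloc_page(GFP_KERNEL);
> >
> > 		if (!page)
> > 			return -ENOMEM;
> > 		list_add(&page->lru, &pool->pages);
> > 		pool->nr++;
> > 	}
> > 	return 0;
> > }
> >
> > /* Called from the fault/split path, where allocation is forbidden */
> > struct page *tdx_pamt_pool_take(struct pamt_pool *pool)
> > {
> > 	struct page *page;
> >
> > 	if (WARN_ON_ONCE(!pool->nr))
> > 		return NULL;
> >
> > 	page = list_first_entry(&pool->pages, struct page, lru);
> > 	list_del(&page->lru);
> > 	pool->nr--;
> > 	return page;
> > }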
> >
> > Previous attempt at Dynamic PAMT enabling
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > The initial attempt at kernel enabling was quite different. It was built
> > around lazy PAMT allocation: only trying to add a PAMT page pair when a
> > SEAMCALL failed due to a missing PAMT, and reclaiming it based on hints
> > provided by the TDX module.
> >
> > The motivation was to avoid duplicating the PAMT memory refcounting that
> > the TDX module does on the kernel side.
> >
> > This approach is inherently more racy as there is no serialization of
> > PAMT memory add/remove against SEAMCALLs that add/remove memory for a TD.
> > Such serialization would require global locking, which is not feasible.
> >
> > This approach worked, but at some point it became clear that it could not
> > be robust as long as the kernel avoids TDX_OPERAND_BUSY loops.
> > TDX_OPERAND_BUSY will occur as a result of the races mentioned above.
> >
> > This approach was abandoned in favor of explicit refcounting.
>
> On closer inspection, this new solution also has global locking. It
> opportunistically checks whether there is already a refcount, but otherwise,
> when a PAMT page actually has to be added, a global spinlock is held while
> the PAMT add/remove SEAMCALL is made. I guess this is going to get taken
> somewhere around once per 512 4k private pages, but when it does it has
> some less-than-ideal properties:
> - Cache line bouncing of the lock between all TDs on the host
> - A global exclusive lock deep inside the TDP MMU shared-lock fault path
> - Heavy contention when two TDs are shutting down at the same time?
>
> As for why not take the lock only as a backup option, like the kick+lock
> solution in KVM: the problem would be losing the refcount race and ending
> up with a PAMT page getting released early.
>
> As far as TDX module locking is concerned (i.e. BUSY error codes from pamt
> add/remove), it seems this would only happen if pamt add/remove operate
> simultaneously on the same 2MB HPA region. That is completely prevented by the
> refcount and global lock, but it's a bit heavyweight. It prevents simultaneously
> adding totally separate 2MB regions, when we would only need to prevent
> simultaneous operations on the same 2MB region.
>
> I don't see any other reason for the global spinlock. Kirill, was that it?
> Did you consider also adding a lock per 2MB region, like the refcount? Or
> any other granularity of lock besides global? Not saying global is
> definitely the wrong choice, but it seems arbitrary if I got the above
> right.
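>
> For example, hypothetically, a hashed array of spinlocks keyed by the
> 2MB PFN instead of one global lock (sizes and names made up):
>
> #define PAMT_LOCKS	1024	/* made-up granularity */
> static spinlock_t pamt_locks[PAMT_LOCKS];	/* spin_lock_init()'ed at boot */
>
> static spinlock_t *pamt_lock(unsigned long hpa)
> {
> 	/* Same 2MB region -> same lock; unrelated regions rarely collide */
> 	return &pamt_locks[hash_long(hpa >> PMD_SHIFT, ilog2(PAMT_LOCKS))];
> }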
We have discussed this before[1]. Global locking is only problematic when
you actually hit contention. Let's not complicate things until we actually
see it. I failed to demonstrate contention even without huge pages. With
huge pages it is even more dubious that we will ever see it.
[1] https://lore.kernel.org/all/4bb2119a-ff6d-42b6-acf4-86d87b0e9939@intel.com/
--
Kiryl Shutsemau / Kirill A. Shutemov