Message-ID: <sjhioktjzegjmyuaisde7ui7lsrhnolx6yjmikhhwlxxfba5bh@ss6igliiimas>
Date: Mon, 11 Aug 2025 07:31:09 +0100
From: "kas@...nel.org" <kas@...nel.org>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
Cc: "pbonzini@...hat.com" <pbonzini@...hat.com>,
"seanjc@...gle.com" <seanjc@...gle.com>, "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
"Gao, Chao" <chao.gao@...el.com>, "bp@...en8.de" <bp@...en8.de>,
"Huang, Kai" <kai.huang@...el.com>, "x86@...nel.org" <x86@...nel.org>,
"mingo@...hat.com" <mingo@...hat.com>, "Zhao, Yan Y" <yan.y.zhao@...el.com>,
"tglx@...utronix.de" <tglx@...utronix.de>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"linux-coco@...ts.linux.dev" <linux-coco@...ts.linux.dev>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
On Fri, Aug 08, 2025 at 11:18:40PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2025-06-09 at 22:13 +0300, Kirill A. Shutemov wrote:
> > Dynamic PAMT enabling in kernel
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > The kernel maintains a refcount for every 2M region, managed with two
> > helpers: tdx_pamt_get() and tdx_pamt_put().
> >
> > The refcount represents the number of users of the PAMT memory in the
> > region. The kernel calls TDH.PHYMEM.PAMT.ADD on the 0->1 transition and
> > TDH.PHYMEM.PAMT.REMOVE on the 1->0 transition.
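> >
> > A minimal sketch of the idea in C (pamt_refcount() and the
> > tdh_phymem_pamt_*() wrappers are illustrative names, not the exact
> > code in the series; the real ADD call also passes the pre-allocated
> > PAMT page pair to donate):
> >
> > static atomic_t *pamt_refcount(unsigned long hpa); /* per-2M lookup */
> > static DEFINE_SPINLOCK(pamt_lock);
> >
> > static int tdx_pamt_get(struct page *page)
> > {
> > 	unsigned long hpa = page_to_phys(page) & PMD_MASK;
> > 	atomic_t *cnt = pamt_refcount(hpa);
> > 	int ret = 0;
> >
> > 	/* Fast path: the 2M region already has PAMT backing */
> > 	if (atomic_inc_not_zero(cnt))
> > 		return 0;
> >
> > 	spin_lock(&pamt_lock);
> > 	/* 0 -> 1: back the 2M region with PAMT memory */
> > 	if (atomic_read(cnt) == 0)
> > 		ret = tdh_phymem_pamt_add(hpa);
> > 	if (!ret)
> > 		atomic_inc(cnt);
> > 	spin_unlock(&pamt_lock);
> >
> > 	return ret;
> > }
> >
> > static void tdx_pamt_put(struct page *page)
> > {
> > 	unsigned long hpa = page_to_phys(page) & PMD_MASK;
> >
> > 	spin_lock(&pamt_lock);
> > 	/* 1 -> 0: no users left, reclaim the PAMT memory */
> > 	if (atomic_dec_and_test(pamt_refcount(hpa)))
> > 		tdh_phymem_pamt_remove(hpa);
> > 	spin_unlock(&pamt_lock);
> > }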
> >
> > The function tdx_alloc_page() allocates a new page and ensures that it is
> > backed by PAMT memory. Pages allocated in this manner are ready to be used
> > for a TD. The function tdx_free_page() frees the page and releases the
> > PAMT memory for the 2M region if it is no longer needed.
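> >
> > Roughly, for illustration (simplified; error handling in the actual
> > patches is more involved):
> >
> > struct page *tdx_alloc_page(void)
> > {
> > 	struct page *page = alloc_page(GFP_KERNEL);
> >
> > 	if (!page)
> > 		return NULL;
> >
> > 	/* Make sure the 2M region around the page has PAMT backing */
> > 	if (tdx_pamt_get(page)) {
> > 		__free_page(page);
> > 		return NULL;
> > 	}
> >
> > 	return page;
> > }
> >
> > void tdx_free_page(struct page *page)
> > {
> > 	tdx_pamt_put(page);	/* may drop the last ref and remove PAMT */
> > 	__free_page(page);
> > }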
> >
> > PAMT memory gets allocated as part of TD init, VCPU init, on populating
> > the SEPT tree, and on adding guest memory (both during TD build and via
> > AUG on accept). Splitting a 2M page into 4K pages also requires PAMT
> > memory.
> >
> > PAMT memory is removed on reclaim of control pages and guest memory.
> >
> > Populating PAMT memory on fault and on split is tricky, as the kernel
> > cannot allocate memory in the context where it is needed. These code
> > paths use pre-allocated PAMT memory pools, sketched below.
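> >
> > This is similar in spirit to KVM's kvm_mmu_memory_cache: top up a pool
> > in a sleepable context, then consume from it where allocation is not
> > allowed. A rough sketch (the pool structure and names here are
> > illustrative only):
> >
> > struct pamt_pool {
> > 	struct list_head pages;
> > 	int nr;
> > };
> >
> > /* Called before taking the MMU lock, where GFP_KERNEL is allowed */
> > int tdx_pamt_pool_topup(struct pamt_pool *pool, int min)
> > {
> > 	while (pool->nr < min) {
> > 		struct page *page = alloc_page(GFP_KERNEL);
> >
> > 		if (!page)
> > 			return -ENOMEM;
> > 		list_add(&page->lru, &pool->pages);
> > 		pool->nr++;
> > 	}
> > 	return 0;
> > }
> >
> > /* Called from the fault/split path, where allocation is forbidden */
> > struct page *tdx_pamt_pool_take(struct pamt_pool *pool)
> > {
> > 	struct page *page;
> >
> > 	if (WARN_ON_ONCE(!pool->nr))
> > 		return NULL;
> >
> > 	page = list_first_entry(&pool->pages, struct page, lru);
> > 	list_del(&page->lru);
> > 	pool->nr--;
> > 	return page;
> > }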
> >
> > Previous attempt at Dynamic PAMT enabling
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > The initial attempt at kernel enabling was quite different. It was built
> > around lazy PAMT allocation: only trying to add a PAMT page pair when a
> > SEAMCALL failed due to a missing PAMT, and reclaiming it based on hints
> > provided by the TDX module.
> >
> > The motivation was to avoid duplicating the PAMT memory refcounting that
> > the TDX module does on the kernel side.
> >
> > This approach is inherently more racy as there is no serialization of
> > PAMT memory add/remove against SEAMCALLs that add/remove memory for a TD.
> > Such serialization would require global locking, which is not feasible.
> >
> > This approach worked, but at some point it became clear that it could not
> > be robust as long as the kernel avoids TDX_OPERAND_BUSY loops.
> > TDX_OPERAND_BUSY will occur as a result of the races mentioned above.
> >
> > This approach was abandoned in favor of explicit refcounting.
>
> On closer inspection, this new solution also has global locking. It
> opportunistically checks whether there is already a refcount, but otherwise,
> when a PAMT page actually has to be added, a global spinlock is held while
> the PAMT add/remove SEAMCALL is made. I guess this is going to get taken
> somewhere around once per 512 4k private pages, but when it does it has
> some less-than-ideal properties:
> - Cache line bouncing of the lock between all TDs on the host
> - A global exclusive lock deep inside the TDP MMU shared-lock fault path
> - Heavy contention when two TDs are shutting down at the same time?
>
> As for why not take the lock only as a backup option, like the kick+lock
> solution in KVM: the problem would be losing the refcount race and ending
> up with a PAMT page getting released early.
>
> As far as TDX module locking is concerned (i.e. BUSY error codes from pamt
> add/remove), it seems this would only happen if pamt add/remove operate
> simultaneously on the same 2MB HPA region. That is completely prevented by the
> refcount and global lock, but it's a bit heavyweight. It prevents simultaneously
> adding totally separate 2MB regions, when we would only need to prevent
> simultaneous operations on the same 2MB region.
>
> I don't see any other reason for the global spinlock. Kirill, was that it?
> Did you consider also adding a lock per 2MB region, like the refcount? Or
> any other granularity of lock besides global? Not saying global is
> definitely the wrong choice, but it seems arbitrary if I got the above
> right.
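>
> For example, hypothetically, a hashed array of spinlocks keyed by the
> 2MB PFN instead of one global lock (sizes and names made up):
>
> #define PAMT_LOCKS	1024	/* made-up granularity */
> static spinlock_t pamt_locks[PAMT_LOCKS];	/* spin_lock_init()'ed at boot */
>
> static spinlock_t *pamt_lock(unsigned long hpa)
> {
> 	/* Same 2MB region -> same lock; unrelated regions rarely collide */
> 	return &pamt_locks[hash_long(hpa >> PMD_SHIFT, ilog2(PAMT_LOCKS))];
> }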
We have discussed this before[1]. Global locking is only problematic when
you actually hit contention. Let's not complicate things until we actually
see it. I failed to demonstrate contention even without huge pages. With
huge pages it is even more dubious that we will ever see it.
[1] https://lore.kernel.org/all/4bb2119a-ff6d-42b6-acf4-86d87b0e9939@intel.com/
--
Kiryl Shutsemau / Kirill A. Shutemov