Message-ID: <aJqgosNUjrCfH_WN@google.com>
Date: Mon, 11 Aug 2025 19:02:10 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Rick P Edgecombe <rick.p.edgecombe@...el.com>
Cc: "kas@...nel.org" <kas@...nel.org>, Chao Gao <chao.gao@...el.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, Kai Huang <kai.huang@...el.com>,
"bp@...en8.de" <bp@...en8.de>, "x86@...nel.org" <x86@...nel.org>, "mingo@...hat.com" <mingo@...hat.com>,
Yan Y Zhao <yan.y.zhao@...el.com>,
"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
"tglx@...utronix.de" <tglx@...utronix.de>,
"linux-coco@...ts.linux.dev" <linux-coco@...ts.linux.dev>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
Isaku Yamahata <isaku.yamahata@...el.com>
Subject: Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
On Mon, Aug 11, 2025, Rick P Edgecombe wrote:
> On Mon, 2025-08-11 at 07:31 +0100, kas@...nel.org wrote:
> > > I don't see any other reason for the global spin lock, Kirill, was that
> > > it? Did you consider also adding a lock per 2MB region, like the
> > > refcount? Or any other granularity of lock besides global? Not saying
> > > global is definitely the wrong choice, but it seems arbitrary if I got the
> > > above right.
> >
> > We have discussed this before[1]. Global locking is problematic when you
> > actually hit contention. Let's not complicate things until we actually
> > see it. I failed to demonstrate contention without huge pages. With huge
> > pages it is even more dubious that we will ever see it.
> >
> > [1]
> > https://lore.kernel.org/all/4bb2119a-ff6d-42b6-acf4-86d87b0e9939@intel.com/
>
> Ah, I see.
>
> I just did a test of simultaneously starting 10 VMs with 16GB of RAM (non huge
How many vCPUs? And were the VMs actually accepting/faulting all 16GiB?
There's also a noisy neighbor problem lurking. E.g. a malicious/buggy VM spams
private<=>shared conversions and thus interferes with PAMT allocations for other
VMs.
> pages) and then shutting them down. I saw 701 contentions on startup, and 53
> more on shutdown. Total wait time 2ms. Not horrible but not theoretical either.
> But it probably wasn't much of a cacheline bouncing worst case.
Isn't the SEAMCALL done while holding the spinlock? I assume the latency of the
SEAMCALL is easily the long pole in the flow.
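
For the sake of discussion, a rough sketch of the pattern I'm worried about
(this is illustrative only, not the actual patch; tdh_phymem_pamt_add() and
pamt_refcount() are stand-in names):

static DEFINE_SPINLOCK(pamt_lock);

/*
 * Illustrative only: a single global lock serializes PAMT.ADD for all 2MB
 * regions, so every waiter eats the full SEAMCALL latency of the holder.
 */
static int tdx_pamt_get(unsigned long hpa_2m, struct page **pamt_pages)
{
	atomic_t *refcount = pamt_refcount(hpa_2m);	/* per-2MB refcount */
	int ret = 0;
	u64 err;

	spin_lock(&pamt_lock);
	if (atomic_inc_return(refcount) > 1)
		goto out;			/* 2MB range already backed */

	/* The SEAMCALL is issued with the global lock held. */
	err = tdh_phymem_pamt_add(hpa_2m, pamt_pages);
	if (err) {
		atomic_dec(refcount);
		ret = -EIO;
	}
out:
	spin_unlock(&pamt_lock);
	return ret;
}

If that's roughly the shape of the code, then what waiters observe is the
SEAMCALL latency, not cacheline bouncing.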
> And I guess this is on my latest changes, not this exact v2, but it shouldn't
> have changed.
>
> But hmm, it seems Dave's objection about maintaining the lock allocations would
> apply to the refcounts too? But the hotplug concerns shouldn't actually be an
> issue for TDX because hotplug gets rejected if the allocations are not already
> there. So the complexity of a per-2MB lock should be minimal, at least
> incrementally. The difference seems more about memory use vs performance.
>
> What gives me pause is that in the KVM TDX work we have tried really hard to not take
> exclusive locks in the shared MMU lock path. Admittedly that wasn't backed by
> hard numbers.
Maybe not for TDX, but we have lots and lots of hard numbers for why taking mmu_lock
for write is problematic. Even if TDX VMs don't exhibit the same patterns *today*
as "normal" VMs, i.e. don't suffer the same performance blips, nothing guarantees
that will always hold true.
> But an enormous amount of work went into letting KVM faults happen under the
> shared lock for normal VMs. So on one hand, yes it's premature optimization.
> But on the other hand, it's a maintainability concern about polluting the
> existing way things work in KVM with special TDX properties.
>
> I think we need to at least call out loudly that the decision was to go with the
> simplest possible solution, and the impact to KVM. I'm not sure what Sean's
> opinion is, but I wouldn't want him to first learn of it when he went digging
> and found a buried global spin lock in the fault path.
Heh, too late, I saw it when this was first posted. And to be honest, my initial
reaction was very much "absolutely not" (though Rated R, not PG). Now that I've
had time to think things through, I'm not _totally_ opposed to having a spinlock
in the page fault path, but my overall sentiment remains the same.
For mmu_lock and related SPTE operations, I was super adamant about not taking
exclusive locks because based on our experience with the TDP MMU, converting flows
from exclusive to shared is usually significantly more work than developing code
for "shared mode" straightaway (and you note above, that wasn't trivial for TDX).
And importantly, those code paths were largely solved problems. I.e. I didn't
want to get into a situation where TDX undid the parallelization of the TDP MMU,
and then had to add it back after the fact.
I think the same holds true here. I'm not completely opposed to introducing a
spinlock, but I want to either have a very high level of confidence that the lock
won't introduce jitter/delay (I have low confidence on this front, at least in
the proposed patches), or have super clear line of sight to making the contention
irrelevant, without having to rip apart the code.
My biggest question at this point is: why is all of this being done on-demand?
IIUC, we swung from "allocate all PAMT_4K pages upfront" to "allocate all PAMT_4K
pages at the last possible moment". Neither of those seems ideal.
E.g. for things like TDCS pages and to some extent non-leaf S-EPT pages, on-demand
PAMT management seems reasonable. But for PAMTs that are used to track guest-assigned
memory, which is the vaaast majority of PAMT memory, why not hook guest_memfd?
I.e. setup PAMT crud when guest_memfd is populated, not when the memory is mapped
into the guest. That way, setups that care about guest boot time can preallocate
guest_memfd in order to get the PAMT stuff out of the way.
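
To make that concrete, a hedged sketch of what hooking guest_memfd population
could look like (the hook name, its signature, and tdx_pamt_get() are all
hypothetical):

/*
 * Hypothetical guest_memfd preparation hook, illustrative only: back the
 * PAMT for every 2MB region covered by a newly allocated folio, so that
 * preallocating guest_memfd also preallocates the PAMT backing.
 */
static int tdx_gmem_prepare_folio(struct kvm *kvm, struct folio *folio)
{
	unsigned long hpa = folio_pfn(folio) << PAGE_SHIFT;
	unsigned long end = hpa + folio_size(folio);
	int ret;

	for (hpa &= PMD_MASK; hpa < end; hpa += PMD_SIZE) {
		ret = tdx_pamt_get(hpa, NULL);	/* stand-in helper */
		if (ret)
			return ret;
	}
	return 0;
}

With something along those lines, a setup that preallocates guest_memfd pays
the PAMT cost at VM creation time instead of at fault time.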
You could do the same thing by prefaulting guest memory, but TDX has limitations
there, and I see very little value in precisely reclaiming PAMT memory when a
leaf S-EPT is zapped, i.e. when a page is converted from private=>shared. As
above, that's just asking for noisy neighbor issues.
The complaint with static PAMT is that it requires burning 0.4% of memory even
if the host isn't actively running TDX VMs. Burning 0.4% of the memory assigned
to a guest, regardless of whether it's mapped private or shared, seems acceptable,
and I think would give us a lot more flexibility in avoiding locking issues.
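For scale, 0.4% of a 16GiB guest works out to roughly 65MiB of PAMT, i.e. noise
compared to the guest itself.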
Similarly, we could bind a PAMT to non-leaf S-EPT pages during mmu_topup_memory_caches(),
i.e. when arch.mmu_external_spt_cache is filled. Then there would be no need for
a separate vcpu->arch.pamt_page_cache, and more work would be done outside of
mmu_lock. Freeing SPTs would still be done under mmu_lock (I think), but that
should be a much rarer operation.
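
E.g. a very rough, untested sketch of the idea (tdx_pamt_get() is a stand-in
for whatever ends up binding the PAMT_4K pages; the cache helpers are the
existing KVM ones):

/*
 * Illustrative only: pair each page added to the external SPT cache with
 * its PAMT backing at topup time, i.e. outside of mmu_lock, so that no
 * PAMT allocation or locking is needed later in the fault path.
 */
static int mmu_topup_external_spt_cache(struct kvm_vcpu *vcpu, int min)
{
	struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_external_spt_cache;
	int start = kvm_mmu_memory_cache_nr_free_objects(mc);
	int r, i;

	r = kvm_mmu_topup_memory_cache(mc, min);
	if (r)
		return r;

	/* Bind PAMT only for the newly cached pages. */
	for (i = start; i < kvm_mmu_memory_cache_nr_free_objects(mc); i++) {
		r = tdx_pamt_get(__pa(mc->objects[i]) & PMD_MASK, NULL);
		if (r)
			return r;
	}
	return 0;
}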