Message-ID: <cxww35wskgjssvb7l7xedu4dimjpxunzoadexeg2qcev6ch2kc@xeed2z7ivaxk>
Date: Tue, 12 Aug 2025 09:03:07 +0100
From: "kas@...nel.org" <kas@...nel.org>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Rick P Edgecombe <rick.p.edgecombe@...el.com>, 
	Chao Gao <chao.gao@...el.com>, "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, 
	Kai Huang <kai.huang@...el.com>, "bp@...en8.de" <bp@...en8.de>, "x86@...nel.org" <x86@...nel.org>, 
	"mingo@...hat.com" <mingo@...hat.com>, Yan Y Zhao <yan.y.zhao@...el.com>, 
	"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>, 
	"tglx@...utronix.de" <tglx@...utronix.de>, "linux-coco@...ts.linux.dev" <linux-coco@...ts.linux.dev>, 
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, Isaku Yamahata <isaku.yamahata@...el.com>
Subject: Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT

On Mon, Aug 11, 2025 at 07:02:10PM -0700, Sean Christopherson wrote:
> On Mon, Aug 11, 2025, Rick P Edgecombe wrote:
> > On Mon, 2025-08-11 at 07:31 +0100, kas@...nel.org wrote:
> > > > I don't see any other reason for the global spin lock; Kirill, was that
> > > > it?  Did you consider also adding a lock per 2MB region, like the
> > > > refcount? Or any other granularity of lock besides global? Not saying
> > > > global is definitely the wrong choice, but seems arbitrary if I got the
> > > > above right.
> > > 
> > > We have discussed this before[1]. Global locking is problematic when you
> > > actually hit contention. Let's not complicate things until we actually
> > > see it. I failed to demonstrate contention without huge pages. With huge
> > > pages it is even more dubious that we would ever see it.
> > > 
> > > [1]
> > > https://lore.kernel.org/all/4bb2119a-ff6d-42b6-acf4-86d87b0e9939@intel.com/
> > 
> > Ah, I see.
> > 
> > I just did a test of simultaneously starting 10 VMs with 16GB of RAM (non-huge
> 
> How many vCPUs?  And were the VMs actually accepting/faulting all 16GiB?
> 
> There's also a noisy neighbor problem lurking.  E.g. malicious/buggy VM spams
> private<=>shared conversions and thus interferes with PAMT allocations for other
> VMs.
> 
> > pages) and then shutting them down. I saw 701 contentions on startup, and 53
> > more on shutdown. Total wait time 2ms. Not horrible but not theoretical either.
> > But it probably wasn't much of a cacheline-bouncing worst case.
> 
> Isn't the SEAMCALL done while holding the spinlock?  I assume the latency of the
> SEAMCALL is easily the long pole in the flow.
> 
> > And I guess this is on my latest changes not this exact v2, but it shouldn't
> > have changed.
> > 
> > But hmm, it seems Dave's objection about maintaining the lock allocations would
> > apply to the refcounts too? But the hotplug concerns shouldn't actually be an
> > issue for TDX because they get rejected if the allocations are not already
> > there. So complexity of a per-2MB lock should be minimal, at least
> > incrementally. The difference seems more about memory use vs performance.

I don't see a jump to per-2MB locking as remotely justified. We can scale
the number of locks gradually with the amount of memory in the system: have
a power-of-2 set of locks and map each 2MB range to a lock with %.

Note that this is a trivial thing to add later on and doesn't need to be
part of the initial design.
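
Roughly something like this (just a sketch; the helper name and the number
of locks are made up for illustration, not taken from the series):

#include <linux/spinlock.h>
#include <linux/pgtable.h>	/* PMD_SHIFT */

/* Power-of-2 number of locks; could be sized from memory size at boot. */
#define NR_PAMT_LOCKS	64

static spinlock_t pamt_locks[NR_PAMT_LOCKS];

static void pamt_locks_init(void)
{
	int i;

	for (i = 0; i < NR_PAMT_LOCKS; i++)
		spin_lock_init(&pamt_locks[i]);
}

/* Map the 2MB range containing @hpa to one of the locks. */
static spinlock_t *pamt_lock_for(unsigned long hpa)
{
	return &pamt_locks[(hpa >> PMD_SHIFT) % NR_PAMT_LOCKS];
}

Whether it ends up being % or a proper hash doesn't matter much; the point
is that the number of locks can grow without the callers noticing.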

> > What gives me pause is in the KVM TDX work we have really tried hard to not take
> > exclusive locks in the shared MMU lock path. Admittedly that wasn't backed by
> > hard numbers.
> 
> Maybe not for TDX, but we have lots and lots of hard numbers for why taking mmu_lock
> for write is problematic.  Even if TDX VMs don't exhibit the same patterns *today*
> as "normal" VMs, i.e. don't suffer the same performance blips, nothing guarantees
> that will always hold true.
>  
> > But an enormous amount of work went into letting KVM faults happen under the
> > shared lock for normal VMs. So on one hand, yes it's premature optimization.
> > But on the other hand, it's a maintainability concern about polluting the
> > existing way things work in KVM with special TDX properties.
> > 
> > I think we need to at least call out loudly that the decision was to go with the
> > simplest possible solution, and the impact to KVM. I'm not sure what Sean's
> > opinion is, but I wouldn't want him to first learn of it when he went digging
> > and found a buried global spin lock in the fault path.
> 
> Heh, too late, I saw it when this was first posted.  And to be honest, my initial
> reaction was very much "absolutely not" (though Rated R, not PG).  Now that I've
> had time to think things through, I'm not _totally_ opposed to having a spinlock
> in the page fault path, but my overall sentiment remains the same.
> 
> For mmu_lock and related SPTE operations, I was super adamant about not taking
> exclusive locks because based on our experience with the TDP MMU, converting flows
> from exclusive to shared is usually significantly more work than developing code
> for "shared mode" straightaway (and you note above, that wasn't trivial for TDX).
> And importantly, those code paths were largely solved problems.  I.e. I didn't
> want to get into a situation where TDX undid the parallelization of the TDP MMU,
> and then had to add it back after the fact.
> 
> I think the same holds true here.  I'm not completely opposed to introducing a
> spinlock, but I want to either have a very high level of confidence that the lock
> won't introduce jitter/delay (I have low confidence on this front, at least in
> the proposed patches), or have super clear line of sight to making the contention
> irrelevant, without having to rip apart the code.

I think there is a big difference with mmu_lock.

mmu_lock is analogous to mmap_lock in core-mm. It serializes page faults
against other MMU operations and has an inherently vast scope.

pamt_lock, on the other hand, sits at the very bottom of the call chain and
has a very limited scope. It is trivially scalable by partitioning, as in
the sketch below.
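
To make the scope concrete, what I have in mind is roughly the following,
reusing pamt_lock_for() from the sketch above (function names here are
placeholders, not the actual code in the series):

/*
 * The lock is held only around the PAMT add SEAMCALL for one 2MB range,
 * at the very bottom of the fault path. mmu_lock is taken (shared) much
 * higher up and is not affected.
 */
static int tdx_pamt_add(unsigned long hpa_2mb, struct page **pamt_pages)
{
	spinlock_t *lock = pamt_lock_for(hpa_2mb);
	int ret;

	spin_lock(lock);
	ret = seamcall_pamt_add(hpa_2mb, pamt_pages);	/* hypothetical wrapper */
	spin_unlock(lock);

	return ret;
}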

Translating the problems you see with mmu_lock onto pamt_lock seems like an
overreaction.

-- 
Kiryl Shutsemau / Kirill A. Shutemov
