Message-ID: <aGSGPc+aovhd/SsN@yzhao56-desk.sh.intel.com>
Date: Wed, 2 Jul 2025 09:07:09 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
CC: "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao"
<xiaoyao.li@...el.com>, "Huang, Kai" <kai.huang@...el.com>, "Du, Fan"
<fan.du@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com"
<thomas.lendacky@....com>, "vbabka@...e.cz" <vbabka@...e.cz>, "Li, Zhiquan1"
<zhiquan1.li@...el.com>, "Shutemov, Kirill" <kirill.shutemov@...el.com>,
"michael.roth@....com" <michael.roth@....com>, "seanjc@...gle.com"
<seanjc@...gle.com>, "Weiny, Ira" <ira.weiny@...el.com>, "Peng, Chao P"
<chao.p.peng@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
"Yamahata, Isaku" <isaku.yamahata@...el.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "binbin.wu@...ux.intel.com"
<binbin.wu@...ux.intel.com>, "ackerleytng@...gle.com"
<ackerleytng@...gle.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"Annapurve, Vishal" <vannapurve@...gle.com>, "tabba@...gle.com"
<tabba@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun"
<jun.miao@...el.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is
RUNNABLE
On Wed, Jul 02, 2025 at 08:18:48AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2025-07-02 at 08:12 +0800, Yan Zhao wrote:
> > > Then (3) could be skipped in the case of ability to demote under read lock?
> > >
> > > I noticed that the other lpage_info updaters took mmu write lock, and I
> > > wasn't sure why. We shouldn't take a lock that we don't actually need
> > > just for safety margin or to copy other code.
> > Using the write mmu_lock is for a reason.
> >
> > In the 3 steps,
> > 1. set lpage_info
> > 2. demote if needed
> > 3. go to fault handler
> >
> > Step 2 requires holding the write mmu_lock before invoking
> > kvm_split_boundary_leafs(). The write mmu_lock may also be dropped and
> > re-acquired inside kvm_split_boundary_leafs(), e.g. for memory allocation.
> >
> > If step 1 is done with the read mmu_lock, other vCPUs could still fault in
> > at 2MB level after the demote in step 2.
> > Luckily, TDX doesn't currently support promotion.
> > But we can avoid wading into this complex situation by holding the write
> > mmu_lock in step 1.
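In code form, the ordering under discussion is roughly the following (a sketch
only; hugepage_set_guest_inhibit() and kvm_split_boundary_leafs() come from
this series, and their exact signatures are assumed here):

write_lock(&kvm->mmu_lock);
/* 1. set lpage_info so concurrent faults see the reduced max level */
hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);
/* 2. demote any existing huge leaf; note the write mmu_lock may be
 * dropped and re-acquired internally, e.g. for memory allocation */
kvm_split_boundary_leafs(kvm, range);
write_unlock(&kvm->mmu_lock);
/* 3. fall through to the fault handler, which maps at 4KB level */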
>
> I don't think because some code might race in the future is a good reason to
> take the write lock.
I still prefer to hold the write mmu_lock right now.
Otherwise, we at least need to convert disallow_lpage to an atomic variable
and update it atomically, e.g. via cmpxchg.
struct kvm_lpage_info {
	int disallow_lpage;
};
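For illustration, a minimal sketch of what such a conversion might look like
(hypothetical, not actual kernel code; adjust_disallow_lpage() is a made-up
helper):

/* Hypothetical: disallow_lpage converted to atomic_t so that concurrent
 * updaters under the read mmu_lock can't lose updates. */
struct kvm_lpage_info {
	atomic_t disallow_lpage;
};

static void adjust_disallow_lpage(struct kvm_lpage_info *linfo, int delta)
{
	int old = atomic_read(&linfo->disallow_lpage);

	/* atomic_try_cmpxchg() refreshes 'old' on failure, so simply
	 * retry until the compare-and-exchange succeeds. */
	while (!atomic_try_cmpxchg(&linfo->disallow_lpage, &old, old + delta))
		;
}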
> > > > > For most accept cases, we could fault in the PTEs on the read lock.
> > > > > And in the future we could
> > > >
> > > > The actual mapping at 4KB level is still done with the read mmu_lock in
> > > > __vmx_handle_ept_violation().
> > > >
> > > > > have a demote that could work under read lock, as we talked. So
> > > > > kvm_split_boundary_leafs() would often be unneeded, or could work
> > > > > under the read lock when needed.
> > > > Could we leave the "demote under read lock" as a future optimization?
> > >
> > > We could add it to the list. If we have a TDX module that supports demote
> > > with a single SEAMCALL then we don't have the rollback problem. The
> > > optimization could utilize that. That said, we should focus on the
> > > optimizations that make the biggest difference to real TDs. Your data
> > > suggests this might not be the case today.
> > Ok.
> >
> > > > > What is the problem in hugepage_set_guest_inhibit() that requires the
> > > > > write lock?
> > > > As above, to avoid other vCPUs reading a stale mapping level and
> > > > splitting under the read mmu_lock.
> > >
> > > We need mmu write lock for demote, but as long as the order is:
> > > 1. set lpage_info
> > > 2. demote if needed
> > > 3. go to fault handler
> > >
> > > Then (3) should have what it needs even if another fault races (1).
> > See the above comment for why we need to hold the write mmu_lock for 1.
> >
> > Besides, as we need the write mmu_lock anyway for 2 (i.e. hold the write
> > mmu_lock before walking the SPTEs to check if there's any existing
> > mapping), I don't see any performance impact from holding the write
> > mmu_lock for 1.
>
> It's a maintainability problem too. Someday someone may want to remove it and
> scratch their head over what race they are missing.
I don't see why holding the write mmu_lock would cause a maintainability
problem. On the contrary, if we want to switch to the read mmu_lock in the
future, we would need to carefully check for any potential risk.
> > > > As guest_inhibit is set one-way, we could test it using
> > > > hugepage_test_guest_inhibit() without holding the lock. The need to
> > > > hold the write mmu_lock for hugepage_set_guest_inhibit() is then
> > > > greatly reduced (in my testing, 11 times during VM boot).
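For reference, the pattern described above is roughly the following (a sketch
using the helper names from this series; their signatures and the surrounding
context are assumed):

/* Lock-free fast path: the inhibit bit is set one-way, so a racing
 * reader can at worst see a spurious "not set" and fall through to
 * the locked slow path, where setting again is harmless. */
if (!hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M)) {
	write_lock(&kvm->mmu_lock);
	hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);
	write_unlock(&kvm->mmu_lock);
}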
> > > >
> > > > > But in any case, it seems like we have *a* solution here. It doesn't
> > > > > seem like there are any big downsides. Should we close it?
> > > > I think it's good, as long as Sean doesn't disagree :)
> > >
> > > He seemed onboard. Let's close it. We can even discuss lpage_info update
> > > locking on v2.
> > Ok. I'll use write mmu_lock for updating lpage_info in v2 first.
>
> Specifically, why do the other lpage_info updating functions take the mmu
> write lock? Are you sure there is no other reason?
1. The read mmu_lock can't prevent other vCPUs from reading stale lpage_info.
2. Shadow code in the KVM MMU only holds the write mmu_lock, so it updates
   lpage_info with the write mmu_lock.
3. lpage_info is not updated atomically. If two vCPUs update lpage_info
   concurrently, lpage_info may end up with an invalid value (see the sketch
   below).
4. lpage_info is not updated in performance-critical paths, so there's no
   need to use the read mmu_lock.
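To illustrate point 3, here is a paraphrase of the existing updater (based on
update_gfn_disallow_lpage_count() in arch/x86/kvm/mmu/mmu.c; details may vary
by kernel version):

static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
					    gfn_t gfn, int count)
{
	struct kvm_lpage_info *linfo;
	int i;

	for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
		linfo = lpage_info_slot(gfn, slot, i);
		/* Non-atomic read-modify-write: two vCPUs doing this
		 * concurrently under the read mmu_lock could lose an
		 * update, hence the write mmu_lock. */
		linfo->disallow_lpage += count;
	}
}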