Message-ID: <f13239faa54062feb9937fe9fc6d087ffcca7ac6.camel@intel.com>
Date: Wed, 2 Jul 2025 00:18:48 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "Zhao, Yan Y" <yan.y.zhao@...el.com>
CC: "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao"
	<xiaoyao.li@...el.com>, "Huang, Kai" <kai.huang@...el.com>, "Du, Fan"
	<fan.du@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
	"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com"
	<thomas.lendacky@....com>, "vbabka@...e.cz" <vbabka@...e.cz>, "Li, Zhiquan1"
	<zhiquan1.li@...el.com>, "Shutemov, Kirill" <kirill.shutemov@...el.com>,
	"michael.roth@....com" <michael.roth@....com>, "seanjc@...gle.com"
	<seanjc@...gle.com>, "Weiny, Ira" <ira.weiny@...el.com>, "Peng, Chao P"
	<chao.p.peng@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
	"Yamahata, Isaku" <isaku.yamahata@...el.com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "binbin.wu@...ux.intel.com"
	<binbin.wu@...ux.intel.com>, "ackerleytng@...gle.com"
	<ackerleytng@...gle.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
	"Annapurve, Vishal" <vannapurve@...gle.com>, "tabba@...gle.com"
	<tabba@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun"
	<jun.miao@...el.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
	"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is
 RUNNABLE

On Wed, 2025-07-02 at 08:12 +0800, Yan Zhao wrote:
> > Then (3) could be skipped in the case of ability to demote under read lock?
> > 
> > I noticed that the other lpage_info updaters took mmu write lock, and I
> > wasn't sure why. We shouldn't take a lock that we don't actually need just
> > for a safety margin or to copy other code.
> Using write mmu_lock is for a reason.
> 
> In the 3 steps, 
> 1. set lpage_info
> 2. demote if needed
> 3. go to fault handler
> 
> Step 2 requires holding write mmu_lock before invoking
> kvm_split_boundary_leafs(). The write mmu_lock may also be dropped and
> re-acquired within kvm_split_boundary_leafs(), e.g. for memory allocation.
> 
> If 1 is done under read mmu_lock, other vCPUs could still fault in at 2MB
> level after the demote in step 2.
> Luckily, TDX doesn't currently support promotion.
> But we can avoid wading into this complex situation by holding write
> mmu_lock in 1.

I don't think "some code might race in the future" is a good reason to take
the write lock.

> 
> > > > For most accept cases, we could fault in the PTE's on the read lock.
> > > > And in the future we could
> > > 
> > > The actual mapping at 4KB level is still with read mmu_lock in
> > > __vmx_handle_ept_violation().
> > > 
> > > > have a demote that could work under read lock, as we talked. So
> > > > kvm_split_boundary_leafs() could often be unneeded, or could work
> > > > under read lock when needed.
> > > Could we leave the "demote under read lock" as a future optimization? 
> > 
> > We could add it to the list. If we have a TDX module that supports demote
> > with a single SEAMCALL, then we don't have the rollback problem. The
> > optimization could utilize that. That said, we should focus on the
> > optimizations that make the biggest difference to real TDs. Your data
> > suggests this might not be the case today.
> Ok. 
>  
> > > > What is the problem in hugepage_set_guest_inhibit() that requires
> > > > the write lock?
> > > As above, to avoid other vCPUs reading a stale mapping level and
> > > splitting under read mmu_lock.
> > 
> > We need mmu write lock for demote, but as long as the order is:
> > 1. set lpage_info
> > 2. demote if needed
> > 3. go to fault handler
> > 
> > Then (3) should have what it needs even if another fault races (1).
> See the above comment for why we need to hold write mmu_lock for 1.
> 
> Besides, as we need write mmu_lock anyway for 2 (i.e. to hold write
> mmu_lock before walking the SPTEs to check if there's any existing
> mapping), I don't see any performance impact from holding write mmu_lock
> for 1.

It's a maintainability problem too. Someday someone may want to remove it and
scratch their head over what race they are missing.

> 
> 
> > > As guest_inhibit is set one-way, we could test it using
> > > hugepage_test_guest_inhibit() without holding the lock. The need to
> > > hold write mmu_lock for hugepage_set_guest_inhibit() is then greatly
> > > reduced (in my testing, 11 times during VM boot).
> > >  
> > > > But in any case, it seems like we have *a* solution here. It doesn't
> > > > seem like there are any big downsides. Should we close it?
> > > I think it's good, as long as Sean doesn't disagree :)
> > 
> > He seemed onboard. Let's close it. We can even discuss lpage_info update
> > locking on v2.
> Ok. I'll use write mmu_lock for updating lpage_info in v2 first.

Specifically, why do the other lpage_info-updating functions take the mmu
write lock? Are you sure there is no other reason?
