Message-ID: <f13239faa54062feb9937fe9fc6d087ffcca7ac6.camel@intel.com>
Date: Wed, 2 Jul 2025 00:18:48 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "Zhao, Yan Y" <yan.y.zhao@...el.com>
CC: "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao"
<xiaoyao.li@...el.com>, "Huang, Kai" <kai.huang@...el.com>, "Du, Fan"
<fan.du@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com"
<thomas.lendacky@....com>, "vbabka@...e.cz" <vbabka@...e.cz>, "Li, Zhiquan1"
<zhiquan1.li@...el.com>, "Shutemov, Kirill" <kirill.shutemov@...el.com>,
"michael.roth@....com" <michael.roth@....com>, "seanjc@...gle.com"
<seanjc@...gle.com>, "Weiny, Ira" <ira.weiny@...el.com>, "Peng, Chao P"
<chao.p.peng@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
"Yamahata, Isaku" <isaku.yamahata@...el.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "binbin.wu@...ux.intel.com"
<binbin.wu@...ux.intel.com>, "ackerleytng@...gle.com"
<ackerleytng@...gle.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"Annapurve, Vishal" <vannapurve@...gle.com>, "tabba@...gle.com"
<tabba@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun"
<jun.miao@...el.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is
RUNNABLE
On Wed, 2025-07-02 at 08:12 +0800, Yan Zhao wrote:
> > Then (3) could be skipped if we had the ability to demote under read lock?
> >
> > I noticed that the other lpage_info updaters took the mmu write lock, and I
> > wasn't sure why. We shouldn't take a lock that we don't actually need just
> > for safety margin or to copy other code.
> Using the write mmu_lock here is for a reason.
>
> In the 3 steps,
> 1. set lpage_info
> 2. demote if needed
> 3. go to fault handler
>
> Step 2 requires holding the write mmu_lock before invoking
> kvm_split_boundary_leafs().
> The write mmu_lock may also be dropped and re-acquired in
> kvm_split_boundary_leafs(), e.g. for memory allocation.
>
> If step 1 is done under the read mmu_lock, other vCPUs could still fault in
> at 2MB level after the demote in step 2.
> Luckily, current TDX doesn't support promotion.
> But we can avoid wading into this complex situation by holding the write
> mmu_lock in step 1.
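IIUC, the sequence you're describing is roughly the below. Just a sketch so
we're talking about the same thing -- I'm borrowing the helper names from this
series, but the signature and the gfn handling are made up:

static int set_inhibit_and_demote(struct kvm *kvm,
                                  struct kvm_memory_slot *slot, gfn_t gfn)
{
        int ret;

        write_lock(&kvm->mmu_lock);

        /* 1. Update lpage_info so that new faults map at 4KB only. */
        hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);

        /*
         * 2. Demote any existing 2MB leaf. This may drop and re-acquire
         * mmu_lock internally (e.g. for memory allocation), so (1) must
         * already be visible, or another vCPU could re-fault at 2MB here.
         */
        ret = kvm_split_boundary_leafs(kvm, slot, gfn);

        write_unlock(&kvm->mmu_lock);

        /* 3. The fault handler then maps at 4KB under the read lock. */
        return ret;
}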
I don't think the possibility that some code might race in the future is a good
reason to take the write lock.
>
> > > > For most accept cases, we could fault in the PTEs on the read lock. And
> > > > in the future we could
> > >
> > > The actual mapping at 4KB level is still done under the read mmu_lock in
> > > __vmx_handle_ept_violation().
> > >
> > > > have a demote that could work under read lock, as we talked. So
> > > > kvm_split_boundary_leafs() could often be unneeded, or could work under
> > > > the read lock when needed.
> > > Could we leave the "demote under read lock" as a future optimization?
> >
> > We could add it to the list. If we have a TDX module that supports demote
> > with a single SEAMCALL then we don't have the rollback problem. The
> > optimization could utilize that. That said, we should focus on the
> > optimizations that make the biggest difference to real TDs. Your data
> > suggests this might not be the case today.
> Ok.
>
> > > > What is the problem in hugepage_set_guest_inhibit() that requires the
> > > > write lock?
> > > As above, to avoid other vCPUs reading a stale mapping level and
> > > splitting under the read mmu_lock.
> >
> > We need the mmu write lock for demote, but as long as the order is:
> > 1. set lpage_info
> > 2. demote if needed
> > 3. go to fault handler
> >
> > Then (3) should have what it needs even if another fault races with (1).
> See the above comment for why we need to hold the write mmu_lock for step 1.
>
> Besides, as we need the write mmu_lock anyway for step 2 (i.e., holding the
> write mmu_lock before walking the SPTEs to check whether there's any existing
> mapping), I don't see any performance impact from holding the write mmu_lock
> for step 1.
It's a maintainability problem too. Someday someone may want to remove it and
scratch their head over what race they are missing.
>
>
> > > As guest_inhibit is set one-way, we could test it using
> > > hugepage_test_guest_inhibit() without holding the lock. The number of times
> > > the write mmu_lock must be taken for hugepage_set_guest_inhibit() is then
> > > greatly reduced (in my testing, 11 times during VM boot).
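Right, that's the usual check, lock, re-check pattern. Roughly (again only a
sketch reusing the series' helper names; the exact arguments are
illustrative):

        /*
         * Lockless fast path: guest_inhibit is one-way, so once it reads
         * as set we can skip taking the write lock entirely.
         */
        if (!hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M)) {
                write_lock(&kvm->mmu_lock);
                /* Re-check in case another vCPU set it meanwhile. */
                if (!hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M))
                        hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);
                write_unlock(&kvm->mmu_lock);
        }

Because the flag can never go from set back to clear, the lockless test can't
see a stale "set" value, so the fast path is safe.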
> > >
> > > > But in any case, it seems like we have *a* solution here. It doesn't
> > > > seem like there are any big downsides. Should we close it?
> > > I think it's good, as long as Sean doesn't disagree :)
> >
> > He seemed onboard. Let's close it. We can even discuss lpage_info update
> > locking on v2.
> Ok. I'll use the write mmu_lock for updating lpage_info in v2 first.
Specifically, why do the other lpage_info updating functions take the mmu write
lock? Are you sure there is no other reason?