Message-ID: <908a8abdf0544d4fba23d3667651c4cfcafa9c4f.camel@intel.com>
Date: Tue, 1 Jul 2025 15:36:22 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "Zhao, Yan Y" <yan.y.zhao@...el.com>
CC: "Du, Fan" <fan.du@...el.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>,
"Huang, Kai" <kai.huang@...el.com>, "Shutemov, Kirill"
<kirill.shutemov@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com"
<thomas.lendacky@....com>, "vbabka@...e.cz" <vbabka@...e.cz>, "Li, Zhiquan1"
<zhiquan1.li@...el.com>, "quic_eberman@...cinc.com"
<quic_eberman@...cinc.com>, "michael.roth@....com" <michael.roth@....com>,
"seanjc@...gle.com" <seanjc@...gle.com>, "Weiny, Ira" <ira.weiny@...el.com>,
"pbonzini@...hat.com" <pbonzini@...hat.com>, "Peng, Chao P"
<chao.p.peng@...el.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>,
"ackerleytng@...gle.com" <ackerleytng@...gle.com>, "kvm@...r.kernel.org"
<kvm@...r.kernel.org>, "Annapurve, Vishal" <vannapurve@...gle.com>,
"tabba@...gle.com" <tabba@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>,
"Miao, Jun" <jun.miao@...el.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is
RUNNABLE
On Tue, 2025-07-01 at 10:41 +0800, Yan Zhao wrote:
> > Can you explain what you found regarding the write lock need?
> Here, the write lock protects 2 steps:
> (1) update lpage_info.
> (2) try splitting if there's any existing 2MB mapping.
>
> The write mmu_lock is needed because lpage_info is read under read mmu_lock in
> kvm_tdp_mmu_map().
>
> kvm_tdp_mmu_map
> kvm_mmu_hugepage_adjust
> kvm_lpage_info_max_mapping_level
>
> If we update the lpage_info under the read mmu_lock, other vCPUs may still
> map at a stale 2MB level even after lpage_info has been updated by
> hugepage_set_guest_inhibit().
>
> Therefore, we must perform splitting under the write mmu_lock to ensure there
> are no 2MB mappings after hugepage_set_guest_inhibit().
>
> Otherwise, during later mapping in __vmx_handle_ept_violation(), splitting in
> the fault path could be triggered because the KVM MMU finds the goal level is
> 4KB while an existing 2MB mapping is present.
Could it be:
1. mmu read lock
2. update lpage_info
3. mmu write lock upgrade
4. demote
5. mmu unlock
Then (3) could be skipped if we gain the ability to demote under the read lock?
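Something like this rough sketch of 1-5 (the demote helper and the signatures
are stand-ins, not settled code):

/*
 * Sketch only.  hugepage_set_guest_inhibit() is the helper from this
 * series (signature guessed); tdx_demote_existing_2m() is a stand-in for
 * whatever the demote/split path ends up being called.
 */
static int tdx_inhibit_and_demote(struct kvm *kvm,
				  struct kvm_memory_slot *slot, gfn_t gfn)
{
	int r;

	/* 1-2: flip the one-way inhibit bit under the read lock. */
	read_lock(&kvm->mmu_lock);
	hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);
	read_unlock(&kvm->mmu_lock);

	/*
	 * 3-5: rwlock_t has no real upgrade, so "upgrade" means drop the
	 * read lock and retake as write.  Any fault that raced in and
	 * mapped 2MB before seeing the inhibit bit gets split here, since
	 * the write lock serializes against the read-lock fault path.  If
	 * demote learns to run under the read lock, this is the part that
	 * goes away.
	 */
	write_lock(&kvm->mmu_lock);
	r = tdx_demote_existing_2m(kvm, slot, gfn);
	write_unlock(&kvm->mmu_lock);

	return r;
}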
I noticed that the other lpage_info updaters took mmu write lock, and I wasn't
sure why. We shouldn't take a lock that we don't actually need just for safety
margin or to copy other code.
>
>
> > For most accept
> > cases, we could fault in the PTE's on the read lock. And in the future we
> > could
>
> The actual mapping at 4KB level is still done under the read mmu_lock in
> __vmx_handle_ept_violation().
>
> > have a demote that could work under read lock, as we talked. So
> > kvm_split_boundary_leafs() could often be unneeded, or could work under the
> > read lock when needed.
> Could we leave the "demote under read lock" as a future optimization?
We could add it to the list. If we have a TDX module that supports demote with a
single SEAMCALL, then we don't have the rollback problem, and the optimization
could utilize that. That said, we should focus on the optimizations that make the
biggest difference to real TDs. Your data suggests this one might not be among
them today.
>
>
> > What is the problem in hugepage_set_guest_inhibit() that requires the write
> > lock?
> As above, to avoid the other vCPUs reading stale mapping level and splitting
> under read mmu_lock.
We need the mmu write lock for demote, but as long as the order is:
1. set lpage_info
2. demote if needed
3. go to fault handler
Then (3) should have what it needs even if another fault races (1).
>
> As guest_inhibit is set one-way, we could test it using
> hugepage_test_guest_inhibit() without holding the lock. The need to take the
> write mmu_lock for hugepage_set_guest_inhibit() is then greatly reduced
> (in my testing, only 11 times during VM boot).
>
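If we go that route, I'm picturing the usual unlocked check plus a re-check
under the lock, roughly like this (sketch; helper signatures are guessed and
the split call stands in for kvm_split_boundary_leafs()):

static int tdx_inhibit_hugepage(struct kvm *kvm,
				struct kvm_memory_slot *slot, gfn_t gfn)
{
	int r = 0;

	/*
	 * guest_inhibit only goes from clear to set, so an unlocked test
	 * is enough to skip the write lock on the hot path.
	 */
	if (hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M))
		return 0;

	write_lock(&kvm->mmu_lock);
	/* Re-check under the lock in case another vCPU won the race. */
	if (!hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M)) {
		hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);
		/* Stand-in for the real split of any existing 2MB leaf. */
		r = tdx_split_2m_if_mapped(kvm, slot, gfn);
	}
	write_unlock(&kvm->mmu_lock);

	return r;
}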
> > But in any case, it seems like we have *a* solution here. It doesn't seem
> > like
> > there are any big downsides. Should we close it?
> I think it's good, as long as Sean doesn't disagree :)
He seemed on board. Let's close it. We can even discuss the lpage_info update
locking on v2.