Message-ID: <aFvBNromdrkEtPp6@yzhao56-desk.sh.intel.com>
Date: Wed, 25 Jun 2025 17:28:22 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
CC: "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao"
<xiaoyao.li@...el.com>, "Huang, Kai" <kai.huang@...el.com>, "Du, Fan"
<fan.du@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com"
<thomas.lendacky@....com>, "vbabka@...e.cz" <vbabka@...e.cz>, "Li, Zhiquan1"
<zhiquan1.li@...el.com>, "Shutemov, Kirill" <kirill.shutemov@...el.com>,
"michael.roth@....com" <michael.roth@....com>, "seanjc@...gle.com"
<seanjc@...gle.com>, "Weiny, Ira" <ira.weiny@...el.com>, "Peng, Chao P"
<chao.p.peng@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
"Yamahata, Isaku" <isaku.yamahata@...el.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "binbin.wu@...ux.intel.com"
<binbin.wu@...ux.intel.com>, "ackerleytng@...gle.com"
<ackerleytng@...gle.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"Annapurve, Vishal" <vannapurve@...gle.com>, "tabba@...gle.com"
<tabba@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun"
<jun.miao@...el.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is
RUNNABLE
On Wed, Jun 25, 2025 at 02:35:59AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-06-24 at 17:57 +0800, Yan Zhao wrote:
> > Could we provide the info via the private_max_mapping_level hook (i.e. via
> > tdx_gmem_private_max_mapping_level())?
>
> This is one of the previous two methods discussed. Can you elaborate on what you
> are trying to say?
I don't get why we can't use the existing tdx_gmem_private_max_mapping_level()
to convey the max level at which a vendor wants a GFN to be mapped.

Before TDX huge pages, tdx_gmem_private_max_mapping_level() always returns 4KB;
with TDX huge pages, it returns
- 4KB during the TD build stage
- 4KB or 2MB at TD runtime

Why does KVM need to care how the vendor determines this max_level?
I think a vendor should be free to decide based on software limitations, the
guest's wishes, hardware bugs, or whatever.
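
For instance, the hook could be as simple as the sketch below (just a sketch of
what I mean, not the exact RFC code; I'm assuming the TD state is tracked via
to_kvm_tdx(kvm)->state and TD_STATE_RUNNABLE as in the current TDX code):

static int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
{
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);

        /* During the TD build stage, only 4KB mappings are allowed. */
        if (kvm_tdx->state != TD_STATE_RUNNABLE)
                return PG_LEVEL_4K;

        /*
         * At TD runtime, the vendor side is free to return 2MB, or 4KB
         * based on software limitations, the guest's wishes, hardware
         * bugs, or whatever.
         */
        return PG_LEVEL_2M;
}

KVM would then just clamp the fault's max_level to whatever this hook returns,
without caring how the value was derived.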
> > Or what about introducing a vendor hook in __kvm_mmu_max_mapping_level() for a
> > private fault?
> >
> > > Maybe we could have EPT violations that contain 4k accept sizes first update the
> > > attribute for the GFN to be accepted or not, like have tdx.c call out to set
> > > kvm_lpage_info->disallow_lpage in the rarer case of 4k accept size? Or something
> > Something like kvm_lpage_info->disallow_lpage would disallow later page
> > promotion, though we don't support it right now.
>
> Well I was originally thinking it would not set kvm_lpage_info->disallow_lpage
> directly, but rely on the logic that checks for mixed attributes. But more
> below...
>
> >
> > > like that. Maybe set an "accepted" attribute, or something. Not sure if it could be
> > Setting an "accepted" attribute in the EPT violation handler?
> > It's a little odd, as the accept operation is not yet completed.
>
> I guess the question in both of these comments is: what is the life cycle? The
> guest could call TDG.MEM.PAGE.RELEASE to unaccept it as well. Oh, geez. It looks
> like TDG.MEM.PAGE.RELEASE will give the same size hints in the EPT violation. So
> an accept attribute is not going to work, at least without TDX module changes.
>
>
> Actually, the problem we have doesn't fit the mixed-attributes behavior. If many
> vCPUs accept a 2MB region at 4KB page size, the entire 2MB range could be non-
> mixed and then individual accepts would fail.
>
>
> So instead there could be a KVM_LPAGE_GUEST_INHIBIT that doesn't get cleared
Set KVM_LPAGE_GUEST_INHIBIT via a TDVMCALL?
Or just set KVM_LPAGE_GUEST_INHIBIT when an EPT violation carries 4KB level
info?

I guess it's the latter, as that avoids modifying both EDK2 and the Linux
guest. I observed ~2710 instances of "guest accepts at 4KB when KVM can map at
2MB" during the boot of a TD with 4GB memory.

But does that mean TDX needs to hold the write mmu_lock in the EPT violation
handler and set KVM_LPAGE_GUEST_INHIBIT upon finding that a violation carries
4KB level info?
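
Something like the sketch below? (All the helper names here are made up, just
to check that I understand the intended locking; kvm_gfn_set_guest_inhibit()
and tdx_demote_page() don't exist.)

static void tdx_inhibit_huge_page(struct kvm *kvm, gfn_t gfn)
{
        struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);

        write_lock(&kvm->mmu_lock);
        /*
         * Flag the 2MB range containing @gfn in kvm_lpage_info so the
         * fault handler will never map it at 2MB again. The flag is
         * never cleared based on mixed attributes.
         */
        kvm_gfn_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);
        /* Demote any existing 2MB mapping while the write lock is held. */
        tdx_demote_page(kvm, gfn);
        write_unlock(&kvm->mmu_lock);
}

i.e., tdx.c would call tdx_inhibit_huge_page() from the EPT violation handler,
before entering the fault handler, whenever the exit carries a 4KB accept size.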
> based on mixed attributes. It would be one way. It would need to get set by
> something like kvm_write_track_add_gfn() that lives in tdx.c and is called
> before going into the fault handler on 4KB accept size. It would have to take
> the mmu write lock I think, which would kill scalability in the 4KB accept case
> (but not the normal 2MB one). But as long as the mmu write lock is held, demote
> will be no problem, which the operation would also need to do.
>
> I think it actually makes KVM's behavior easier to understand. We don't need to
> worry about races between multiple accept sizes and things like that. It also
> leaves the core MMU code mostly untouched. Performance/scalability-wise, it only
> punishes the rare case.
Let me write down my understanding to check whether it's correct:

- When a TD is NOT configured to support the KVM_LPAGE_GUEST_INHIBIT TDVMCALL,
  KVM always maps at 4KB.
- When a TD is configured to support the KVM_LPAGE_GUEST_INHIBIT TDVMCALL:
  (a)
  1. guest accepts at 4KB
  2. TDX sets KVM_LPAGE_GUEST_INHIBIT and tries splitting (with write mmu_lock
     held)
  3. KVM maps at 4KB (with read mmu_lock held)
  4. guest's 4KB accept succeeds
  (b)
  1. guest accepts at 2MB
  2. KVM maps at 4KB for some reason
  3. guest's 2MB accept fails with TDACCEPT_SIZE_MISMATCH
  4. guest retries the accept at 4KB
  5. guest's 4KB accept succeeds
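
On the guest side, flow (b) would look roughly like the sketch below (modeled
loosely on the accept fallback in the Linux guest's TDX code;
tdcall_accept_page() is a made-up wrapper around TDG.MEM.PAGE.ACCEPT that
returns 0 on success):

static bool guest_accept_2m_range(unsigned long start)
{
        unsigned long gpa;

        /* (b)1: try TDG.MEM.PAGE.ACCEPT at 2MB first. */
        if (!tdcall_accept_page(start, PG_LEVEL_2M))
                return true;

        /*
         * (b)3: KVM mapped the range at 4KB, so the 2MB accept failed
         * with TDACCEPT_SIZE_MISMATCH.
         * (b)4: fall back to accepting each 4KB page individually.
         */
        for (gpa = start; gpa < start + SZ_2M; gpa += SZ_4K) {
                if (tdcall_accept_page(gpa, PG_LEVEL_4K))
                        return false;
        }

        /* (b)5: all 4KB accepts succeeded. */
        return true;
}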
> To leave the option open to promote the GFNs in the future, a GHCI interface
> or similar could be defined for the guest to say "I don't care about page size
> anymore for this gfn". So it won't close it off forever.
ok.