Message-ID: <0930ae315759558c52fd6afb837e6a8b9acc1cc3.camel@intel.com>
Date: Wed, 25 Jun 2025 14:47:47 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "Zhao, Yan Y" <yan.y.zhao@...el.com>
CC: "Du, Fan" <fan.du@...el.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>,
"Huang, Kai" <kai.huang@...el.com>, "Shutemov, Kirill"
<kirill.shutemov@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com"
<thomas.lendacky@....com>, "vbabka@...e.cz" <vbabka@...e.cz>, "Li, Zhiquan1"
<zhiquan1.li@...el.com>, "quic_eberman@...cinc.com"
<quic_eberman@...cinc.com>, "michael.roth@....com" <michael.roth@....com>,
"seanjc@...gle.com" <seanjc@...gle.com>, "Weiny, Ira" <ira.weiny@...el.com>,
"pbonzini@...hat.com" <pbonzini@...hat.com>, "Peng, Chao P"
<chao.p.peng@...el.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>,
"ackerleytng@...gle.com" <ackerleytng@...gle.com>, "kvm@...r.kernel.org"
<kvm@...r.kernel.org>, "Annapurve, Vishal" <vannapurve@...gle.com>,
"tabba@...gle.com" <tabba@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>,
"Miao, Jun" <jun.miao@...el.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is
RUNNABLE
On Wed, 2025-06-25 at 17:28 +0800, Yan Zhao wrote:
> On Wed, Jun 25, 2025 at 02:35:59AM +0800, Edgecombe, Rick P wrote:
> > On Tue, 2025-06-24 at 17:57 +0800, Yan Zhao wrote:
> > > Could we provide the info via the private_max_mapping_level hook (i.e. via
> > > tdx_gmem_private_max_mapping_level())?
> >
> > This is one of the previous two methods discussed. Can you elaborate on what you
> > are trying to say?
> I don't get why we can't use the existing tdx_gmem_private_max_mapping_level()
> to convey the max_level at which a vendor wants a GFN to be mapped.
>
> Before TDX huge pages, tdx_gmem_private_max_mapping_level() always returns 4KB;
> after TDX huge pages, it returns
> - 4KB during the TD build stage
> - 4KB or 2MB at TD runtime
>
> Why does KVM need to care how the vendor determines this max_level?
> I think a vendor should have the freedom to decide based on software
> limitations, the guest's wishes, hardware bugs, or whatever.
I don't see that big of a difference between "KVM" and "vendor". TDX code is KVM
code. Just because it lives in tdx.c doesn't mean it's OK for the logic to be
hard to trace.
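For reference, the shape of that approach is roughly the following. This is
only a minimal sketch: it assumes the hook signature from the base series, and
the kvm_tdx->state != TD_STATE_RUNNABLE check stands in for however the series
actually gates the TD build stage.

    static int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
    {
            struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);

            /* During the TD build stage, pages are added at 4KB. */
            if (kvm_tdx->state != TD_STATE_RUNNABLE)
                    return PG_LEVEL_4K;

            /* At TD runtime, allow 2MB (subject to the rest of the series). */
            return PG_LEVEL_2M;
    }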
I'm not sure what Sean's objection was to that approach, or if he objected just
to the weird SIZE_MISMATCH behavior of the TDX module. I think you already know
why I don't prefer it:
- Requiring demote in the fault handler. This requires either an additional
  write lock inside the mmu read lock, or TDX module changes. Although now I
  wonder if the interrupt error code related problems will get worse with this
  solution. That solution is not settled yet.
- Requiring passing args on the vCPU struct, which, as you point out, works
  functionally today only because the prefault path avoids seeing it. But it's
  fragile.
- The page size behavior is a bit implicit.
I'm coming back to this draft after PUCK. Sean shared his thoughts there; I'll
try to summarize. He didn't like how the page size requirements were passed
through the fault handler in a "transient" way. That "transient" property covers
both of the options we were debating for passing the size info through the
fault handler. He also didn't like how the TDX arch requires KVM to map at a
specific host size around accept. Michael Roth brought up that SNP has the same
requirement, but it can use the zap-and-refault approach.
We then discussed this lpage_info idea. He was in favor of it, but not, I'd say,
overly enthusiastic. In a "least worst option" kind of way.
I think the biggest downside is the MMU write lock. Our goal for this series is
to help performance, not to get huge page sizes for their own sake. So if we do
this idea, we can't fully wave our hands that any optimization is premature. It
*is* an optimization. We need to either convince ourselves that the overall
benefit is still there, or have a plan to adapt the guest to avoid 4KB accepts,
which we were previously discussing requiring anyway.
But I much prefer the deterministic behavior of this approach from a
maintainability perspective.
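To make "deterministic" concrete: the fault path would derive the level from
the existing per-slot lpage_info tracking rather than from any transient
per-fault argument. A simplified sketch of the existing
__kvm_mmu_max_mapping_level() walk it would reuse (lpage_info_slot() is the
real mmu.c helper; the function name here is made up):

    static int max_mapping_level_sketch(const struct kvm_memory_slot *slot,
                                        gfn_t gfn, int max_level)
    {
            /*
             * Walk down from max_level until a level is not disallowed
             * for this gfn; a guest-inhibit bit in disallow_lpage would
             * naturally cap the result.
             */
            for ( ; max_level > PG_LEVEL_4K; max_level--) {
                    if (!lpage_info_slot(gfn, slot, max_level)->disallow_lpage)
                            break;
            }
            return max_level;
    }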
>
> > > Or what about introducing a vendor hook in __kvm_mmu_max_mapping_level() for a
> > > private fault?
> > >
> > > > Maybe we could have EPT violations that contain 4k accept sizes first update the
> > > > attribute for the GFN to be accepted or not, like have tdx.c call out to set
> > > > kvm_lpage_info->disallow_lpage in the rarer case of 4k accept size? Or something
> > > Something like kvm_lpage_info->disallow_lpage would disallow later page
> > > promotion, though we don't support it right now.
> >
> > Well I was originally thinking it would not set kvm_lpage_info->disallow_lpage
> > directly, but rely on the logic that checks for mixed attributes. But more
> > below...
> >
> > >
> > > > like that. Maybe set an "accepted" attribute, or something. Not sure if it could be
> > > Setting an "accepted" attribute in the EPT violation handler?
> > > It's a little odd, as the accept operation is not yet completed.
> >
> > I guess the question in both of these comments is: what is the life cycle. The
> > guest could call TDG.MEM.PAGE.RELEASE to unaccept it as well. Oh, geez. It looks
> > like TDG.MEM.PAGE.RELEASE will give the same size hints in the EPT violation. So
> > an accept attribute is not going to work, at least without TDX module changes.
> >
> >
> > Actually, the problem we have doesn't fit the mixed attributes behavior. If many
> > vCPUs accept a 2MB region at 4KB page size, the entire 2MB range could become
> > non-mixed, and then individual accepts would fail.
> >
> >
> > So instead there could be a KVM_LPAGE_GUEST_INHIBIT that doesn't get cleared
> Set KVM_LPAGE_GUEST_INHIBIT via a TDVMCALL?
>
> Or just set the KVM_LPAGE_GUEST_INHIBIT when an EPT violation contains 4KB
> level info?
Yes, that's the idea. 2MB accepts can behave like normal.
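Hypothetical sketch of that, just to show the shape: KVM_LPAGE_GUEST_INHIBIT
and the helper name are made up, the flag bit is illustrative (it must not
collide with the disallow_lpage count or KVM_LPAGE_MIXED_FLAG), and
lpage_info_slot()/disallow_lpage are the existing KVM pieces (lpage_info_slot()
is currently static to mmu.c, so tdx.c would need access to it somehow):

    /* Made-up flag; the real bit would need picking carefully. */
    #define KVM_LPAGE_GUEST_INHIBIT BIT(30)

    static void tdx_set_guest_inhibit(struct kvm *kvm,
                                      struct kvm_memory_slot *slot, gfn_t gfn)
    {
            write_lock(&kvm->mmu_lock);
            lpage_info_slot(gfn, slot, PG_LEVEL_2M)->disallow_lpage |=
                    KVM_LPAGE_GUEST_INHIBIT;
            write_unlock(&kvm->mmu_lock);
    }

Called from tdx.c when an EPT violation carries a 4KB accept level, before
going into the fault handler.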
>
> I guess it's the latter one, as it can avoid modifications to both EDK2 and the
> Linux guest. I observed ~2710 instances of "guest accepts at 4KB when KVM can
> map at 2MB" during the boot-up of a TD with 4GB memory.
Oh, wow, that is more than I expected. Did you notice how many vCPUs they were
spread across? What memory size did you use, and what was your guest memory
configuration?
>
> But does it mean TDX needs to hold the write mmu_lock in the EPT violation
> handler and set KVM_LPAGE_GUEST_INHIBIT on finding a violation that carries
> 4KB level info?
I think so. I didn't check the reason, but the other similar code takes it.
Maybe not? If we don't need to take the mmu write lock, then this idea seems
like a clear winner to me.
>
> > based on mixed attributes. It would be one way. It would need to get set by
> > something like kvm_write_track_add_gfn() that lives in tdx.c and is called
> > before going into the fault handler on a 4KB accept size. It would have to take
> > the mmu write lock, I think, which would kill scalability in the 4KB accept
> > case (but not the normal 2MB one). But as long as the mmu write lock is held,
> > demote (which the operation would also need to do) will be no problem.
> >
> > I think it actually makes KVM's behavior easier to understand. We don't need to
> > worry about races between multiple accept sizes and things like that. It also
> > leaves the core MMU code mostly untouched. Performance/scalability-wise, it
> > only punishes the rare case.
> Let me write down my understanding to check whether it's correct:
I'll respond to this part on your later mail, with corrections.