Open Source and information security mailing list archives
 
Message-ID: <a6ffe23fb97e64109f512fa43e9f6405236ed40a.camel@intel.com>
Date: Tue, 24 Jun 2025 18:35:59 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "Zhao, Yan Y" <yan.y.zhao@...el.com>
CC: "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao"
	<xiaoyao.li@...el.com>, "Huang, Kai" <kai.huang@...el.com>, "Du, Fan"
	<fan.du@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
	"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com"
	<thomas.lendacky@....com>, "vbabka@...e.cz" <vbabka@...e.cz>, "Li, Zhiquan1"
	<zhiquan1.li@...el.com>, "Shutemov, Kirill" <kirill.shutemov@...el.com>,
	"michael.roth@....com" <michael.roth@....com>, "seanjc@...gle.com"
	<seanjc@...gle.com>, "Weiny, Ira" <ira.weiny@...el.com>, "Peng, Chao P"
	<chao.p.peng@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
	"Yamahata, Isaku" <isaku.yamahata@...el.com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "binbin.wu@...ux.intel.com"
	<binbin.wu@...ux.intel.com>, "ackerleytng@...gle.com"
	<ackerleytng@...gle.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
	"Annapurve, Vishal" <vannapurve@...gle.com>, "tabba@...gle.com"
	<tabba@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun"
	<jun.miao@...el.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
	"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is
 RUNNABLE

On Tue, 2025-06-24 at 17:57 +0800, Yan Zhao wrote:
> Could we provide the info via the private_max_mapping_level hook (i.e. via
> tdx_gmem_private_max_mapping_level())?

This is one of the two methods discussed previously. Can you elaborate on what you
are trying to say?

> 
> Or what about introducing a vendor hook in __kvm_mmu_max_mapping_level() for a
> private fault?
> 
> > Maybe we could have EPT violations that contain 4k accept sizes first update the
> > attribute for the GFN to be accepted or not, like have tdx.c call out to set
> > kvm_lpage_info->disallow_lpage in the rarer case of 4k accept size? Or something
> Something like kvm_lpage_info->disallow_lpage would disallow later page
> promotion, though we don't support it right now.

Well, I was originally thinking it would not set kvm_lpage_info->disallow_lpage
directly, but instead rely on the logic that checks for mixed attributes. But
more below...

> 
> > like that. Maybe set a "accepted" attribute, or something. Not sure if could be
> Setting "accepted" attribute in the EPT violation handler?
> It's a little odd, as the accept operation is not yet completed.

I guess the question behind both of these comments is: what is the life cycle?
The guest could call TDG.MEM.PAGE.RELEASE to unaccept a page as well. Oh, geez.
It looks like TDG.MEM.PAGE.RELEASE will give the same size hints in the EPT
violation. So an accept attribute is not going to work, at least without TDX
module changes.


Actually, the problem we have doesn't fit the mixed-attributes behavior. If many
vCPUs accept a 2MB region at 4KB page size, the entire 2MB range could end up
non-mixed, and then individual accepts would fail.


So instead, one way would be a KVM_LPAGE_GUEST_INHIBIT flag that doesn't get
cleared based on mixed attributes. It would need to be set by something like
kvm_write_track_add_gfn() that lives in tdx.c and is called before going into
the fault handler on a 4KB accept size. I think it would have to take the mmu
write lock, which would kill scalability in the 4KB accept case (but not the
normal 2MB one). But as long as the mmu write lock is held, demote, which the
operation would also need to do, will be no problem.

I think it actually makes KVM's behavior easier to understand. We don't need to
worry about races between multiple accept sizes and things like that. It also
leaves the core MMU code mostly untouched. Performance/scalability-wise, it only
punishes the rare case.

To leave the option open to promote the GFNs in the future, a GHCI interface or
similar could be defined for the guest to say "I don't care about the page size
for this gfn anymore". So it wouldn't close things off forever.

> 
> > done without the mmu write lock... But it might fit KVM better?
