linux-kernel - Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <aCxMtjuvYHk2oWbc@yzhao56-desk.sh.intel.com>
Date: Tue, 20 May 2025 17:34:46 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
CC: "Huang, Kai" <kai.huang@...el.com>, "Shutemov, Kirill"
	<kirill.shutemov@...el.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>, "Du, Fan"
	<fan.du@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
	"david@...hat.com" <david@...hat.com>, "Li, Zhiquan1"
	<zhiquan1.li@...el.com>, "vbabka@...e.cz" <vbabka@...e.cz>,
	"tabba@...gle.com" <tabba@...gle.com>, "thomas.lendacky@....com"
	<thomas.lendacky@....com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "seanjc@...gle.com" <seanjc@...gle.com>,
	"Weiny, Ira" <ira.weiny@...el.com>, "michael.roth@....com"
	<michael.roth@....com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
	"Yamahata, Isaku" <isaku.yamahata@...el.com>, "ackerleytng@...gle.com"
	<ackerleytng@...gle.com>, "binbin.wu@...ux.intel.com"
	<binbin.wu@...ux.intel.com>, "Peng, Chao P" <chao.p.peng@...el.com>,
	"quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Annapurve, Vishal"
	<vannapurve@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun"
	<jun.miao@...el.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
	"pgonda@...gle.com" <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is
 RUNNABLE

On Tue, May 20, 2025 at 12:53:33AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2025-05-19 at 16:32 +0800, Yan Zhao wrote:
> > > On the opposite, if other non-Linux TDs don't follow 1G->2M->4K accept
> > > order,
> > > e.g., they always accept 4K, there could be *endless EPT violation* if I
> > > understand your words correctly.
> > > 
> > > Isn't this yet-another reason we should choose to return PG_LEVEL_4K instead
> > > of
> > > 2M if no accept level is provided in the fault?
> > As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs.
> > TD's accept operations at size > 4KB will get TDACCEPT_SIZE_MISMATCH.
> 
> TDX_PAGE_SIZE_MISMATCH is a valid error code that the guest should handle. The
> docs say the VMM needs to demote *if* the mapping is large and the accept size
> is small. But if we map at 4k size for non-accept EPT violations, we won't hit
> this case. I also wonder what is preventing the TDX module from handling a 2MB
> accept size at 4k mappings. It could be changed maybe.
> 
> But I think Kai's question was: why are we complicating the code for the case of
> non-Linux TDs that also use #VE for accept? It's not necessary to be functional,
> and there aren't any known TDs like that which are expected to use KVM today.
> (err, except the MMU stress test). So in another form the question is: should we
> optimize KVM for a case we don't even know if anyone will use? The answer seems
> obviously no to me.
So, you want to disallow huge pages for non-Linux TDs, then we have no need
to support splitting in the fault path, right?

I'm OK if we don't care non-Linux TDs for now.
This can simplify the splitting code and we can add the support when there's a
need.

> I think this connects the question of whether we can pass the necessary info
> into fault via synthetic error code. Consider this new design:
> 
>  - tdx_gmem_private_max_mapping_level() simply returns 4k for prefetch and pre-
> runnable, otherwise returns 2MB
Why prefetch and pre-runnable faults go the first path, while

>  - if fault has accept info 2MB size, pass 2MB size into fault. Otherwise pass
> 4k (i.e. VMs that are relying on #VE to do the accept won't get huge pages
> *yet*).
other faults go the second path?
 
> What goes wrong? Seems simpler and no more stuffing fault info on the vcpu.
I tried to avoid the double paths.
IMHO, it's confusing to specify max_level from two paths.

The fault info in vcpu_tdx isn't a real problem as it's per-vCPU.
An existing example in KVM is vcpu->arch.mmio_gfn.

We don't need something like the vcpu->arch.mmio_gen because
tdx->violation_gfn_* and tdx->violation_request_level are reset in each
tdx_handle_ept_violation().


BTW, dug into some history:

In v18 of TDX basic series,
enforcing 4KB for pre-runnable faults were done by passing PG_LEVEL_4K to
kvm_mmu_map_tdp_page().
https://lore.kernel.org/all/1a64f798b550dad9e096603e8dae3b6e8fb2fbd5.1705965635.git.isaku.yamahata@intel.com/
https://lore.kernel.org/all/97bb1f2996d8a7b828cd9e3309380d1a86ca681b.1705965635.git.isaku.yamahata@intel.com/

For the other faults, it's done by altering max_level in kvm_mmu_do_page_fault(),
and Paolo asked to use the tdx_gmem_private_max_mapping_level() path.
https://lore.kernel.org/all/CABgObfbu1-Ok607uYdo4DzwZf8ZGVQnvHU+y9_M1Zae55K5xwQ@mail.gmail.com/

For the patch "KVM: x86/mmu: Allow per-VM override of the TDP max page level",
it's initially acked by Paolo in v2, and Sean's reply is at
https://lore.kernel.org/all/YO3%2FgvK9A3tgYfT6@google.com .