Message-ID: <d922d322901246bd3ee0ba745cdbe765078e92bd.camel@intel.com>
Date: Wed, 21 May 2025 15:40:15 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "Zhao, Yan Y" <yan.y.zhao@...el.com>
CC: "Du, Fan" <fan.du@...el.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>,
	"Huang, Kai" <kai.huang@...el.com>, "Shutemov, Kirill"
	<kirill.shutemov@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
	"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com"
	<thomas.lendacky@....com>, "tabba@...gle.com" <tabba@...gle.com>, "Li,
 Zhiquan1" <zhiquan1.li@...el.com>, "quic_eberman@...cinc.com"
	<quic_eberman@...cinc.com>, "michael.roth@....com" <michael.roth@....com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"seanjc@...gle.com" <seanjc@...gle.com>, "pbonzini@...hat.com"
	<pbonzini@...hat.com>, "binbin.wu@...ux.intel.com"
	<binbin.wu@...ux.intel.com>, "Weiny, Ira" <ira.weiny@...el.com>, "Yamahata,
 Isaku" <isaku.yamahata@...el.com>, "vbabka@...e.cz" <vbabka@...e.cz>,
	"ackerleytng@...gle.com" <ackerleytng@...gle.com>, "Peng, Chao P"
	<chao.p.peng@...el.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
	"Annapurve, Vishal" <vannapurve@...gle.com>, "jroedel@...e.de"
	<jroedel@...e.de>, "Miao, Jun" <jun.miao@...el.com>, "pgonda@...gle.com"
	<pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is
 RUNNABLE

On Tue, 2025-05-20 at 17:34 +0800, Yan Zhao wrote:
> So, you want to disallow huge pages for non-Linux TDs, then we have no need
> to support splitting in the fault path, right?
> 
> I'm OK if we don't care non-Linux TDs for now.
> This can simplify the splitting code and we can add the support when there's a
> need.

We do need to care about non-Linux TDs functioning, but we don't need to
optimize for them at this point. We need to optimize for things that happen
often. TDs that rely on pending #VE are rare, and don't need huge pages in
order to work.

Yesterday Kirill and I were chatting offline about the newly defined
TDG.MEM.PAGE.RELEASE. It is kind of like an unaccept, so another possibility is:
1. Guest accepts at 2MB
2. Guest releases at 2MB (no notice to VMM)
3. Guest accepts at 4k, EPT violation with expectation to demote

In that case, KVM won't know to expect the release, or that it needs to
preemptively map things at 4KB.

For full coverage of the issue, can we discuss a little bit what demote in the
fault path would look like? The current zapping operation that is involved
depends on the mmu write lock. And I remember you had a POC that added
essentially a hidden exclusive lock in TDX code as a substitute. But unlike the
other callers, the fault path demote case could actually handle failure. So if
we just returned busy and didn't try to force the retry, would we just run the
risk of interfering with the TDX module's SEPT lock? Is that the only issue
with a design that would allow failure of demote in the fault path?
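To make the question concrete, here is a minimal sketch of what a
failure-tolerant demote in the fault path could look like. The RET_PF_* values
and tdh_mem_page_demote_busy() are stand-ins for illustration, not KVM's actual
fault-handler API or the real SEAMCALL wrapper:

```c
#include <stdbool.h>

/* Illustrative only: stand-ins for KVM's fault-handler return codes. */
enum pf_ret { RET_PF_RETRY, RET_PF_CONTINUE };

/*
 * Pretend SEAMCALL wrapper: reports BUSY when the TDX module's SEPT lock
 * is contended (e.g. taken as part of zero-step mitigation).
 */
bool tdh_mem_page_demote_busy(bool sept_lock_contended)
{
	return sept_lock_contended;
}

enum pf_ret tdx_fault_path_demote(bool sept_lock_contended)
{
	/*
	 * Unlike the other demote callers, the fault path can tolerate
	 * failure: on BUSY, return RET_PF_RETRY and let the vCPU re-fault,
	 * instead of forcing a retry under a hidden exclusive lock.
	 */
	if (tdh_mem_page_demote_busy(sept_lock_contended))
		return RET_PF_RETRY;

	return RET_PF_CONTINUE;	/* demote succeeded, map at 4KB */
}
```

The open question above is whether the re-fault loop this creates is safe
against the SEPT lock, or whether it needs a TDX module change.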

Let's keep in mind that we could ask for TDX module changes to enable this path.
I think we could probably get away with ignoring TDG.MEM.PAGE.RELEASE if we had
a plan to fix it up with TDX module changes. And if the ultimate root cause of
the complication is avoiding zero-step (SEPT lock), we should fix that instead
of designing around it further.

> 
> > I think this connects the question of whether we can pass the necessary info
> > into fault via synthetic error code. Consider this new design:
> > 
> >   - tdx_gmem_private_max_mapping_level() simply returns 4k for prefetch and pre-
> > runnable, otherwise returns 2MB
> Why prefetch and pre-runnable faults go the first path, while

Because these are either passed into private_max_mapping_level(), or not
associated with the fault (runnable state).

> 
> >   - if fault has accept info 2MB size, pass 2MB size into fault. Otherwise pass
> > 4k (i.e. VMs that are relying on #VE to do the accept won't get huge pages
> > *yet*).
> other faults go the second path?

This info is related to the specific fault.
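As a sketch of the two paths described above (the struct, field names, and
PG_LEVEL_* values here are illustrative stand-ins, not the real KVM/TDX
definitions):

```c
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's enum pg_level values. */
#define PG_LEVEL_4K 1
#define PG_LEVEL_2M 2

/* Hypothetical per-fault context; not the real KVM fault structure. */
struct tdx_fault_ctx {
	bool is_prefetch;	/* prefetch faults always map at 4KB */
	bool td_runnable;	/* pre-runnable TDs are limited to 4KB */
	int accept_level;	/* level from the fault's accept info, 0 if none */
};

int tdx_gmem_private_max_mapping_level(const struct tdx_fault_ctx *f)
{
	/* First path: prefetch and pre-runnable faults simply get 4KB. */
	if (f->is_prefetch || !f->td_runnable)
		return PG_LEVEL_4K;

	/*
	 * Second path: the fault's own accept info decides. A 2MB accept
	 * gets 2MB; pending-#VE VMs (no accept info) fall back to 4KB
	 * for now.
	 */
	return f->accept_level == PG_LEVEL_2M ? PG_LEVEL_2M : PG_LEVEL_4K;
}
```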

>  
> > What goes wrong? Seems simpler and no more stuffing fault info on the vcpu.
> I tried to avoid the double paths.
> IMHO, it's confusing to specify max_level from two paths.
> 
> The fault info in vcpu_tdx isn't a real problem as it's per-vCPU.
> An existing example in KVM is vcpu->arch.mmio_gfn.

mmio_gfn isn't info about the fault though, it's info about the gfn being mmio.
So not fault scoped.

> 
> We don't need something like the vcpu->arch.mmio_gen because
> tdx->violation_gfn_* and tdx->violation_request_level are reset in each
> tdx_handle_ept_violation().
> 
> 
> BTW, dug into some history:
> 
> In v18 of TDX basic series,
> enforcing 4KB for pre-runnable faults was done by passing PG_LEVEL_4K to
> kvm_mmu_map_tdp_page().
> https://lore.kernel.org/all/1a64f798b550dad9e096603e8dae3b6e8fb2fbd5.1705965635.git.isaku.yamahata@intel.com/
> https://lore.kernel.org/all/97bb1f2996d8a7b828cd9e3309380d1a86ca681b.1705965635.git.isaku.yamahata@intel.com/
> 
> For the other faults, it's done by altering max_level in kvm_mmu_do_page_fault(),
> and Paolo asked to use the tdx_gmem_private_max_mapping_level() path.
> https://lore.kernel.org/all/CABgObfbu1-Ok607uYdo4DzwZf8ZGVQnvHU+y9_M1Zae55K5xwQ@mail.gmail.com/
> 
> For the patch "KVM: x86/mmu: Allow per-VM override of the TDP max page level",
> it's initially acked by Paolo in v2, and Sean's reply is at
> https://lore.kernel.org/all/YO3%2FgvK9A3tgYfT6@google.com .

The SNP case is not checking fault info, it's closer to the other cases. I don't
see that any of that conversation applies to this case. Can you clarify?

On the subject of whether to pass the accept level into the fault, or stuff it
on the vcpu, I'm still in the camp that it is better to pass it in the error
code. If you disagree, let's see if we can flag down Sean and Paolo to weigh in.
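For illustration, carrying the accept level in a synthetic error code could
look something like the below. The bit positions are made up for the sketch
and are not KVM's actual PFERR layout:

```c
#include <stdint.h>

/*
 * Hypothetical encoding: stuff the guest's ACCEPT level into otherwise
 * unused high bits of the synthetic error code, instead of caching it on
 * the vCPU. Bit 56+ is illustrative only.
 */
#define PFERR_ACCEPT_LEVEL_SHIFT 56
#define PFERR_ACCEPT_LEVEL_MASK  (0xfULL << PFERR_ACCEPT_LEVEL_SHIFT)

uint64_t pferr_set_accept_level(uint64_t err, unsigned int level)
{
	return (err & ~PFERR_ACCEPT_LEVEL_MASK) |
	       ((uint64_t)level << PFERR_ACCEPT_LEVEL_SHIFT);
}

unsigned int pferr_accept_level(uint64_t err)
{
	return (err & PFERR_ACCEPT_LEVEL_MASK) >> PFERR_ACCEPT_LEVEL_SHIFT;
}
```

The appeal of this shape is that the level stays fault-scoped, so nothing on
the vCPU needs resetting per violation.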


