linux-kernel - Re: [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according to vCPU's ACCEPT level

Message-ID: <aCrSKudi5mUVNcSv@yzhao56-desk.sh.intel.com>
Date: Mon, 19 May 2025 14:39:38 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
CC: "kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Li, Xiaoyao"
	<xiaoyao.li@...el.com>, "quic_eberman@...cinc.com"
	<quic_eberman@...cinc.com>, "Hansen, Dave" <dave.hansen@...el.com>,
	"david@...hat.com" <david@...hat.com>, "Li, Zhiquan1"
	<zhiquan1.li@...el.com>, "tabba@...gle.com" <tabba@...gle.com>,
	"vbabka@...e.cz" <vbabka@...e.cz>, "thomas.lendacky@....com"
	<thomas.lendacky@....com>, "michael.roth@....com" <michael.roth@....com>,
	"seanjc@...gle.com" <seanjc@...gle.com>, "Weiny, Ira" <ira.weiny@...el.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"pbonzini@...hat.com" <pbonzini@...hat.com>, "ackerleytng@...gle.com"
	<ackerleytng@...gle.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
	"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>, "Peng, Chao P"
	<chao.p.peng@...el.com>, "Du, Fan" <fan.du@...el.com>, "Annapurve, Vishal"
	<vannapurve@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun"
	<jun.miao@...el.com>, "Shutemov, Kirill" <kirill.shutemov@...el.com>,
	"pgonda@...gle.com" <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 12/21] KVM: TDX: Determine max mapping level
 according to vCPU's ACCEPT level

On Sat, May 17, 2025 at 06:02:14AM +0800, Edgecombe, Rick P wrote:
> On Fri, 2025-05-16 at 14:30 +0800, Yan Zhao wrote:
> > > Looking more closely, I don't see why it's too hard to pass in a
> > > max_fault_level into the fault struct. Totally untested rough idea, what
> > > do you think?
> > Thanks for bringing this up and providing the idea below. In the previous TDX
> > huge page v8, there's a similar implementation [1] [2].
> > 
> > This series did not adopt that approach because that approach requires
> > tdx_handle_ept_violation() to pass in max_fault_level, which is not always
> > available at that stage. e.g.
> > 
> > In patch 19, when vCPU 1 faults on a GFN at 2MB level and then vCPU 2
> > faults on the same GFN at 4KB level, TDX wants to ignore the demotion
> > request caused by vCPU 2's 4KB level fault. So, patch 19 sets
> > tdx->violation_request_level to 2MB in vCPU 2's split callback and fails
> > the split. vCPU 2's __vmx_handle_ept_violation() will see RET_PF_RETRY and
> > either do local retry (or return to the guest).
> 
> I think you mean patch 20 "KVM: x86: Force a prefetch fault's max mapping level
> to 4KB for TDX"?
Sorry, it's patch 21 "KVM: x86: Ignore splitting huge pages in fault path for
TDX".

> > 
> > If it retries locally, tdx_gmem_private_max_mapping_level() will return
> > tdx->violation_request_level, causing KVM to fault at 2MB level for vCPU 2,
> > resulting in a spurious fault, eventually returning to the guest.
> > 
> > As tdx->violation_request_level is per-vCPU and it resets in
> > tdx_get_accept_level() in tdx_handle_ept_violation() (meaning it resets after
> > each invocation of tdx_handle_ept_violation() and only affects the TDX local
> > retry loop), it should not hold any stale value.
> > 
> > Alternatively, instead of having tdx_gmem_private_max_mapping_level() to
> > return tdx->violation_request_level, tdx_handle_ept_violation() could grab
> > tdx->violation_request_level as the max_fault_level to pass to
> > __vmx_handle_ept_violation().
> > 
> > This series chose to use tdx_gmem_private_max_mapping_level() to avoid
> > modification to the KVM MMU core.
> 
> It sounds like Kirill is suggesting we do have to have demotion in the fault
> path. IIRC it adds a lock, but the cost to skip fault path demotion seems to be
> adding up.
Yes. Although Kirill suggests supporting demotion in the fault path, I still
think that using tdx_gmem_private_max_mapping_level() is friendlier to other
potential scenarios, such as when the KVM core MMU requests TDX to perform
page promotion and TDX finds that promotion would consistently fail on a GFN.

Another important reason for not passing a max_fault_level into the fault struct
is that the KVM MMU now has the private_max_mapping_level hook to determine a
private fault's maximum level, which was introduced by commit f32fb32820b1
("KVM: x86: Add hook for determining max NPT mapping level"). We'd better not
introduce another mechanism if the same job can be accomplished via the
private_max_mapping_level hook.

The code in TDX huge page v8 [1][2] simply inherited the old implementation from
its v1 [3], where the private_max_mapping_level hook had not yet been introduced
for private faults.

[1] https://lore.kernel.org/all/4d61104bff388a081ff8f6ae4ac71e05a13e53c3.1708933624.git.isaku.yamahata@intel.com/
[2] https://lore.kernel.org/all/3d2a6bfb033ee1b51f7b875360bd295376c32b54.1708933624.git.isaku.yamahata@intel.com/
[3] https://lore.kernel.org/all/cover.1659854957.git.isaku.yamahata@intel.com/