linux-kernel - Re: [RFC PATCH 00/21] KVM: TDX huge page support for private memory

Message-ID: <aAn3SSocw0XvaRye@yzhao56-desk.sh.intel.com>
Date: Thu, 24 Apr 2025 16:33:13 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Kirill A. Shutemov" <kirill@...temov.name>
CC: <pbonzini@...hat.com>, <seanjc@...gle.com>,
	<linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>, <x86@...nel.org>,
	<rick.p.edgecombe@...el.com>, <dave.hansen@...el.com>,
	<kirill.shutemov@...el.com>, <tabba@...gle.com>, <ackerleytng@...gle.com>,
	<quic_eberman@...cinc.com>, <michael.roth@....com>, <david@...hat.com>,
	<vannapurve@...gle.com>, <vbabka@...e.cz>, <jroedel@...e.de>,
	<thomas.lendacky@....com>, <pgonda@...gle.com>, <zhiquan1.li@...el.com>,
	<fan.du@...el.com>, <jun.miao@...el.com>, <ira.weiny@...el.com>,
	<chao.p.peng@...el.com>
Subject: Re: [RFC PATCH 00/21] KVM: TDX huge page support for private memory

On Thu, Apr 24, 2025 at 10:35:47AM +0300, Kirill A. Shutemov wrote:
> On Thu, Apr 24, 2025 at 11:00:32AM +0800, Yan Zhao wrote:
> > Basic huge page mapping/unmapping
> > ---------------------------------
> > - TD build time
> >   This series enforces that all private mappings be 4KB during the TD build
> >   phase, due to the TDX module's requirement that tdh_mem_page_add(), the
> >   SEAMCALL for adding private pages during TD build time, only supports 4KB
> >   mappings. Enforcing 4KB mappings also simplifies the implementation of
> >   code for TD build time, by eliminating the need to consider merging or
> >   splitting in the mirror page table during TD build time.
> >   
> >   The underlying pages allocated from guest_memfd during TD build time
> >   phase can still be large, allowing for potential merging into 2MB
> >   mappings once the TD is running.
> 
> It can be done before TD is running. The merging is allowed on TD build
> stage.
> 
> But, yes, for simplicity we can skip it for initial enabling.
Yes, this avoids complicating the kvm_tdx->nr_premapped calculation.
I also don't see any benefit in allowing merging during the TD build stage.

> 
> > Page splitting (page demotion)
> > ------------------------------
> > Page splitting occurs in two paths:
> > (a) with exclusive kvm->mmu_lock, triggered by zapping operations,
> > 
> >     For normal VMs, if zapping a narrow region that would need to split a
> >     huge page, KVM can simply zap the surrounding GFNs rather than
> >     splitting a huge page. The pages can then be faulted back in, where KVM
> >     can handle mapping them at a 4KB level.
> > 
> >     The reason why TDX can't use the normal VM solution is that zapping
> >     private memory that is accepted cannot easily be re-faulted, since it
> >     can only be re-faulted as unaccepted. So KVM will have to sometimes do
> >     the page splitting as part of the zapping operations.
> > 
> >     These zapping operations can occur for a few reasons:
> >     1. VM teardown.
> >     2. Memslot removal.
> >     3. Conversion of private pages to shared.
> >     4. Userspace does a hole punch to guest_memfd for some reason.
> > 
> >     For cases 1 and 2, splitting before zapping is unnecessary because
> >     either the entire range will be zapped or huge pages do not span
> >     memslots.
> >     
> >     Case 3 or case 4 requires splitting, which is also followed by a
> >     backend page splitting in guest_memfd.
> > 
> > (b) with shared kvm->mmu_lock, triggered by fault.
> > 
> >     Splitting in this path is not accompanied by a backend page splitting
> >     (since backend page splitting necessitates a splitting and zapping
> >      operation in the former path).  It is triggered when KVM finds that a
> >     non-leaf entry is replacing a huge entry in the fault path, which is
> >     usually caused by vCPUs' concurrent ACCEPT operations at different
> >     levels.
> 
> Hm. This sounds like funky behaviour on the guest side.
> 
> You only saw it in a synthetic test, right? No real guest OS should do
> this.
Right, only in the selftest.
It could also happen in case of guest bugs.

> It can only be possible if guest is reckless enough to be exposed to
> double accept attacks.
> 
> We should consider putting a warning if we detect such case on KVM side.
Is it acceptable to put warnings in the host kernel in case of guest bugs
or attacks?


> >     This series simply ignores the splitting request in the fault path to
> >     avoid unnecessary bounces between levels. The vCPU that performs ACCEPT
> >     at a lower level would finally figure out the page has been accepted
> >     at a higher level by another vCPU.
> > 
> >     A rare case that could lead to splitting in the fault path is when a TD
> >     is configured to receive #VE and accesses memory before the ACCEPT
> >     operation. By the time a vCPU accesses a private GFN, due to the lack
> >     of any guest preferred level, KVM could create a mapping at 2MB level.
> >     If the TD then only performs the ACCEPT operation at 4KB level,
> >     splitting in the fault path will be triggered. However, this is not
> >     regarded as a typical use case, as a TD usually accepts pages in the
> >     order 1GB->2MB->4KB. The worst outcome of ignoring the resulting
> >     splitting request is an endless EPT violation. This would not happen
> >     for a Linux guest, which does not expect any #VE.
> 
> Even if guest accepts memory in response to #VE, it still has to serialize
> ACCEPT requests to the same memory block. And track what has been
> accepted.
> 
> Double accept is a guest bug.
In the rare case, there is no double accept:
1. Guest accesses a private GPA.
2. KVM creates a 2MB mapping in PENDING state and returns to guest.
3. Guest re-accesses, causing the TDX module to inject a #VE.
4. Guest accepts at 4KB level only.
5. EPT violation to KVM for page splitting.

Here, we expect a normal guest to accept in the order 1GB->2MB->4KB in step 4.
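
The difference between the two guest behaviours in step 4 can be sketched
as a toy model. This is not KVM or TDX module code: every name below
(enum level, try_accept(), guest_accept(), the ACCEPT_* results) is
hypothetical, and the "retry at a smaller size" result assumes the guest
steps down one level when its request is larger than the existing mapping.

```c
/* Toy model only. All names are hypothetical, not KVM or TDX module APIs. */
enum level { LVL_4K = 1, LVL_2M = 2, LVL_1G = 3 };

enum accept_result {
	ACCEPT_OK,          /* request matches the PENDING mapping level */
	ACCEPT_RETRY_LOWER, /* request is larger than the mapping (assumed) */
	ACCEPT_NEEDS_SPLIT, /* request is smaller: EPT violation for splitting */
};

/* One ACCEPT attempt at level @req against a mapping at level @mapped. */
static enum accept_result try_accept(enum level mapped, enum level req)
{
	if (req == mapped)
		return ACCEPT_OK;
	if (req > mapped)
		return ACCEPT_RETRY_LOWER;
	return ACCEPT_NEEDS_SPLIT;
}

/*
 * A guest that starts accepting at @start and steps down one level at a
 * time until the attempt no longer asks it to retry at a smaller size.
 */
static enum accept_result guest_accept(enum level mapped, enum level start)
{
	for (int l = start; l >= LVL_4K; l--) {
		enum accept_result r = try_accept(mapped, (enum level)l);

		if (r != ACCEPT_RETRY_LOWER)
			return r;
	}
	return ACCEPT_RETRY_LOWER;
}
```

With a 2MB PENDING mapping (step 2), guest_accept(LVL_2M, LVL_1G) models
the expected 1GB->2MB->4KB order and ends at the 2MB attempt without any
splitting request, while guest_accept(LVL_2M, LVL_4K) models the
4KB-only guest of step 4 and lands in the splitting case of step 5.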