Message-ID: <e735cpugrs3k5gncjcbjyycft3tuhkm75azpwv6ctwqfjr6gkg@rsf4lyq4gqoj>
Date: Thu, 24 Apr 2025 10:35:47 +0300
From: "Kirill A. Shutemov" <kirill@...temov.name>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: pbonzini@...hat.com, seanjc@...gle.com, linux-kernel@...r.kernel.org, 
	kvm@...r.kernel.org, x86@...nel.org, rick.p.edgecombe@...el.com, 
	dave.hansen@...el.com, kirill.shutemov@...el.com, tabba@...gle.com, 
	ackerleytng@...gle.com, quic_eberman@...cinc.com, michael.roth@....com, david@...hat.com, 
	vannapurve@...gle.com, vbabka@...e.cz, jroedel@...e.de, thomas.lendacky@....com, 
	pgonda@...gle.com, zhiquan1.li@...el.com, fan.du@...el.com, jun.miao@...el.com, 
	ira.weiny@...el.com, chao.p.peng@...el.com
Subject: Re: [RFC PATCH 00/21] KVM: TDX huge page support for private memory

On Thu, Apr 24, 2025 at 11:00:32AM +0800, Yan Zhao wrote:
> Basic huge page mapping/unmapping
> ---------------------------------
> - TD build time
>   This series enforces that all private mappings be 4KB during the TD build
>   phase, due to the TDX module's requirement that tdh_mem_page_add(), the
>   SEAMCALL for adding private pages during TD build time, only supports 4KB
>   mappings. Enforcing 4KB mappings also simplifies the TD build time code
>   by eliminating the need to consider merging or splitting in the mirror
>   page table during that phase.
>   
>   The underlying pages allocated from guest_memfd during the TD build
>   phase can still be large, allowing for potential merging into 2MB
>   mappings once the TD is running.

It can be done before the TD is running. Merging is allowed at the TD
build stage.

But, yes, for simplicity we can skip it for initial enabling.
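
For concreteness, the enforcement described in the quoted section amounts
to clamping the mapping level for private faults until the TD is
finalized. A minimal sketch of that idea, with illustrative names only
(the hook name and the "finalized" field are assumptions, not the
series' actual code):

/*
 * Illustrative sketch: clamp private mappings to 4KB while the TD is
 * still being built, since tdh_mem_page_add() only takes 4KB pages.
 */
static int tdx_private_max_mapping_level(struct kvm *kvm, gfn_t gfn)
{
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);

        if (!kvm_tdx->finalized)        /* still in the TD build phase */
                return PG_LEVEL_4K;

        return PG_LEVEL_2M;     /* huge mappings allowed once running */
}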

> Page splitting (page demotion)
> ------------------------------
> Page splitting occurs in two paths:
> (a) with exclusive kvm->mmu_lock, triggered by zapping operations.
> 
>     For normal VMs, if zapping a narrow region would require splitting a
>     huge page, KVM can simply zap the surrounding GFNs rather than split
>     the huge page. The pages can then be faulted back in, at which point
>     KVM can map them at the 4KB level.
> 
>     The reason TDX can't use the normal VM solution is that accepted
>     private memory, once zapped, cannot easily be re-faulted, since it can
>     only be re-faulted as unaccepted. So KVM will sometimes have to do the
>     page splitting as part of the zapping operations.
> 
>     These zapping operations can occur for a few reasons:
>     1. VM teardown.
>     2. Memslot removal.
>     3. Conversion of private pages to shared.
>     4. Userspace does a hole punch to guest_memfd for some reason.
> 
>     For cases 1 and 2, splitting before zapping is unnecessary because
>     either the entire range will be zapped or huge pages do not span
>     memslots.
>     
>     Case 3 or case 4 requires splitting, which is also followed by a
>     backend page split in guest_memfd.
> 
> (b) with shared kvm->mmu_lock, triggered by faults.
> 
>     Splitting in this path is not accompanied by a backend page split
>     (since a backend page split requires a split-and-zap operation in the
>     former path). It is triggered when KVM finds, in the fault path, that a
>     non-leaf entry needs to replace a huge entry, which is usually caused
>     by vCPUs performing concurrent ACCEPT operations at different levels.

Hm. This sounds like funky behaviour on the guest side.

You only saw it in a synthetic test, right? No real guest OS should do
this.

It can only be possible if the guest is reckless enough to be exposed to
double-accept attacks.

We should consider emitting a warning if we detect such a case on the KVM
side.
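
Something along these lines, at the point where the fault path notices
that a huge leaf would have to be split (all identifiers below are
illustrative, not code from this series):

/*
 * Illustrative sketch: a fault under the shared mmu_lock that asks to
 * split an already-mapped huge private page implies the guest has
 * double-accepted. Warn once and let the vCPU retry.
 */
if (is_large_pte(old_spte) && fault->goal_level < PG_LEVEL_2M) {
        pr_warn_once("KVM: TDX: split requested in fault path, guest may be double-accepting GFN 0x%llx\n",
                     fault->gfn);
        return RET_PF_RETRY;
}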

>     This series simply ignores the splitting request in the fault path to
>     avoid unnecessary bounces between levels. The vCPU that performs ACCEPT
>     at a lower level will eventually figure out that the page has already
>     been accepted at a higher level by another vCPU.
> 
>     A rare case that could lead to splitting in the fault path is when a TD
>     is configured to receive #VE and accesses memory before the ACCEPT
>     operation. When a vCPU accesses such a private GFN, KVM could create a
>     mapping at the 2MB level, since there is no guest-preferred level. If
>     the TD then performs the ACCEPT operation only at the 4KB level,
>     splitting in the fault path will be triggered. However, this is not
>     regarded as a typical use case, as a TD usually accepts pages in the
>     order 1GB->2MB->4KB. The worst outcome of ignoring the resulting
>     splitting request is an endless EPT violation. This would not happen
>     for a Linux guest, which does not expect any #VE.

Even if the guest accepts memory in response to #VE, it still has to
serialize ACCEPT requests to the same memory block, and track what has
been accepted.

Double accept is a guest bug.
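
For reference, the Linux guest already serializes this with a global lock
plus a bitmap of not-yet-accepted ranges, see
drivers/firmware/efi/unaccepted_memory.c. A simplified model of that
idea, not the actual kernel code (unit_to_phys() and the 2MB unit size
are purely illustrative):

/*
 * Simplified model of guest-side ACCEPT serialization: one lock plus
 * one bit per not-yet-accepted unit, so no block is ever passed to
 * TDG.MEM.PAGE.ACCEPT twice.
 */
static DEFINE_SPINLOCK(accept_lock);
static unsigned long *unaccepted_bitmap;        /* one bit per 2MB unit */

/* Placeholder helper: translate a bitmap index to a physical address. */
static phys_addr_t unit_to_phys(unsigned long unit)
{
        return (phys_addr_t)unit * PMD_SIZE;
}

static void guest_accept_units(unsigned long first, unsigned long nr)
{
        unsigned long i;

        spin_lock(&accept_lock);
        for (i = first; i < first + nr; i++) {
                if (!test_bit(i, unaccepted_bitmap))
                        continue;       /* someone already accepted it */
                arch_accept_memory(unit_to_phys(i), unit_to_phys(i + 1));
                clear_bit(i, unaccepted_bitmap);
        }
        spin_unlock(&accept_lock);
}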

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
