Message-ID: <6vdj4mfxlyvypn743klxq5twda66tkugwzljdt275rug2gmwwl@zdziylxpre6y>
Date: Thu, 24 Apr 2025 12:05:15 +0300
From: "Kirill A. Shutemov" <kirill@...temov.name>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: pbonzini@...hat.com, seanjc@...gle.com, linux-kernel@...r.kernel.org, 
	kvm@...r.kernel.org, x86@...nel.org, rick.p.edgecombe@...el.com, 
	dave.hansen@...el.com, kirill.shutemov@...el.com, tabba@...gle.com, 
	ackerleytng@...gle.com, quic_eberman@...cinc.com, michael.roth@....com, david@...hat.com, 
	vannapurve@...gle.com, vbabka@...e.cz, jroedel@...e.de, thomas.lendacky@....com, 
	pgonda@...gle.com, zhiquan1.li@...el.com, fan.du@...el.com, jun.miao@...el.com, 
	ira.weiny@...el.com, chao.p.peng@...el.com
Subject: Re: [RFC PATCH 00/21] KVM: TDX huge page support for private memory

On Thu, Apr 24, 2025 at 04:33:13PM +0800, Yan Zhao wrote:
> On Thu, Apr 24, 2025 at 10:35:47AM +0300, Kirill A. Shutemov wrote:
> > On Thu, Apr 24, 2025 at 11:00:32AM +0800, Yan Zhao wrote:
> > > Basic huge page mapping/unmapping
> > > ---------------------------------
> > > - TD build time
> > >   This series enforces that all private mappings be 4KB during the TD
> > >   build phase, because tdh_mem_page_add(), the SEAMCALL for adding
> > >   private pages during TD build time, only supports 4KB mappings.
> > >   Enforcing 4KB mappings also simplifies the TD-build-time code by
> > >   eliminating the need to consider merging or splitting in the mirror
> > >   page table during that phase.
> > >   
> > >   The underlying pages allocated from guest_memfd during the TD build
> > >   phase can still be large, allowing for potential merging into 2MB
> > >   mappings once the TD is running.
> > 
> > It can be done before the TD is running. Merging is allowed at the TD
> > build stage.
> > 
> > But, yes, for simplicity we can skip it for initial enabling.
> Yes, to avoid complicating the kvm_tdx->nr_premapped calculation.
> I also don't see any benefit in allowing merging during the TD build stage.
> 
> > 
> > > Page splitting (page demotion)
> > > ------------------------------
> > > Page splitting occurs in two paths:
> > > (a) with exclusive kvm->mmu_lock, triggered by zapping operations,
> > > 
> > >     For normal VMs, if zapping a narrow region that would need to split a
> > >     huge page, KVM can simply zap the surrounding GFNs rather than
> > >     splitting a huge page. The pages can then be faulted back in, where KVM
> > >     can handle mapping them at a 4KB level.
> > > 
> > >     The reason TDX can't use the normal-VM solution is that private
> > >     memory that has been accepted cannot easily be re-faulted after
> > >     zapping, since it can only be re-faulted as unaccepted. So KVM
> > >     sometimes has to do the page splitting as part of the zapping
> > >     operation.
> > > 
> > >     These zapping operations can occur for a few reasons:
> > >     1. VM teardown.
> > >     2. Memslot removal.
> > >     3. Conversion of private pages to shared.
> > >     4. Userspace does a hole punch to guest_memfd for some reason.
> > > 
> > >     For cases 1 and 2, splitting before zapping is unnecessary because
> > >     either the entire range will be zapped or huge pages do not span
> > >     memslots.
> > >     
> > >     Cases 3 and 4 require splitting, which is then followed by a
> > >     backend page split in guest_memfd.
> > > 
> > > (b) with shared kvm->mmu_lock, triggered by fault.
> > > 
> > >     Splitting in this path is not accompanied by a backend page split
> > >     in guest_memfd (since a backend page split necessitates a
> > >     splitting-and-zapping operation in the former path). It is
> > >     triggered when KVM finds a non-leaf entry replacing a huge entry
> > >     in the fault path, which is usually caused by vCPUs' concurrent
> > >     ACCEPT operations at different levels.
> > 
> > Hm. This sounds like funky behaviour on the guest side.
> > 
> > You only saw it in a synthetic test, right? No real guest OS should do
> > this.
> Right, in selftests only.
> It could also happen in case of guest bugs.
> 
> > It's only possible if the guest is reckless enough to be exposed to
> > double-accept attacks.
> > 
> > We should consider putting a warning on the KVM side if we detect such a case.
> Is it acceptable to put warnings in the host kernel in case of guest bugs
> or attacks?

pr_warn_once() shouldn't be a big deal.

> > >     This series simply ignores the splitting request in the fault path
> > >     to avoid unnecessary bounces between levels. The vCPU performing
> > >     ACCEPT at a lower level will eventually figure out that the page
> > >     has been accepted at a higher level by another vCPU.
> > > 
> > >     A rare case that could lead to splitting in the fault path is when
> > >     a TD is configured to receive #VE and accesses memory before the
> > >     ACCEPT operation. By the time a vCPU accesses a private GFN, in the
> > >     absence of any guest-preferred level, KVM could create a mapping at
> > >     the 2MB level. If the TD then performs the ACCEPT operation only at
> > >     the 4KB level, splitting in the fault path will be triggered.
> > >     However, this is not regarded as a typical use case, as a TD
> > >     usually accepts pages in 1GB->2MB->4KB order. The worst outcome of
> > >     ignoring the resulting splitting request is an endless EPT
> > >     violation. This would not happen for a Linux guest, which does not
> > >     expect any #VE.
> > 
> > Even if the guest accepts memory in response to #VE, it still has to
> > serialize ACCEPT requests to the same memory block, and track what has
> > been accepted.
> > 
> > Double accept is a guest bug.
> In the rare case, there's no double accept:
> 1. The guest accesses a private GPA.
> 2. KVM creates a 2MB mapping in PENDING state and returns to the guest.
> 3. The guest re-accesses, causing the TDX module to inject a #VE.
> 4. The guest accepts at the 4KB level only.
> 5. An EPT violation exits to KVM for page splitting.
> 
> Here, we expect a normal guest to accept in 1GB->2MB->4KB order in step 4.

Okay, I think I misunderstood this case. I thought there were competing 4k
vs 2M ACCEPT requests to the same memory block.

Accepting everything at the 4k level is stupid, but valid, behaviour on the
guest's behalf. This splitting case has to be supported before the patchset
hits mainline.

BTW, there's no 1G ACCEPT. I know the guest is written as if it were a
thing, but the TDX module only supports 4k and 2M. 1G is only reachable
via promotion.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
