linux-kernel - Re: [RFC PATCH 00/21] KVM: TDX huge page support for private memory

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aAoJHXsAbdOx+ljo@yzhao56-desk.sh.intel.com>
Date: Thu, 24 Apr 2025 17:49:17 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Kirill A. Shutemov" <kirill@...temov.name>
CC: <pbonzini@...hat.com>, <seanjc@...gle.com>,
	<linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>, <x86@...nel.org>,
	<rick.p.edgecombe@...el.com>, <dave.hansen@...el.com>,
	<kirill.shutemov@...el.com>, <tabba@...gle.com>, <ackerleytng@...gle.com>,
	<quic_eberman@...cinc.com>, <michael.roth@....com>, <david@...hat.com>,
	<vannapurve@...gle.com>, <vbabka@...e.cz>, <jroedel@...e.de>,
	<thomas.lendacky@....com>, <pgonda@...gle.com>, <zhiquan1.li@...el.com>,
	<fan.du@...el.com>, <jun.miao@...el.com>, <ira.weiny@...el.com>,
	<chao.p.peng@...el.com>
Subject: Re: [RFC PATCH 00/21] KVM: TDX huge page support for private memory

On Thu, Apr 24, 2025 at 12:05:15PM +0300, Kirill A. Shutemov wrote:
> On Thu, Apr 24, 2025 at 04:33:13PM +0800, Yan Zhao wrote:
> > On Thu, Apr 24, 2025 at 10:35:47AM +0300, Kirill A. Shutemov wrote:
> > > On Thu, Apr 24, 2025 at 11:00:32AM +0800, Yan Zhao wrote:
> > > > Basic huge page mapping/unmapping
> > > > ---------------------------------
> > > > - TD build time
> > > >   This series enforces that all private mappings be 4KB during the TD build
> > > >   phase, due to the TDX module's requirement that tdh_mem_page_add(), the
> > > >   SEAMCALL for adding private pages during TD build time, only supports 4KB
> > > >   mappings. Enforcing 4KB mappings also simplifies the implementation of
> > > >   code for TD build time, by eliminating the need to consider merging or
> > > >   splitting in the mirror page table during TD build time.
> > > >   
> > > >   The underlying pages allocated from guest_memfd during TD build time
> > > >   phase can still be large, allowing for potential merging into 2MB
> > > >   mappings once the TD is running.
> > > 
> > > It can be done before TD is running. The merging is allowed on TD build
> > > stage.
> > > 
> > > But, yes, for simplicity we can skip it for initial enabling.
> > Yes, to avoid complicating kvm_tdx->nr_premapped calculation.
> > I also don't see any benefit to allow merging during TD build stage.
> > 
> > > 
> > > > Page splitting (page demotion)
> > > > ------------------------------
> > > > Page splitting occurs in two paths:
> > > > (a) with exclusive kvm->mmu_lock, triggered by zapping operations,
> > > > 
> > > >     For normal VMs, if zapping a narrow region that would need to split a
> > > >     huge page, KVM can simply zap the surrounding GFNs rather than
> > > >     splitting a huge page. The pages can then be faulted back in, where KVM
> > > >     can handle mapping them at a 4KB level.
> > > > 
> > > >     The reason why TDX can't use the normal VM solution is that zapping
> > > >     private memory that is accepted cannot easily be re-faulted, since it
> > > >     can only be re-faulted as unaccepted. So KVM will have to sometimes do
> > > >     the page splitting as part of the zapping operations.
> > > > 
> > > >     These zapping operations can occur for few reasons:
> > > >     1. VM teardown.
> > > >     2. Memslot removal.
> > > >     3. Conversion of private pages to shared.
> > > >     4. Userspace does a hole punch to guest_memfd for some reason.
> > > > 
> > > >     For case 1 and 2, splitting before zapping is unnecessary because
> > > >     either the entire range will be zapped or huge pages do not span
> > > >     memslots.
> > > >     
> > > >     Case 3 or case 4 requires splitting, which is also followed by a
> > > >     backend page splitting in guest_memfd.
> > > > 
> > > > (b) with shared kvm->mmu_lock, triggered by fault.
> > > > 
> > > >     Splitting in this path is not accompanied by a backend page splitting
> > > >     (since backend page splitting necessitates a splitting and zapping
> > > >      operation in the former path).  It is triggered when KVM finds that a
> > > >     non-leaf entry is replacing a huge entry in the fault path, which is
> > > >     usually caused by vCPUs' concurrent ACCEPT operations at different
> > > >     levels.
> > > 
> > > Hm. This sounds like funky behaviour on the guest side.
> > > 
> > > You only saw it in a synthetic test, right? No real guest OS should do
> > > this.
> > Right. In selftest only.
> > Also in case of any guest bugs.
> > 
> > > It can only be possible if guest is reckless enough to be exposed to
> > > double accept attacks.
> > > 
> > > We should consider putting a warning if we detect such case on KVM side.
> > Is it acceptable to put warnings in host kernel in case of guest bugs or
> > attacks?
> 
> pr_warn_once() shouldn't be a big deal.
My previous learning is that even a per-VM warning is not desired.
Maybe Rick or anyone else could chime in.
Compared to warning, what about an exit to userspace for further handling?

Another thing is that there may not be an easy way for KVM to differentiate if a
splitting is caused by two competing 4K vs 2M ACCEPT requests or an operation you
deemed valid below. Guests could turn on #VE dynamically.

> > > >     This series simply ignores the splitting request in the fault path to
> > > >     avoid unnecessary bounces between levels. The vCPU that performs ACCEPT
> > > >     at a lower level would finally figures out the page has been accepted
> > > >     at a higher level by another vCPU.
> > > > 
> > > >     A rare case that could lead to splitting in the fault path is when a TD
> > > >     is configured to receive #VE and accesses memory before the ACCEPT
> > > >     operation. By the time a vCPU accesses a private GFN, due to the lack
> > > >     of any guest preferred level, KVM could create a mapping at 2MB level.
> > > >     If the TD then only performs the ACCEPT operation at 4KB level,
> > > >     splitting in the fault path will be triggered. However, this is not
> > > >     regarded as a typical use case, as usually TD always accepts pages in
> > > >     the order from 1GB->2MB->4KB. The worst outcome to ignore the resulting
> > > >     splitting request is an endless EPT violation. This would not happen
> > > >     for a Linux guest, which does not expect any #VE.
> > > 
> > > Even if guest accepts memory in response to #VE, it still has to serialize
> > > ACCEPT requests to the same memory block. And track what has been
> > > accepted.
> > > 
> > > Double accept is a guest bug.
> > In the rare case, there're no double accept.
> > 1. Guest acceses a private GPA
> > 2. KVM creates a 2MB mapping in PENDING state and returns to guest.
> > 3. Guest re-accesses, causing the TDX module to inject a #VE.
> > 4. Guest accepts at 4KB level only.
> > 5. EPT violation to KVM for page splitting.
> > 
> > Here, we expect a normal guest to accept from GB->2MB->4KB in step 4.
> 
> Okay, I think I misunderstood this case. I thought there is competing 4k
> vs 2M ACCEPT requests to the same memory block.
> 
> Accepting everything at 4k level is a stupid, but valid behaviour on the
> guest behalf. This splitting case has to be supported before the patchset
> hits the mainline.
Hmm. If you think this is a valid behavior, patches to introduce more locks
in TDX are required :)

Not sure about the value of supporting it though, as it's also purely
hypothetical and couldn't exist in Linux guests.

> BTW, there's no 1G ACCEPT. I know that guest is written as if it is a
> thing, but TDX module only supports 4k and 2M. 1G is only reachable via
> promotion.
Ah, you are right. I re-checked the TDX module code, yes, it returns error on
1G level.