Message-ID: <aLILRk6252a3-iKJ@google.com>
Date: Fri, 29 Aug 2025 13:19:18 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Rick P Edgecombe <rick.p.edgecombe@...el.com>
Cc: "pbonzini@...hat.com" <pbonzini@...hat.com>, Kai Huang <kai.huang@...el.com>, 
	"ackerleytng@...gle.com" <ackerleytng@...gle.com>, Vishal Annapurve <vannapurve@...gle.com>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, Yan Y Zhao <yan.y.zhao@...el.com>, 
	Ira Weiny <ira.weiny@...el.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>, 
	"michael.roth@....com" <michael.roth@....com>
Subject: Re: [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in
 S-EPT management

On Fri, Aug 29, 2025, Rick P Edgecombe wrote:
> On Thu, 2025-08-28 at 17:06 -0700, Sean Christopherson wrote:
> > From: Yan Zhao <yan.y.zhao@...el.com>
> > 
> > Don't explicitly pin pages when mapping pages into the S-EPT, as guest_memfd
> > doesn't support page migration in any capacity, i.e. there are no migrate
> > callbacks because guest_memfd pages *can't* be migrated.  See the WARN in
> > kvm_gmem_migrate_folio().
> > 
> > Eliminating TDX's explicit pinning will also enable guest_memfd to support
> > in-place conversion between shared and private memory[1][2].  Because KVM
> > cannot distinguish between speculative/transient refcounts and the
> > intentional refcount for TDX on private pages[3], failing to release the
> > private page refcount in TDX could cause guest_memfd to wait indefinitely
> > for the refcount to drop before splitting.
> > 
> > Under normal conditions, not holding an extra page refcount in TDX is safe
> > because guest_memfd ensures pages are retained until its invalidation
> > notification to KVM MMU is completed. However, if there are bugs in KVM or
> > the TDX module, not holding an extra refcount when a page is mapped in the
> > S-EPT could result in a page being released from guest_memfd while still
> > mapped in the S-EPT.  But, doing work to make a fatal error slightly less
> > fatal is a net negative when that extra work adds complexity and confusion.
> > 
> > Several approaches were considered to address the refcount issue, including:
> >  - Attempting to modify the KVM unmap operation to return a failure,
> >    which was deemed too complex and potentially incorrect[4].
> >  - Increasing the folio reference count only upon S-EPT zapping failure[5].
> >  - Using page flags or page_ext to indicate a page is still used by TDX[6],
> >    which does not work for HVO (HugeTLB Vmemmap Optimization).
> >  - Setting the HWPOISON bit or leveraging folio_set_hugetlb_hwpoison()[7].
> > 
> > Due to the complexity or inappropriateness of these approaches, and the
> > fact that S-EPT zapping failure is currently only possible when there are
> > bugs in KVM or the TDX module, which is very rare in a production kernel,
> > the straightforward approach of simply not holding the page reference count
> > in TDX was chosen[8].
> > 
> > When S-EPT zapping errors occur, KVM_BUG_ON() is invoked to kick off all
> > vCPUs and mark the VM as dead. Although there is a potential window in which
> > a private page mapped in the S-EPT could be reallocated and used outside the
> > VM, the loud warning from KVM_BUG_ON() should provide sufficient debug
> > information.
> > 
> 
> Yea, in the case of a bug, there could be a use-after-free. This logic applies
> to all code that has allocations including the entire KVM MMU. But in this case,
> we can actually catch the use-after-free scenario under scrutiny and not have it
> happen silently, which does not apply to all code. But the special case here is
> that the use-after-free depends on TDX module logic which is not part of the
> kernel.
> 
> Yan, can you clarify what you mean by "there could be a small window"? I'm
> thinking this is a hypothetical window around vm_dead races? Or more concrete? I
> *don't* want to re-open the debate on whether to go with this approach, but I
> think this is a good teaching edge case to settle on how we want to treat
> similar issues. So I just want to make sure we have the justification right.

The first paragraph is all the justification we need.  Seriously.  Bad things
will happen if you have UAF bugs, news at 11!

I'm all for defensive programming, but pinning pages goes too far, because that
itself can be dangerous, e.g. see commit 2bcb52a3602b ("KVM: Pin (as in FOLL_PIN)
pages during kvm_vcpu_map()") and the many messes KVM created with respect to
struct page refcounts.

I'm happy to include more context in the changelog, but I really don't want
anyone to walk away from this thinking that pinning pages in random KVM code is
at all encouraged.
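
To illustrate the refcount argument from the changelog quoted above, here is a
minimal userspace sketch in C.  It is not kernel code: toy_folio, toy_get(),
toy_put(), toy_wait_for_split() and toy_scenario() are all hypothetical names
made up for illustration, and the polling loop is only an assumed stand-in for
however guest_memfd actually waits.  It shows that a waiter which cannot tell a
transient reference from a long-lived pin makes progress once the transient
reference is dropped, but never completes while the pin is held.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a folio; only the reference count is modeled. */
struct toy_folio {
	int refcount;
};

void toy_get(struct toy_folio *f)
{
	f->refcount++;
}

void toy_put(struct toy_folio *f)
{
	assert(f->refcount > 0);
	f->refcount--;
}

/*
 * Hypothetical model of a conversion/split that may only proceed once
 * nobody but the owner holds a reference.  The waiter sees just a number,
 * so it cannot tell who holds the extra references; all it can do is
 * retry.  transient_refs is the number of speculative references that
 * will be dropped on their own, one per poll.
 *
 * Returns the number of polls that failed before success, or -1 if the
 * poll budget ran out.
 */
int toy_wait_for_split(struct toy_folio *f, int owner_refs,
		       int transient_refs, int max_polls)
{
	for (int polls = 0; polls < max_polls; polls++) {
		if (f->refcount == owner_refs)
			return polls;

		/* A speculative holder eventually drops its reference. */
		if (transient_refs > 0) {
			toy_put(f);
			transient_refs--;
		}
	}
	return -1;
}

/* Run the model with or without an extra long-lived pin (assumed: TDX). */
int toy_scenario(bool tdx_pins_page)
{
	struct toy_folio f = { .refcount = 1 };	/* owner's reference */

	toy_get(&f);			/* transient, speculative reference */
	if (tdx_pins_page)
		toy_get(&f);		/* long-lived pin, never released */

	return toy_wait_for_split(&f, 1, 1, 1000);
}
```

In this model toy_scenario(false) returns 1, i.e. a single retry while the
transient reference goes away, whereas toy_scenario(true) returns -1 because
the pin never drops no matter how large the poll budget is.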
