Message-ID: <aLlRlbaq84IRvNPv@google.com>
Date: Thu, 4 Sep 2025 01:45:09 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Rick P Edgecombe <rick.p.edgecombe@...el.com>
Cc: Yan Y Zhao <yan.y.zhao@...el.com>, Kai Huang <kai.huang@...el.com>,
"ackerleytng@...gle.com" <ackerleytng@...gle.com>, Vishal Annapurve <vannapurve@...gle.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, Ira Weiny <ira.weiny@...el.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "michael.roth@....com" <michael.roth@....com>,
"pbonzini@...hat.com" <pbonzini@...hat.com>
Subject: Re: [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in
S-EPT management

On Tue, Sep 02, 2025, Rick P Edgecombe wrote:
> On Tue, 2025-09-02 at 10:33 -0700, Sean Christopherson wrote:
> > > Besides, a cache flush after 2 can essentially cause a memory write to the
> > > page.
> > > Though we could invoke tdh_phymem_page_wbinvd_hkid() after the KVM_BUG_ON(),
> > > the SEAMCALL itself can fail.
> >
> > I think this falls into the category of "don't screw up" flows. Failure to
> > remove a private SPTE is a near-catastrophic error. Going out of our way to
> > reduce the impact of such errors increases complexity without providing much
> > in the way of value.
> >
> > E.g. if VMCLEAR fails, KVM WARNs but continues on and hopes for the best, even
> > though there's a decent chance that failure to purge the VMCS cache entry could
> > lead to UAF-like problems. To me, this is largely the same.
> >
> > If anything, we should try to prevent #2, e.g. by marking the entire
> > guest_memfd as broken or something, and then deliberately leaking _all_ pages.
>
> There was a marathon thread on this subject.

Holy moly, you weren't kidding.

> We did discuss this option (link to
> most relevant part I could find):
> https://lore.kernel.org/kvm/a9affa03c7cdc8109d0ed6b5ca30ec69269e2f34.camel@intel.com/
>
> The high-level summary is that pinning the pages conflicts with guest_memfd's
> plans to use refcounts for other tracking purposes, while dropping the refcounts
> undermines the error-handling safety.

It also bakes even more assumptions into TDX about guest_memfd being backed by
"struct page", which I would like to avoid whenever possible.
> I strongly agree that we should not optimize for the error path at all. If we
> could bug the guest_memfd (kind of what we were discussing in that link), I think
> it would be appropriate to use in these cases. I guess the question is whether
> we're OK dropping the safety before we have a solution like that.
Definitely a "yes" from me. For this to actually cause real world problems, we'd
need a critical KVM, hardware, or TDX-Module bug, and several unlikely events to
all line up.
If someone encounters any of these KVM_BUG_ON()s _and_ has observed that the
probability of data corruption is meaningful, then we can always convert one or
more of these to full BUG_ON() conditions, but I don't see any reason to do that
without strong evidence that it's necessary.
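
To make that concrete, the policy amounts to something like the sketch below.
Again illustrative only: the function name is made up, "removal_err" stands in
for the result of the page-removal SEAMCALL, and I'm quoting the
tdh_phymem_page_wbinvd_hkid() signature from memory.

static int tdx_sept_remove_page_sketch(struct kvm *kvm, struct page *page,
				       u16 hkid, u64 removal_err)
{
	/*
	 * On failure, KVM_BUG_ON() WARNs once and marks the VM as bugged so it
	 * can never run again, and the page is deliberately leaked instead of
	 * being handed back for reuse with potentially dirty cache lines.
	 * Converting this to BUG_ON() would panic the host instead.
	 */
	if (KVM_BUG_ON(removal_err, kvm))
		return -EIO;

	/*
	 * The cache flush is itself a SEAMCALL and can also fail; there's no
	 * graceful recovery at that point, just the same "leak the page"
	 * fallback.
	 */
	if (KVM_BUG_ON(tdh_phymem_page_wbinvd_hkid(hkid, page), kvm))
		return -EIO;

	return 0;
}
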
> In that thread I was advocating for yes, partly to close it because the
> conversation was getting stuck. But there is probably a long tail of
> potential issues or ways of looking at it that could put it in the grey area.