Message-ID: <aHCdRF10S0fU/EY2@yzhao56-desk>
Date: Fri, 11 Jul 2025 13:12:36 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
CC: "ackerleytng@...gle.com" <ackerleytng@...gle.com>, "Shutemov, Kirill"
<kirill.shutemov@...el.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>, "Du, Fan"
<fan.du@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com"
<thomas.lendacky@....com>, "vbabka@...e.cz" <vbabka@...e.cz>, "Li, Zhiquan1"
<zhiquan1.li@...el.com>, "quic_eberman@...cinc.com"
<quic_eberman@...cinc.com>, "michael.roth@....com" <michael.roth@....com>,
"seanjc@...gle.com" <seanjc@...gle.com>, "Weiny, Ira" <ira.weiny@...el.com>,
"Peng, Chao P" <chao.p.peng@...el.com>, "pbonzini@...hat.com"
<pbonzini@...hat.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"tabba@...gle.com" <tabba@...gle.com>, "kvm@...r.kernel.org"
<kvm@...r.kernel.org>, "binbin.wu@...ux.intel.com"
<binbin.wu@...ux.intel.com>, "Annapurve, Vishal" <vannapurve@...gle.com>,
"jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun" <jun.miao@...el.com>,
"pgonda@...gle.com" <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge
pages
On Fri, Jul 11, 2025 at 09:46:45AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-07-08 at 14:19 -0700, Ackerley Tng wrote:
> > > Ok sounds good. Should we just continue the discussion there?
> >
> > I think we're at a point where further discussion isn't really
> > useful. Kirill didn't seem worried about using HWpoison, so that's a
> > good sign. I think we can go ahead to use HWpoison for the next RFC of
> > this series and we might learn more through the process of testing it.
> >
> > Do you prefer to just wait till the next guest_memfd call (now
> > rescheduled to 2025-07-17) before proceeding?
>
> Ah, I missed this and joined the call. :)
>
> At this point, I think I'm strongly in favor of not doing anything here.
>
> Yan and I had a discussion on our internal team chat about this. I'll summarize:
>
> Yan confirmed to me again that there isn't a specific expected failure here. We
> are talking about bugs generating the invalidation failure and leaving the page
> mapped. But in the case of a bug in a normal VM, a page can be left mapped too.
>
> What is different here is that we have something (a return code) to check that
> could catch some of the bugs. But this isn't the only case where a SEAMCALL has
> a spec-defined error that we can't handle in a no-fail code path. In those
> other cases, we handle them by making sure the error won't happen and by
> triggering a KVM_BUG_ON() if it does anyway. We can be consistent by just doing
> the same thing in this case. Implementing it looks like just removing the
> refcounting in the current code.
>
> And this KVM_BUG_ON() will lead to a situation almost like unmapping anyway,
> since the TD can no longer be entered. With the future VM shutdown work, the
> pages will usually not be zeroed at shutdown either. So we should not always
> expect crashes if those pages are returned to the page allocator, even if a
> bug turns up. Additionally, KVM_BUG_ON() will leave a loud warning, allowing
> us to fix the bug.
>
> But Yan raised a point that might make it worth doing something for this case.
> On the partial write errata platforms (a TDX-specific thing), pages that are
> reclaimed need to be zeroed. So to more cleanly handle this subset of
> catchable bugs we are focused on, we could zero the page after the
> KVM_BUG_ON(). But this still needs to be weighed against how much we want to
> add code to address potential bugs.
>
>
> So on the benefit side, the value is very low to me. The other side is the
> cost side, which I think is maybe actually the stronger case. We can only make
> TDX a special case so many times before we run into upstream problems. Not to
> lean on Sean here, but he bangs this drum. If we find a case where we have to
> add any specialness for TDX (i.e. making it the only thing that sets the
> poison bit manually), we should look at changing the TDX arch to address it.
> I'm not sure what that looks like, but we haven't really tried too hard in
> that direction yet.
>
> So if TDX has a limited number of "gets to be special" cards, I don't think it
> is prudent to spend one on something that is this much of an edge case. So our
> plan is to rely on the KVM_BUG_ON() for now, and to consider TDX arch changes
> (currently unknown) for how to make the situation cleaner somehow.
>
> Yan, is that your recollection? I guess the other points were that although TDX
I'm ok if KVM_BUG_ON() is considered loud enough to warn about the rare
potential corruption, thereby making TDX less special.
> doesn't need it today, for long term, userspace ABI around invalidations should
> support failure. But the actual gmem/kvm interface for this can be figured out
Could you elaborate on what's included in the userspace ABI around invalidations?
I'm a bit confused, as I think the userspace ABI today already supports failure.
Currently, the unmap API between gmem and KVM does not support failure.
In the future, we hope gmem can check whether KVM allows a page to be unmapped
before triggering the actual unmap.
> later. And that external EPT specific TDP MMU code could be tweaked to make
> things work a little safer around this.