Message-ID: <53ea5239f8ef9d8df9af593647243c10435fd219.camel@intel.com>
Date: Fri, 11 Jul 2025 01:46:45 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "ackerleytng@...gle.com" <ackerleytng@...gle.com>, "Zhao, Yan Y"
<yan.y.zhao@...el.com>
CC: "Shutemov, Kirill" <kirill.shutemov@...el.com>, "Li, Xiaoyao"
<xiaoyao.li@...el.com>, "Du, Fan" <fan.du@...el.com>, "Hansen, Dave"
<dave.hansen@...el.com>, "david@...hat.com" <david@...hat.com>,
"thomas.lendacky@....com" <thomas.lendacky@....com>, "vbabka@...e.cz"
<vbabka@...e.cz>, "Li, Zhiquan1" <zhiquan1.li@...el.com>,
"quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "michael.roth@....com"
<michael.roth@....com>, "seanjc@...gle.com" <seanjc@...gle.com>, "Weiny, Ira"
<ira.weiny@...el.com>, "Peng, Chao P" <chao.p.peng@...el.com>,
"pbonzini@...hat.com" <pbonzini@...hat.com>, "Yamahata, Isaku"
<isaku.yamahata@...el.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "tabba@...gle.com" <tabba@...gle.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "binbin.wu@...ux.intel.com"
<binbin.wu@...ux.intel.com>, "Annapurve, Vishal" <vannapurve@...gle.com>,
"jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun" <jun.miao@...el.com>,
"pgonda@...gle.com" <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge
pages
On Tue, 2025-07-08 at 14:19 -0700, Ackerley Tng wrote:
> > Ok sounds good. Should we just continue the discussion there?
>
> I think we're at a point where further discussion isn't really
> useful. Kirill didn't seem worried about using HWpoison, so that's a
> good sign. I think we can go ahead to use HWpoison for the next RFC of
> this series and we might learn more through the process of testing it.
>
> Do you prefer to just wait till the next guest_memfd call (now
> rescheduled to 2025-07-17) before proceeding?

Ah, I missed this and joined the call. :)

At this point, I think I'm strongly in favor of not doing anything here.

Yan and I had a discussion on our internal team chat about this. I'll summarize:

Yan confirmed to me again that there isn't a specific expected failure here. We
are talking about bugs that cause the invalidation to fail and leave the page
mapped. But a bug in a normal VM can also leave a page mapped.

What is different here is that we have something (a return code) to check that
could catch some of the bugs. But this isn't the only case where a SEAMCALL has
a spec-defined error that we can't handle in a no-fail code path. In those other
cases, we handle them by making sure the error won't happen and triggering a
KVM_BUG_ON() if it does anyway. We can be consistent by just doing the same
thing in this case. Implementing it looks like just removing the refcounting in
the current code.
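
To make the pattern concrete, here is a rough user-space model of "make sure the
error can't happen, and warn and kill the VM if it does anyway". Every name in
it (toy_vm, toy_bug_on(), toy_seamcall_remove_page(), and so on) is made up for
illustration. It is not the real KVM or TDX code, just a sketch of the shape.

```c
/* User-space model only; none of these names are real KVM/TDX symbols. */
#include <stdbool.h>
#include <stdio.h>

struct toy_vm {
	bool bugged;	/* once set, the VM can no longer be entered */
};

/* Hypothetical stand-in for a KVM_BUG_ON()-style check: warn, mark VM dead. */
static bool toy_bug_on(bool cond, struct toy_vm *vm, const char *what)
{
	if (cond) {
		fprintf(stderr, "WARNING: %s\n", what);
		vm->bugged = true;
	}
	return cond;
}

/* Hypothetical SEAMCALL wrapper: returns 0 on success, nonzero on error. */
static unsigned long toy_seamcall_remove_page(bool inject_error)
{
	return inject_error ? 1UL : 0UL;
}

/* No-fail path: void return, so the caller has no way to handle an error. */
static void toy_invalidate_page(struct toy_vm *vm, bool inject_error)
{
	unsigned long err = toy_seamcall_remove_page(inject_error);

	/* No refcounting or poisoning here, only the loud check. */
	toy_bug_on(err != 0, vm, "page removal failed in no-fail path");
}

/* Entering a vCPU is refused once the VM has been marked as bugged. */
static bool toy_vcpu_enter(const struct toy_vm *vm)
{
	return !vm->bugged;
}
```

In this model, a failed removal leaves a warning behind and toy_vcpu_enter()
returns false from then on, which is the "almost like unmapping" effect.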

And this KVM_BUG_ON() will lead to a situation almost like unmapping anyway,
since the TD can no longer be entered. With future VM shutdown work, the pages
will usually not be zeroed at shutdown either. So we should not always expect
crashes if those pages are returned to the page allocator, even if a bug turns
up. Additionally, KVM_BUG_ON() will leave a loud warning, allowing us to fix the
bug.

But Yan raised a point that might be worth doing something about for this case.
On the partial write errata platforms (a TDX-specific thing), pages that are
reclaimed need to be zeroed. So to more cleanly handle this subset of catchable
bugs we are focused on, we could zero the page after the KVM_BUG_ON(). But this
still needs to be weighed against how much code we want to add to address
potential bugs.

So the benefit side is very low to me. The other side is the cost side, which I
think is maybe actually the stronger case. We can only make TDX a special case
so many times before we run into upstream problems. Not to lean on Sean here,
but he bangs this drum. If we find that we have a case where we have to add any
specialness for TDX (e.g. making it the only thing that sets the poison bit
manually), we should look at changing the TDX arch to address it. I'm not sure
what that looks like, but we haven't really tried too hard in that direction
yet.

So if TDX has a limited number of "gets to be special" cards, I don't think it
is prudent to spend one on something this much of an edge case. So our plan is
to rely on the KVM_BUG_ON() for now, and to consider TDX arch changes (currently
unknown) to make the situation cleaner.

Yan, is that your recollection? I guess the other points were that although TDX
doesn't need it today, in the long term the userspace ABI around invalidations
should support failure. But the actual gmem/kvm interface for this can be
figured out later. And the external EPT specific TDP MMU code could be tweaked
to make things a little safer around this.