Message-ID: <aGT3GlN4cPAcOcSL@yzhao56-desk.sh.intel.com>
Date: Wed, 2 Jul 2025 17:08:42 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
CC: "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao"
<xiaoyao.li@...el.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Hansen,
Dave" <dave.hansen@...el.com>, "david@...hat.com" <david@...hat.com>,
"thomas.lendacky@....com" <thomas.lendacky@....com>, "vbabka@...e.cz"
<vbabka@...e.cz>, "tabba@...gle.com" <tabba@...gle.com>, "Shutemov, Kirill"
<kirill.shutemov@...el.com>, "michael.roth@....com" <michael.roth@....com>,
"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>, "seanjc@...gle.com"
<seanjc@...gle.com>, "Peng, Chao P" <chao.p.peng@...el.com>, "Du, Fan"
<fan.du@...el.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
"ackerleytng@...gle.com" <ackerleytng@...gle.com>, "Weiny, Ira"
<ira.weiny@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "Annapurve,
Vishal" <vannapurve@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>, "Miao,
Jun" <jun.miao@...el.com>, "Li, Zhiquan1" <zhiquan1.li@...el.com>,
"pgonda@...gle.com" <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge
pages
On Wed, Jul 02, 2025 at 12:13:42AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-07-01 at 13:01 +0800, Yan Zhao wrote:
> > > Maybe Yan can clarify here. I thought the HWpoison scenario was about TDX
> > > module
> > My thinking is to set HWPoison to private pages whenever KVM_BUG_ON() was hit
> > in
> > TDX. i.e., when the page is still mapped in S-EPT but the TD is bugged on and
> > about to tear down.
> >
> > So, it could be due to KVM or TDX module bugs, which retries can't help.
>
> We were going to call back into guestmemfd for this, right? Not set it inside
> KVM code.
Right. I think KVM calling back into guestmemfd (via a special folio flag or API)
is better than KVM setting the HWPoison flag or invoking memory_failure() or its
friends.
> What about a kvm_gmem_buggy_cleanup() instead of the system wide one. KVM calls
> it and then proceeds to bug the TD only from the KVM side. It's not as safe for
> the system, because who knows what a buggy TDX module could do. But TDX module
> could also be buggy without the kernel catching wind of it.
>
> Having a single callback to basically bug the fd would solve the atomic context
> issue. Then guestmemfd could dump the entire fd into memory_failure() instead of
> returning the pages. And developers could respond by fixing the bug.
Do you mean dumping all memory in the fd into memory_failure()? Or just the
memory in the fd that has certain folio flags set?
> IMO maintainability needs to be balanced with efforts to minimize the fallout
> from bugs. In the end a system that is too complex is going to have more bugs
> anyway.
Agreed.
To me, having KVM indicate memory corruption at the folio level (i.e. 2MB or 1GB
granularity) is acceptable.
KVM can set a flag (e.g. the flag proposed in
https://lore.kernel.org/all/aGN6GIFxh57ElHPA@yzhao56-desk.sh.intel.com).
guest_memfd can check this flag after every zap or after seeing
kvm_gmem_buggy_cleanup(). guest_memfd can choose to report memory_failure() or
leak the memory.
But I'm OK if you think dumping the entire memory inside the fd into
memory_failure() is simpler.
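To make the flag-based option concrete, below is a minimal userspace model of
the flow, not kernel code. Every name in it (model_folio, model_kvm_mark_bad(),
model_gmem_after_zap(), the policy enum) is hypothetical and exists only for
illustration. The "reported" field merely stands in for a memory_failure()
call, and the real flag and callback would look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy knob: what guest_memfd does with a folio KVM marked bad. */
enum model_gmem_policy {
	MODEL_GMEM_REPORT_FAILURE,	/* stand-in for calling memory_failure() */
	MODEL_GMEM_LEAK,		/* never hand the folio back to the host */
};

/* Hypothetical per-folio state (one entry per 2MB or 1GB folio). */
struct model_folio {
	bool kvm_marked_bad;	/* set by KVM, e.g. after hitting KVM_BUG_ON() */
	bool reported;		/* modelled memory_failure() was invoked */
	bool leaked;		/* folio intentionally kept out of circulation */
	bool reusable;		/* folio returned to the host for reuse */
};

/* KVM side: only records the problem; no memory_failure() call from KVM. */
void model_kvm_mark_bad(struct model_folio *folio)
{
	folio->kvm_marked_bad = true;
}

/*
 * guest_memfd side: run after a zap (or after a buggy-cleanup callback).
 * Unmarked folios become reusable; marked folios are reported or leaked
 * according to the policy. Returns the number of reusable folios.
 */
size_t model_gmem_after_zap(struct model_folio *folios, size_t nr,
			    enum model_gmem_policy policy)
{
	size_t nr_reusable = 0;

	assert(folios != NULL || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		struct model_folio *folio = &folios[i];

		if (!folio->kvm_marked_bad) {
			folio->reusable = true;
			nr_reusable++;
			continue;
		}

		if (policy == MODEL_GMEM_REPORT_FAILURE)
			folio->reported = true;
		else
			folio->leaked = true;
	}

	return nr_reusable;
}
```

The point of the split is that KVM only marks the folio, while the decision
between reporting and leaking stays inside guest_memfd.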
> > > bugs. Not TDX busy errors, demote failures, etc. If there are "normal"
> > > failures,
> > > like the ones that can be fixed with retries, then I think HWPoison is not a
> > > good option though.
> > >
> > > > there is a way to make 100%
> > > > sure all memory becomes re-usable by the rest of the host, using
> > > > tdx_buggy_shutdown(), wbinvd, etc?
> >
> > Not sure about this approach. When TDX module is buggy and the page is still
> > accessible to guest as private pages, even with no-more SEAMCALLs flag, is it
> > safe enough for guest_memfd/hugetlb to re-assign the page to allow
> > simultaneous
> > access in shared memory with potential private access from TD or TDX module?
>
> With the no more seamcall's approach it should be safe (for the system). This is
> essentially what we are doing for kexec.
AFAIK, kexec stops devices first by invoking each device's shutdown hook.
Similarly, "the no more seamcall's approach" should interact with devices to
avoid DMAs via private keys.