linux-kernel - Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <04d3e455d07042a0ab8e244e6462d9011c914581.camel@intel.com>
Date: Tue, 1 Jul 2025 22:37:21 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "ackerleytng@...gle.com" <ackerleytng@...gle.com>, "Zhao, Yan Y"
	<yan.y.zhao@...el.com>
CC: "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao"
	<xiaoyao.li@...el.com>, "Li, Zhiquan1" <zhiquan1.li@...el.com>, "Du, Fan"
	<fan.du@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
	"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com"
	<thomas.lendacky@....com>, "tabba@...gle.com" <tabba@...gle.com>,
	"vbabka@...e.cz" <vbabka@...e.cz>, "Shutemov, Kirill"
	<kirill.shutemov@...el.com>, "michael.roth@....com" <michael.roth@....com>,
	"seanjc@...gle.com" <seanjc@...gle.com>, "Weiny, Ira" <ira.weiny@...el.com>,
	"Peng, Chao P" <chao.p.peng@...el.com>, "binbin.wu@...ux.intel.com"
	<binbin.wu@...ux.intel.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
	"pbonzini@...hat.com" <pbonzini@...hat.com>, "kvm@...r.kernel.org"
	<kvm@...r.kernel.org>, "Annapurve, Vishal" <vannapurve@...gle.com>,
	"jroedel@...e.de" <jroedel@...e.de>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "Miao, Jun" <jun.miao@...el.com>,
	"pgonda@...gle.com" <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge
 pages

On Tue, 2025-07-01 at 14:48 -0700, Ackerley Tng wrote:
> Perhaps we had different understandings of f/g :P

Ah yes, I thought you were saying that guestmemfd would use poison internally
via some gmem_buggy_page() or similar. I guess I thought it is more of
guestmemfd's job. But as Yan pointed out, we need to handle non gmem page errors
too. Currently we leak, but it would be nice to keep the handling symmetrical.
Which would be easier if we did it all in TDX code.

> 
> I meant that TDX module should directly set the HWpoison flag on the
> folio (HugeTLB or 4K, guest_memfd or not), not call into guest_memfd.
> 
> guest_memfd will then check this flag when necessary, specifically:
> 
> * On faults, either into guest or host page tables 
> * When freeing the page
>     * guest_memfd will not return HugeTLB pages that are poisoned to
>       HugeTLB and just leak it
>     * 4K pages will be freed normally, because free_pages_prepare() will
>       check for HWpoison and skip freeing, from __folio_put() ->
>       free_frozen_pages() -> __free_frozen_pages() ->
>       free_pages_prepare()
> * I believe guest_memfd doesn't need to check HWpoison on conversions [1]
> 
> [1] https://lore.kernel.org/all/diqz5xghjca4.fsf@ackerleytng-ctop.c.googlers.com/

If a poisoned page continued to be used, it's a bit weird, no? It could take an
#MC for another reason from userspace and the handling code would see the page
flag is already set. If it doesn't already trip up some MM code somewhere, it
might put undue burden on the memory failure code to have to expect repeated
poisoning of the same memory.

> 
> > What about a kvm_gmem_buggy_cleanup() instead of the system wide one. KVM calls
> > it and then proceeds to bug the TD only from the KVM side. It's not as safe for
> > the system, because who knows what a buggy TDX module could do. But TDX module
> > could also be buggy without the kernel catching wind of it.
> > 
> > Having a single callback to basically bug the fd would solve the atomic context
> > issue. Then guestmemfd could dump the entire fd into memory_failure() instead of
> > returning the pages. And developers could respond by fixing the bug.
> > 
> 
> This could work too.
> 
> I'm in favor of buying into the HWpoison system though, since we're
> quite sure this is fair use of HWpoison.

Do you mean manually setting the poison flag, or calling into memory_failure(),
and friends? If we set them manually, we need to make sure that it does not have
side effects on the machine check handler. It seems risky/messy to me. But
Kirill didn't seem worried.

Maybe we could bring the poison page flag up to DavidH and see if there is any
concern before going down this path too far?

> 
> Are you saying kvm_gmem_buggy_cleanup() will just set the HWpoison flag
> on the parts of the folios in trouble?

I was saying kvm_gmem_buggy_cleanup() can set a bool on the fd, similar to
VM_BUG_ON() setting vm_dead. After an invalidate, if gmem see this, it needs to
assume everything failed, and invalidate everything and poison all guest memory.
The point was to have the simplest possible handling for a rare error. Although
it's only a proposal. The TDX emergency shutdown option may be simpler still.
But killing all TDs is not ideal. So thought we could at least consider other
options.

If we have a solution where TDX needs to do something complicated because
something of its specialness, it may get NAKed. This is my main concern with the
direction of this problem/solution. AFAICT, we are not even sure of a concrete
problem, and it appears to be special to TDX. So the complexity budget should be
small. It's in sharp contrast to the length of the discussion.