Message-ID: <a40d2c0105652dfcc01169775d6852bd4729c0a3.camel@intel.com>
Date: Mon, 30 Jun 2025 17:55:43 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "ackerleytng@...gle.com" <ackerleytng@...gle.com>, "Zhao, Yan Y"
	<yan.y.zhao@...el.com>
CC: "Shutemov, Kirill" <kirill.shutemov@...el.com>, "Li, Xiaoyao"
	<xiaoyao.li@...el.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Hansen,
 Dave" <dave.hansen@...el.com>, "david@...hat.com" <david@...hat.com>,
	"thomas.lendacky@....com" <thomas.lendacky@....com>, "tabba@...gle.com"
	<tabba@...gle.com>, "vbabka@...e.cz" <vbabka@...e.cz>,
	"quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "michael.roth@....com"
	<michael.roth@....com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "seanjc@...gle.com" <seanjc@...gle.com>,
	"Peng, Chao P" <chao.p.peng@...el.com>, "Du, Fan" <fan.du@...el.com>,
	"Yamahata, Isaku" <isaku.yamahata@...el.com>, "pbonzini@...hat.com"
	<pbonzini@...hat.com>, "binbin.wu@...ux.intel.com"
	<binbin.wu@...ux.intel.com>, "Weiny, Ira" <ira.weiny@...el.com>, "Li,
 Zhiquan1" <zhiquan1.li@...el.com>, "Annapurve, Vishal"
	<vannapurve@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun"
	<jun.miao@...el.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
	"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge
 pages

On Mon, 2025-06-30 at 19:13 +0800, Yan Zhao wrote:
> > > ok! Let's go with f/g. Unless Yan objects.
> I'm ok with f/g. But I have two implementation-specific questions:
> 
> 1. How should the HWPoison bit be set in TDX?
> 2. Should we set this bit for non-guest-memfd pages (e.g., for S-EPT pages)?

Argh, I guess we can keep the existing refcount-based approach for the other
types of TDX-owned pages?
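
For reference, a minimal sketch of what I mean by the refcount-based approach,
purely illustrative (the helper names below are placeholders, not the real
functions from the series):

#include <linux/mm.h>

/* Pin a TDX-owned control page while the TDX module may still write to it. */
static void tdx_hold_control_page(struct page *page)
{
	get_page(page);
}

/*
 * Drop the extra reference only if the TDX module successfully released the
 * page. On failure the page is intentionally leaked: better to keep it pinned
 * forever than to hand it back to the allocator while the module might still
 * touch it.
 */
static void tdx_release_control_page(struct page *page, bool reclaim_ok)
{
	if (reclaim_ok)
		put_page(page);
}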

> 
> TDX can't invoke memory_failure() when removal of guest private pages or
> S-EPT pages fails, because holding the write mmu_lock is regarded as atomic
> context. As memory_failure() takes a mutex,
> "BUG: sleeping function called from invalid context at kernel/locking/mutex.c"
> will be printed.
> 
> If TDX invokes memory_failure_queue() instead, it looks like guest_memfd can
> invoke memory_failure_queue_kick() to ensure the HWPoison bit is set in a
> timely manner. But which component could invoke memory_failure_queue_kick()
> for S-EPT pages? KVM?

Hmm, it only has a queue of 10 pages per-CPU. If something goes wrong in the TDX
module, I could see us exceeding that during a zap operation. At that point, how
much have we really handled it?
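
Just to make the mechanics concrete, a rough sketch of the queue-then-kick idea
(the function names are placeholders; memory_failure_queue() and
memory_failure_queue_kick() are the existing mm/ interfaces being discussed):

#include <linux/mm.h>
#include <linux/cpumask.h>

/* Called under the write mmu_lock (atomic context): only queue, never sleep. */
static void tdx_queue_hwpoison(unsigned long pfn)
{
	memory_failure_queue(pfn, 0);
}

/* Called later, from a sleepable context, to drain the per-CPU queues. */
static void tdx_kick_hwpoison_queues(void)
{
	int cpu;

	for_each_online_cpu(cpu)
		memory_failure_queue_kick(cpu);
}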


But, at the risk of derailing the solution when we are close, some reflection
has made me question whether this is all misprioritized. We are trying to handle
a case where a TDX module bug may return an error when we try to release gmem
pages. For that, this solution feels way too complex.

If there is a TDX module bug, a simpler way to handle it would be to fix the
bug. In the meantime the kernel can take simpler, more drastic efforts to
reclaim the memory and ensure system stability.

In the host kexec patches we need to handle a kexec while the TDX module is
running. The solution is to simply wbinvd on each pCPU that might have entered
the TDX module. After that, barring any new SEAMCALLs that could dirty
memory, the pages are free for the next kernel to use (at least on systems
without the partial write errata).

So for this we can do something similar. Have the arch/x86 side of TDX grow a
new tdx_buggy_shutdown(). Have it do an all-CPU IPI to kick CPUs out of
SEAM mode, wbinvd, and set a "no more seamcalls" bool. Then any SEAMCALLs after
that will return a TDX_BUGGY_SHUTDOWN error, or similar. All TDs in the system
die. Zap/cleanup paths return success in the buggy shutdown case.
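
Roughly, as a sketch only (tdx_buggy_shutdown(), TDX_BUGGY_SHUTDOWN and the
wrapper below are made-up names for the proposal, nothing existing, and the real
SEAMCALL entry points would all need the same check):

#include <linux/smp.h>
#include <asm/special_insns.h>
#include <asm/tdx.h>

static bool tdx_no_more_seamcalls;

/* Made-up error value, just for illustration. */
#define TDX_BUGGY_SHUTDOWN	0xdead0000dead0000ULL

static void tdx_flush_cpu(void *info)
{
	/*
	 * The IPI itself kicks the CPU out of SEAM mode; wbinvd makes the
	 * pages the TDX module may have dirtied safe to reuse (modulo the
	 * partial write errata).
	 */
	wbinvd();
}

void tdx_buggy_shutdown(void)
{
	/* Refuse all future SEAMCALLs... */
	WRITE_ONCE(tdx_no_more_seamcalls, true);
	/* ...then IPI every online CPU and flush its caches. */
	on_each_cpu(tdx_flush_cpu, NULL, 1);
}

/* Every SEAMCALL wrapper would check the flag first. */
static u64 tdx_seamcall_checked(u64 fn, struct tdx_module_args *args)
{
	if (READ_ONCE(tdx_no_more_seamcalls))
		return TDX_BUGGY_SHUTDOWN;

	return __seamcall(fn, args);
}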

Does it fit? Or, can you guys argue that the failures here are actually non-
special cases that are worth more complex recovery? I remember we talked about
IOMMU patterns that are similar, but it seems like the remaining cases under
discussion are about TDX bugs.
