linux-kernel - Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e9c9470a71ed2c1a5b3715cc8dd5fce79309c5cf.camel@intel.com>
Date: Tue, 1 Jul 2025 16:14:40 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "Annapurve, Vishal" <vannapurve@...gle.com>, "Zhao, Yan Y"
	<yan.y.zhao@...el.com>
CC: "Shutemov, Kirill" <kirill.shutemov@...el.com>, "Li, Xiaoyao"
	<xiaoyao.li@...el.com>, "Du, Fan" <fan.du@...el.com>, "Hansen, Dave"
	<dave.hansen@...el.com>, "david@...hat.com" <david@...hat.com>,
	"thomas.lendacky@....com" <thomas.lendacky@....com>, "tabba@...gle.com"
	<tabba@...gle.com>, "vbabka@...e.cz" <vbabka@...e.cz>, "kvm@...r.kernel.org"
	<kvm@...r.kernel.org>, "michael.roth@....com" <michael.roth@....com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"seanjc@...gle.com" <seanjc@...gle.com>, "Peng, Chao P"
	<chao.p.peng@...el.com>, "quic_eberman@...cinc.com"
	<quic_eberman@...cinc.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
	"ackerleytng@...gle.com" <ackerleytng@...gle.com>,
	"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>, "Weiny, Ira"
	<ira.weiny@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>, "Li,
 Zhiquan1" <zhiquan1.li@...el.com>, "jroedel@...e.de" <jroedel@...e.de>,
	"Miao, Jun" <jun.miao@...el.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
	"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge
 pages

On Tue, 2025-07-01 at 06:32 -0700, Vishal Annapurve wrote:
> On Tue, Jul 1, 2025 at 2:38 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > 
> > On Tue, Jul 01, 2025 at 01:55:43AM +0800, Edgecombe, Rick P wrote:
> > > So for this we can do something similar. Have the arch/x86 side of TDX grow a
> > > new tdx_buggy_shutdown(). Have it do an all-cpu IPI to kick CPUs out of
> > > SEAMMODE, wbivnd, and set a "no more seamcalls" bool. Then any SEAMCALLs after
> > > that will return a TDX_BUGGY_SHUTDOWN error, or similar. All TDs in the system
> > > die. Zap/cleanup paths return success in the buggy shutdown case.
> > All TDs in the system die could be too severe for unmap errors due to KVM bugs.
> 
> At this point, I don't see a way to quantify how bad a KVM bug can get
> unless you have explicit ideas about the severity. We should work on
> minimizing KVM side bugs too and assuming it would be a rare
> occurrence I think it's ok to take this intrusive measure.

Yes, it does seem on the line of "too severe". But keeping a list of pages to
release in a non-atomic context seems to complex for an error case that (still
not 100% clear) is theoretical.

In the argument of it's too severe, it's close to a BUG_ON() for the TDX side of
the kernel. But on the argument of it's not too severe, the system remains
stable.

> 
> > 
> > > Does it fit? Or, can you guys argue that the failures here are actually non-
> > > special cases that are worth more complex recovery? I remember we talked about
> > > IOMMU patterns that are similar, but it seems like the remaining cases under
> > > discussion are about TDX bugs.
> > I didn't mention TDX connect previously to avoid introducing unnecessary
> > complexity.
> > 
> > For TDX connect, S-EPT is used for private mappings in IOMMU. Unmap could
> > therefore fail due to pages being pinned for DMA.
> 
> We are discussing this scenario already[1], where the host will not
> pin the pages used by secure DMA for the same reasons why we can't
> have KVM pin the guest_memfd pages mapped in SEPT. Is there some other
> kind of pinning you are referring to?

I'm wondering about the "something went wrong and we can't invalidate" pattern.
Like the device refuses to cooperate.

> 
> If there is an ordering in which pages should be unmapped e.g. first
> in secure IOMMU and then KVM SEPT, then we can ensure the right
> ordering between invalidation callbacks from guest_memfd.
> 
> [1] https://lore.kernel.org/lkml/CAGtprH_qh8sEY3s-JucW3n1Wvoq7jdVZDDokvG5HzPf0HV2=pg@mail.gmail.com/#t

The general gist seems to be that guestmemfd should be the nerve center of these
decisions, and it should be given enough information to make the decisions to
invalidate only when success is guaranteed. Makes sense.

In this case we can't know the condition ahead of time. It is a TDX-only
problem? If it is, then we need to make TDX behave more like the others. Or have
simple to maintain cop-outs like this.

> 
> > 
> > So, my thinking was that if that happens, KVM could set a special flag to folios
> > pinned for private DMA.
> > 
> > Then guest_memfd could check the special flag before allowing private-to-shared
> > conversion, or punch hole.
> > guest_memfd could check this special flag and choose to poison or leak the
> > folio.
> > 
> > Otherwise, if we choose tdx_buggy_shutdown() to "do an all-cpu IPI to kick CPUs
> > out of SEAMMODE, wbivnd, and set a "no more seamcalls" bool", DMAs may still
> > have access to the private pages mapped in S-EPT.
> 
> guest_memfd will have to ensure that pages are unmapped from secure
> IOMMU pagetables before allowing them to be used by the host.
> 
> If secure IOMMU pagetables unmapping fails, I would assume it fails in
> the similar category of rare "KVM/TDX module/IOMMUFD" bug and I think
> it makes sense to do the same tdx_buggy_shutdown() with such failures
> as well.

It's too hypothetical to reason about. IMO, we need to know about specific
similar patterns to justify a more complex fine grained poisoning approach.