Message-ID: <c69ed125c25cd3b7f7400ed3ef9206cd56ebe3c9.camel@intel.com>
Date: Tue, 24 Jun 2025 22:00:56 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "ackerleytng@...gle.com" <ackerleytng@...gle.com>, "Zhao, Yan Y"
	<yan.y.zhao@...el.com>
CC: "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao"
	<xiaoyao.li@...el.com>, "Du, Fan" <fan.du@...el.com>, "Hansen, Dave"
	<dave.hansen@...el.com>, "david@...hat.com" <david@...hat.com>,
	"thomas.lendacky@....com" <thomas.lendacky@....com>, "tabba@...gle.com"
	<tabba@...gle.com>, "vbabka@...e.cz" <vbabka@...e.cz>, "Shutemov, Kirill"
	<kirill.shutemov@...el.com>, "michael.roth@....com" <michael.roth@....com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"seanjc@...gle.com" <seanjc@...gle.com>, "pbonzini@...hat.com"
	<pbonzini@...hat.com>, "binbin.wu@...ux.intel.com"
	<binbin.wu@...ux.intel.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
	"Peng, Chao P" <chao.p.peng@...el.com>, "Weiny, Ira" <ira.weiny@...el.com>,
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Annapurve, Vishal"
	<vannapurve@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun"
	<jun.miao@...el.com>, "Li, Zhiquan1" <zhiquan1.li@...el.com>,
	"pgonda@...gle.com" <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge
 pages

On Mon, 2025-06-23 at 15:48 -0700, Ackerley Tng wrote:
> Let me try and summarize the current state of this discussion:
> 
> Topic 1: Does TDX need to somehow indicate that it is using a page?
> 
> This patch series uses refcounts to indicate that TDX is using a page,
> but that complicates private-to-shared conversions.
> 
> During a private-to-shared conversion, the assumption is that
> guest_memfd is trusted to manage private memory. TDX and other users
> should trust guest_memfd to keep the memory around.
> 
> Yan's position is that holding a refcount is in line with how the IOMMU
> takes a refcount when a page is mapped into the IOMMU [1].
> 
> Yan had another suggestion, which is to indicate this using a page flag [2].
> 
> I think we're in agreement that we don't want to have TDX hold a
> refcount while the page is mapped into the Secure EPTs, but taking a
> step back, do we really need to indicate (at all) that TDX is using a
> page?
> 
> In [3] Yan said
> 
> > If TDX does not hold any refcount, guest_memfd has to know which
> > private page is still mapped. Otherwise, the page may be re-assigned
> > to other kernel components while it may still be mapped in the S-EPT.
> 
> If the private page is mapped for regular VM use as private memory,
> guest_memfd is managing that, and the same page will not be re-assigned
> to any other kernel component. guest_memfd does hold refcounts in
> guest_memfd's filemap.
> 
> If the private page is still mapped because there was an unmapping
> failure, we can discuss that separately under error handling in Topic 2.
> 
> With this, can I confirm that we are in agreement that TDX does not need
> to indicate that it is using a page, and can trust guest_memfd to keep
> the page around for the VM?

Minor correction here. Yan was concerned about *bugs* happening when freeing
pages that are accidentally still mapped in the S-EPT. My opinion is that this
is not especially risky here compared to other similar places, but it could be
helpful if there were a way to catch such bugs. The page flag, or page_ext,
direction came out of a discussion with Dave and Kirill. If it could run all
the time that would be great, but if not, a debug config could be sufficient,
for example something like CONFIG_PAGE_TABLE_CHECK. It doesn't need to support
vmemmap optimizations because the debug checking doesn't need to run all the
time. Overhead for debug settings is very normal.
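
Roughly the kind of thing I have in mind, purely as a sketch (the Kconfig
option, the page flag and all the helper names below are made up for
illustration):

#ifdef CONFIG_TDX_SEPT_DEBUG			/* hypothetical debug Kconfig */
/*
 * Mark a page while it is mapped in the S-EPT. Could be a real page flag,
 * or a bit in page_ext if page flag space is too tight.
 */
static inline void tdx_sept_debug_map(struct page *page)
{
	SetPageTdxSept(page);			/* hypothetical flag helper */
}

static inline void tdx_sept_debug_unmap(struct page *page)
{
	ClearPageTdxSept(page);			/* hypothetical flag helper */
}

/*
 * Hooked into the page freeing path, in the same spirit as the checks
 * CONFIG_PAGE_TABLE_CHECK does when pages are freed: freeing a page that
 * is still marked as mapped in the S-EPT means a bug happened further up.
 */
static inline void tdx_sept_debug_check_free(struct page *page)
{
	VM_WARN_ON_ONCE(PageTdxSept(page));
}
#else
static inline void tdx_sept_debug_map(struct page *page) { }
static inline void tdx_sept_debug_unmap(struct page *page) { }
static inline void tdx_sept_debug_check_free(struct page *page) { }
#endif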

> 
> Topic 2: How to handle unmapping/splitting errors arising from TDX?
> 
> Previously I was in favor of having unmap() return an error (Rick
> suggested doing a POC, and in a more recent email Rick asked for a
> diffstat), but Vishal and I talked about this, and now I agree that
> having unmap() return an error is not a good approach, for the
> following reasons.

Ok, let's close this option then.

> 
> 1. Unmapping takes a range, and within the range there could be more
>    than one unmapping error. I was previously thinking that unmap()
>    could return 0 for success and the failed PFN on error. Returning a
>    single PFN on error is okay-ish, but if there are more errors it could
>    get complicated.
> 
>    Another error return option could be to return the folio where the
>    unmapping/splitting issue happened, but that would not be
>    sufficiently precise, since a folio could be larger than 4K and we
>    want to track errors as precisely as we can to reduce memory loss due
>    to errors.
> 
> 2. What I think Yan has been trying to say: unmap() returning an error
>    is non-standard in the kernel.
> 
> I think (1) is the dealbreaker here and there's no need to do the
> plumbing POC and diffstat.
> 
> So I think we're all in support of indicating unmapping/splitting issues
> without returning anything from unmap(), and the discussed options are
> 
> a. Refcounts: won't work - mostly discussed in this (sub-)thread
>    [3]. Using refcounts makes it impossible to distinguish between
>    transient refcounts and refcounts due to errors.
> 
> b. Page flags: won't work with/can't benefit from HVO.

As above, this was for the purpose of catching bugs, not for guest_memfd to
logically depend on it.

> 
> Suggestions still in the running:
> 
> c. Folio flags are not precise enough to indicate which page actually
>    had an error, but this could be sufficient if we're willing to just
>    waste the rest of the huge page on unmapping error.

For the scenario of a TDX module bug, that seems ok to me.

> 
> d. Folio flags with folio splitting on error. This means that on
>    unmapping/Secure EPT PTE splitting error, we have to split the
>    (larger than 4K) folio to 4K, and then set a flag on the split folio.
> 
>    The issue I see with this is that splitting pages with HVO applied
>    means doing allocations, and in an error scenario there may not be
>    memory left to split the pages.
> 
> e. Some other data structure in guest_memfd, say, a linked list, and a
>    function like kvm_gmem_add_error_pfn(struct page *page) that would
>    look up the guest_memfd inode from the page and add the page's pfn to
>    the linked list.
> 
>    Everywhere in guest_memfd that does unmapping/splitting would then
>    check this linked list to see if the unmapping/splitting
>    succeeded.
> 
>    Everywhere in guest_memfd that allocates pages will also check this
>    linked list to make sure the pages are functional.
> 
>    When guest_memfd truncates, if the page being truncated is on the
>    list, retain the refcount on the page and leak that page.

I think this is a fine option.
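
For the record, roughly how I'd picture it, just as an illustration
(everything except the kvm_gmem_add_error_pfn() name from your mail is made
up, including the kvm_gmem_priv() accessor and where the list/lock live):

struct kvm_gmem_error_pfn {
	struct list_head list;
	unsigned long pfn;
};

/*
 * Record a pfn whose unmap/split failed, keyed off the guest_memfd inode.
 * Assumes guest_memfd keeps a list_head plus spinlock in its inode private
 * data, reachable via a (made-up) kvm_gmem_priv() accessor.
 */
int kvm_gmem_add_error_pfn(struct page *page)
{
	struct inode *inode = page->mapping->host;
	struct kvm_gmem_priv *priv = kvm_gmem_priv(inode);
	struct kvm_gmem_error_pfn *entry;

	entry = kmalloc(sizeof(*entry), GFP_KERNEL);
	if (!entry)
		return -ENOMEM;

	entry->pfn = page_to_pfn(page);

	spin_lock(&priv->error_lock);
	list_add(&entry->list, &priv->error_pfns);
	spin_unlock(&priv->error_lock);

	return 0;
}

/* The truncation and allocation paths would then check the list: */
bool kvm_gmem_pfn_has_error(struct inode *inode, unsigned long pfn)
{
	struct kvm_gmem_priv *priv = kvm_gmem_priv(inode);
	struct kvm_gmem_error_pfn *entry;
	bool found = false;

	spin_lock(&priv->error_lock);
	list_for_each_entry(entry, &priv->error_pfns, list) {
		if (entry->pfn == pfn) {
			found = true;
			break;
		}
	}
	spin_unlock(&priv->error_lock);

	return found;
}

One wrinkle is that the kmalloc() above can itself fail in exactly the kind
of situation where memory is tight, so the entries might need to be
preallocated, similar to the concern you raised for option d.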

> 
> f. Combination of c and e, something similar to HugeTLB's
>    folio_set_hugetlb_hwpoison(), which sets a flag AND adds the pages in
>    trouble to a linked list on the folio.
> 
> g. Like f, but basically treat an unmapping error as hardware poisoning.
> 
> I'm kind of inclined towards g, to just treat unmapping errors as
> HWPOISON and buy into all the HWPOISON handling requirements. What do
> y'all think? Can a TDX unmapping error be considered memory poisoning?

What does HWPOISON bring over refcounting the page/folio so that it never
returns to the page allocator? We are bugging the TD in these cases. Ohhh... Is
this about the code to allow gmem fds to be handed to new VMs?
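
To be clear, by "refcounting" above I just mean something like the below
(sketch only, the function name is made up):

/*
 * On an unrecoverable S-EPT unmap failure, take an extra folio reference
 * and never drop it. Once guest_memfd's own filemap reference goes away at
 * truncation time, the folio still has a non-zero refcount, so it is leaked
 * rather than handed back to the page allocator.
 */
static void tdx_leak_folio_on_unmap_error(struct folio *folio)
{
	folio_get(folio);

	pr_warn("tdx: leaking folio at pfn 0x%lx after S-EPT unmap failure\n",
		folio_pfn(folio));
}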
