Message-ID: <aFp7j+2VBhHZiKx/@yzhao56-desk.sh.intel.com>
Date: Tue, 24 Jun 2025 18:18:55 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Ackerley Tng <ackerleytng@...gle.com>
CC: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
"quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao"
<xiaoyao.li@...el.com>, "Shutemov, Kirill" <kirill.shutemov@...el.com>,
"Hansen, Dave" <dave.hansen@...el.com>, "david@...hat.com"
<david@...hat.com>, "thomas.lendacky@....com" <thomas.lendacky@....com>,
"vbabka@...e.cz" <vbabka@...e.cz>, "tabba@...gle.com" <tabba@...gle.com>,
"Du, Fan" <fan.du@...el.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "seanjc@...gle.com" <seanjc@...gle.com>,
"Weiny, Ira" <ira.weiny@...el.com>, "pbonzini@...hat.com"
<pbonzini@...hat.com>, "binbin.wu@...ux.intel.com"
<binbin.wu@...ux.intel.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
"michael.roth@....com" <michael.roth@....com>, "Peng, Chao P"
<chao.p.peng@...el.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"Annapurve, Vishal" <vannapurve@...gle.com>, "jroedel@...e.de"
<jroedel@...e.de>, "Miao, Jun" <jun.miao@...el.com>, "Li, Zhiquan1"
<zhiquan1.li@...el.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge
pages
On Mon, Jun 23, 2025 at 03:48:48PM -0700, Ackerley Tng wrote:
> Ackerley Tng <ackerleytng@...gle.com> writes:
>
> > Yan Zhao <yan.y.zhao@...el.com> writes:
> >
> >> On Wed, Jun 18, 2025 at 08:41:38AM +0800, Edgecombe, Rick P wrote:
> >>> On Wed, 2025-06-18 at 08:19 +0800, Yan Zhao wrote:
> >>> > > I don't think a potential bug in KVM is a good enough reason. If we are
> >>> > > concerned, can we think about a warning instead?
> >>> > >
> >>> > > We had talked in the past about enhancing KASAN to know when a page is
> >>> > > mapped into the S-EPT. So rather than designing around potential bugs, we
> >>> > > could focus on having a simpler implementation with the infrastructure to
> >>> > > catch and fix the bugs.
> >>> > However, if failing to remove a guest private page would only cause a
> >>> > memory leak, it's fine.
> >>> > If TDX does not hold any refcount, guest_memfd has to know which private
> >>> > page is still mapped. Otherwise, the page may be re-assigned to other
> >>> > kernel components while it is still mapped in the S-EPT.
> >>>
> >>> KASAN detects use-after-frees like that. However, the TDX module code is not
> >>> instrumented. It won't check against the KASAN state for its accesses.
> >>>
> >>> I had a brief chat about this with Dave and Kirill. A couple ideas were
> >>> discussed. One was to use page_ext to keep a flag that says the page is in-use
> >> Thanks!
> >>
> >> To use page_ext, should we introduce a new flag PAGE_EXT_FIRMWARE_IN_USE,
> >> similar to PAGE_EXT_YOUNG?
> >>
> >> Due to similar issues as those with normal page/folio flags (see the next
> >> comment for details), TDX needs to set PAGE_EXT_FIRMWARE_IN_USE on a
> >> page-by-page basis rather than folio-by-folio.
> >>
> >> Additionally, it seems reasonable for guest_memfd not to copy the
> >> PAGE_EXT_FIRMWARE_IN_USE flag when splitting a huge folio?
> >> (in __folio_split() --> split_folio_to_order(), PAGE_EXT_YOUNG and
> >> PAGE_EXT_IDLE are copied to the new folios though).
> >>
> >> Furthermore, page_ext uses extra memory. With CONFIG_64BIT, should we instead
> >> introduce a PG_firmware_in_use page flag, similar to PG_young and PG_idle?
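For reference, roughly what I had in mind for the page_ext variant (an
untested sketch; PAGE_EXT_FIRMWARE_IN_USE and the helper are hypothetical,
while page_ext_get()/page_ext_put() are the existing accessors):

enum page_ext_flags {
	/* ... existing flags ... */
	PAGE_EXT_FIRMWARE_IN_USE,	/* hypothetical new flag */
};

/* Hypothetical helper, to be called by the SEAMCALL wrappers. */
static void set_page_firmware_in_use(struct page *page)
{
	struct page_ext *page_ext = page_ext_get(page);

	if (!page_ext)
		return;

	set_bit(PAGE_EXT_FIRMWARE_IN_USE, &page_ext->flags);
	page_ext_put(page_ext);
}

A corresponding test in the page-free path could then generate the warning
Rick mentioned when a page still in use by the TDX module makes it back to
the page allocator.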
> >>
>
> I think neither page flags nor page_ext will work for us, but see below.
>
> >>> by the TDX module. There was also some discussion of using a normal page flag,
> >>> and that the reserved page flag might prevent some of the MM operations that
> >>> would be needed on guestmemfd pages. I didn't see the problem when I looked.
> >>>
> >>> For the solution, basically the SEAMCALL wrappers set a flag when they hand a
> >>> page to the TDX module, and clear it when they successfully reclaim it via
> >>> tdh_mem_page_remove() or tdh_phymem_page_reclaim(). Then if the page makes it
> >>> back to the page allocator, a warning is generated.
> >> After some testing, to use a normal page flag, we may need to set it on a
> >> page-by-page basis rather than folio-by-folio. See "Scheme 1".
> >> And guest_memfd may need to selectively copy page flags when splitting huge
> >> folios. See "Scheme 2".
> >>
> >> Scheme 1: Set/unset page flag on folio-by-folio basis, i.e.
> >> - set folio reserved at tdh_mem_page_aug(), tdh_mem_page_add(),
> >> - unset folio reserved after a successful tdh_mem_page_remove() or
> >> tdh_phymem_page_reclaim().
> >>
> >> It has a problem in the following scenario:
> >> 1. tdh_mem_page_aug() adds a 2MB folio. It marks the folio as reserved
> >> via "folio_set_reserved(page_folio(page))"
> >>
> >> 2. convert a 4KB page of the 2MB folio to shared.
> >> 2.1 tdh_mem_page_demote() is executed first.
> >>
> >> 2.2 tdh_mem_page_remove() then removes the 4KB mapping.
> >> "folio_clear_reserved(page_folio(page))" clears the reserved flag
> >> for the 2MB folio while the remaining 511 pages are still mapped
> >> in the S-EPT.
> >>
> >> 2.3. guest_memfd splits the 2MB folio into 512 4KB folios.
> >>
> >>
>
> Folio flags on their own won't work because they're not precise
> enough. A folio can be multiple 4K pages, and if a 4K page had failed to
> unmap, we want to be able to indicate which 4K page had the failure,
> instead of the entire folio. (But see below)
>
> >> Scheme 2: Set/unset page flag on page-by-page basis, i.e.
> >> - set page flag reserved at tdh_mem_page_aug(), tdh_mem_page_add(),
> >> - unset page flag reserved after a successful tdh_mem_page_remove() or
> >> tdh_phymem_page_reclaim().
> >>
> >> It has a problem in the following scenario:
> >> 1. tdh_mem_page_aug() adds a 2MB folio. It marks pages as reserved by
> >> invoking "SetPageReserved()" on each page.
> >> Since folio->flags is page[0]->flags, the folio also has the
> >> reserved flag set.
> >>
> >> 2. convert a 4KB page of the 2MB folio to shared; say it's page[4].
> >> 2.1 tdh_mem_page_demote() is executed first.
> >>
> >> 2.2 tdh_mem_page_remove() then removes the 4KB mapping.
> >> "ClearPageReserved()" clears the reserved flag of page[4] of the
> >> 2MB folio.
> >>
> >> 2.3. guest_memfd splits the 2MB folio into 512 4KB folios.
> >> In guestmem_hugetlb_split_folio(), "p->flags = folio->flags" marks
> >> page[4] as reserved again, since page[0] is still reserved.
> >>
> >> (see the code in https://lore.kernel.org/all/2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com/
> >> for (i = 1; i < orig_nr_pages; ++i) {
> >> struct page *p = folio_page(folio, i);
> >>
> >> /* Copy flags from the first page to split pages. */
> >> p->flags = folio->flags;
> >>
> >> p->mapping = NULL;
> >> clear_compound_head(p);
> >> }
> >> )
> >>
>
> Per-page flags won't work because we want to retain HugeTLB Vmemmap
> Optimization (HVO), which allows subsequent (identical) struct pages to
> alias to each other. If we use a per-page flag, then HVO would break
> since struct pages would no longer be identical to each other.
Ah, I overlooked HVO.
In my testing, neither CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON nor the
hugetlb_free_vmemmap kernel parameter was set.
With HVO enabled, setting page flags on a per-page basis indeed does not work.
>
> >> [...]
>
> Let me try and summarize the current state of this discussion:
Thanks for this summary.
> Topic 1: Does TDX need to somehow indicate that it is using a page?
>
> This patch series uses refcounts to indicate that TDX is using a page,
> but that complicates private-to-shared conversions.
>
> During a private-to-shared conversion, guest_memfd assumes that it is
> trusted to manage private memory. TDX and other users should trust
> guest_memfd to keep the memory around.
>
> Yan's position is that holding a refcount is in line with how IOMMU
> takes a refcount when a page is mapped into the IOMMU [1].
>
> Yan had another suggestion, which is to indicate using a page flag [2].
>
> I think we're in agreement that we don't want to have TDX hold a
> refcount while the page is mapped into the Secure EPTs, but taking a
> step back, do we really need to indicate (at all) that TDX is using a
> page?
>
> In [3] Yan said
>
> > If TDX does not hold any refcount, guest_memfd has to know which private
> > page is still mapped. Otherwise, the page may be re-assigned to other
> > kernel components while it is still mapped in the S-EPT.
>
> If the private page is mapped for regular VM use as private memory,
> guest_memfd is managing that, and the same page will not be re-assigned
> to any other kernel component. guest_memfd does hold refcounts in
> guest_memfd's filemap.
After kvm_gmem_release(), guest_memfd will return folios to hugetlb, so the same
page could be re-assigned to other kernel components that allocate pages from
hugetlb.
>
> If the private page is still mapped because there was an unmapping
> failure, we can discuss that separately under error handling in Topic 2.
>
> With this, can I confirm that we are in agreement that TDX does not need
> to indicate that it is using a page, and can trust guest_memfd to keep
> the page around for the VM?
I thought it wasn't a must until I came across a comment from Sean:
"Should these bail early if the KVM_BUG_ON() is hit? Calling into the TDX module
after bugging the VM is a bit odd."
https://lore.kernel.org/kvm/Z4r_XNcxPWpgjZio@google.com/#t.
This comment refers to the following scenario:
when a 2MB non-leaf entry in the mirror root is zapped with the shared mmu_lock,
the KVM_BUG_ON() will be triggered for TDX. But by the time handle_removed_pt()
is reached, the 2MB non-leaf entry would have already been successfully removed
from the mirror root.
Bailing out early in remove_external_spte() would then prevent the later removal
of the 4KB private guest pages from the S-EPT, due to the lack of a
corresponding entry in the mirror root.
Since the KVM MMU does not hold a refcount on guest pages, failing to notify TDX
about the removal of a guest page could result in a page that is still mapped
in the S-EPT being freed and re-allocated by the OS.
Therefore, indicating that TDX is using a page can be less error-prone, though
it does consume more memory.
> Topic 2: How to handle unmapping/splitting errors arising from TDX?
>
> Previously I was in favor of having unmap() return an error (Rick
> suggested doing a POC, and in a more recent email Rick asked for a
> diffstat), but Vishal and I talked about this and now I agree that
> having unmap() return an error is not a good approach, for these reasons:
>
> 1. Unmapping takes a range, and within the range there could be more
> than one unmapping error. I was previously thinking that unmap()
> could return 0 for success and the failed PFN on error. Returning a
> single PFN on error is okay-ish but if there are more errors it could
> get complicated.
>
> Another error return option could be to return the folio where the
> unmapping/splitting issue happened, but that would not be
> sufficiently precise, since a folio could be larger than 4K and we
> want to track errors as precisely as we can to reduce memory loss due
> to errors.
>
> 2. What I think Yan has been trying to say: unmap() returning an error
> is non-standard in the kernel.
>
> I think (1) is the dealbreaker here and there's no need to do the
> plumbing POC and diffstat.
>
> So I think we're all in support of indicating unmapping/splitting issues
> without returning anything from unmap(), and the discussed options are
>
> a. Refcounts: won't work - mostly discussed in this (sub-)thread
> [3]. Using refcounts makes it impossible to distinguish between
> transient refcounts and refcounts due to errors.
>
> b. Page flags: won't work with/can't benefit from HVO.
>
> Suggestions still in the running:
>
> c. Folio flags are not precise enough to indicate which page actually
> had an error, but this could be sufficient if we're willing to just
> waste the rest of the huge page on an unmapping error.
For 1GB folios, more precise info would be better.
> d. Folio flags with folio splitting on error. This means that on
> unmapping/Secure EPT PTE splitting error, we have to split the
> (larger than 4K) folio to 4K, and then set a flag on the split folio.
>
> The issue I see with this is that splitting pages with HVO applied
> means doing allocations, and in an error scenario there may not be
> memory left to split the pages.
Could we restore the page structures (i.e., undo HVO) before triggering the
unmap?
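If the restore happened first, the allocation needed to undo HVO would occur
before any S-EPT state is changed, where a failure can still be handled
cleanly. Roughly (a sketch only; hugetlb_vmemmap_restore_folio() is the
existing HVO-undo helper, the surrounding code is hand-waved):

	/*
	 * Sketch: undo HVO up front so that the per-page struct pages
	 * become writable again. hugetlb_vmemmap_restore_folio() may fail
	 * with -ENOMEM, but at this point the conversion can still be
	 * aborted cleanly, before anything is unmapped from the S-EPT.
	 */
	ret = hugetlb_vmemmap_restore_folio(h, folio);
	if (ret)
		return ret;

	/* ... proceed with the actual unmapping/splitting ... */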
>
> e. Some other data structure in guest_memfd, say, a linked list, and a
> function like kvm_gmem_add_error_pfn(struct page *page) that would
> look up the guest_memfd inode from the page and add the page's pfn to
> the linked list.
>
> Everywhere in guest_memfd that does unmapping/splitting would then
> check this linked list to see if the unmapping/splitting
> succeeded.
>
> Everywhere in guest_memfd that allocates pages will also check this
> linked list to make sure the pages are functional.
>
> When guest_memfd truncates, if the page being truncated is on the
> list, retain the refcount on the page and leak that page.
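Just to confirm the shape of e, something like below? (a rough sketch; the
struct, the lock/list accessors and the GFP choice are all made up):

struct kvm_gmem_error_pfn {
	struct list_head link;
	unsigned long pfn;
};

/* Called from the unmapping path when TDX fails to unmap a page. */
void kvm_gmem_add_error_pfn(struct page *page)
{
	struct folio *folio = page_folio(page);
	struct inode *inode = folio->mapping->host;
	struct kvm_gmem_error_pfn *e;

	e = kmalloc(sizeof(*e), GFP_ATOMIC);
	if (!e)
		return;	/* needs a fallback, e.g. mark the whole folio bad */

	e->pfn = page_to_pfn(page);

	spin_lock(gmem_error_lock(inode));		/* hypothetical */
	list_add(&e->link, gmem_error_list(inode));	/* hypothetical */
	spin_unlock(gmem_error_lock(inode));
}

The truncation and allocation paths would then walk the list under the same
lock.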
>
> f. Combination of c and e, something similar to HugeTLB's
> folio_set_hugetlb_hwpoison(), which sets a flag AND adds the pages in
> trouble to a linked list on the folio.
That seems like a good idea. If memory allocation for the linked list succeeds,
mark the pages within a folio as troublesome; otherwise, mark the entire folio
as troublesome.
But maybe c is good enough for 2MB folios.
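For comparison, an analogue of HugeTLB's struct raw_hwp_page (see
folio_set_hugetlb_hwpoison() in mm/memory-failure.c) might look like this
(untested sketch; all names besides the HugeTLB ones are made up):

struct raw_error_page {
	struct llist_node node;
	struct page *page;
};

/* Record that a single page within @folio failed to unmap. */
static int folio_add_unmap_error(struct folio *folio, struct page *page)
{
	struct raw_error_page *e;

	e = kmalloc(sizeof(*e), GFP_ATOMIC);
	if (!e) {
		/*
		 * Allocation failed: treat the whole folio as bad, like
		 * HugeTLB's RawHwpUnreliable fallback.
		 */
		folio_set_unmap_error_unreliable(folio);	/* hypothetical */
		return -ENOMEM;
	}

	e->page = page;
	llist_add(&e->node, folio_unmap_error_list(folio));	/* hypothetical */
	return 0;
}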
> g. Like f, but basically treat an unmapping error as hardware poisoning.
Not sure if the hwpoison bit can be used directly.
Further investigation is needed.
> I'm kind of inclined towards g, to just treat unmapping errors as
> HWPOISON and buy into all the HWPOISON handling requirements. What do
> y'all think? Can a TDX unmapping error be considered as memory poisoning?
>
>
> [1] https://lore.kernel.org/all/aE%2F1TgUvr0dcaJUg@yzhao56-desk.sh.intel.com/
> [2] https://lore.kernel.org/all/aFkeBtuNBN1RrDAJ@yzhao56-desk.sh.intel.com/
> [3] https://lore.kernel.org/all/aFIGFesluhuh2xAS@yzhao56-desk.sh.intel.com/
> [3] https://lore.kernel.org/all/aFJjZFFhrMWEPjQG@yzhao56-desk.sh.intel.com/