linux-kernel - Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aFugbDVJJDrK4n9V@yzhao56-desk.sh.intel.com>
Date: Wed, 25 Jun 2025 15:08:28 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Ackerley Tng <ackerleytng@...gle.com>
CC: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
	"quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao"
	<xiaoyao.li@...el.com>, "Du, Fan" <fan.du@...el.com>, "Hansen, Dave"
	<dave.hansen@...el.com>, "david@...hat.com" <david@...hat.com>,
	"thomas.lendacky@....com" <thomas.lendacky@....com>, "tabba@...gle.com"
	<tabba@...gle.com>, "vbabka@...e.cz" <vbabka@...e.cz>, "Shutemov, Kirill"
	<kirill.shutemov@...el.com>, "michael.roth@....com" <michael.roth@....com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"seanjc@...gle.com" <seanjc@...gle.com>, "pbonzini@...hat.com"
	<pbonzini@...hat.com>, "binbin.wu@...ux.intel.com"
	<binbin.wu@...ux.intel.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
	"Peng, Chao P" <chao.p.peng@...el.com>, "Weiny, Ira" <ira.weiny@...el.com>,
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Annapurve, Vishal"
	<vannapurve@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun"
	<jun.miao@...el.com>, "Li, Zhiquan1" <zhiquan1.li@...el.com>,
	"pgonda@...gle.com" <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge
 pages

On Tue, Jun 24, 2025 at 04:30:57PM -0700, Ackerley Tng wrote:
> "Edgecombe, Rick P" <rick.p.edgecombe@...el.com> writes:
> 
> > On Mon, 2025-06-23 at 15:48 -0700, Ackerley Tng wrote:
> >> Let me try and summarize the current state of this discussion:
> >> 
> >> Topic 1: Does TDX need to somehow indicate that it is using a page?
> >> 
> >> This patch series uses refcounts to indicate that TDX is using a page,
> >> but that complicates private-to-shared conversions.
> >> 
> >> During a private-to-shared conversion, guest_memfd assumes that
> >> guest_memfd is trusted to manage private memory. TDX and other users
> >> should trust guest_memfd to keep the memory around.
> >> 
> >> Yan's position is that holding a refcount is in line with how IOMMU
> >> takes a refcount when a page is mapped into the IOMMU [1].
> >> 
> >> Yan had another suggestion, which is to indicate using a page flag [2].
> >> 
> >> I think we're in agreement that we don't want to have TDX hold a
> >> refcount while the page is mapped into the Secure EPTs, but taking a
> >> step back, do we really need to indicate (at all) that TDX is using a
> >> page?
> >> 
> >> In [3] Yan said
> >> 
> >> > If TDX does not hold any refcount, guest_memfd has to know that which
> >> > private
> >> > page is still mapped. Otherwise, the page may be re-assigned to other
> >> > kernel
> >> > components while it may still be mapped in the S-EPT.
> >> 
> >> If the private page is mapped for regular VM use as private memory,
> >> guest_memfd is managing that, and the same page will not be re-assigned
> >> to any other kernel component. guest_memfd does hold refcounts in
> >> guest_memfd's filemap.
> >> 
> >> If the private page is still mapped because there was an unmapping
> >> failure, we can discuss that separately under error handling in Topic 2.
> >> 
> >> With this, can I confirm that we are in agreement that TDX does not need
> >> to indicate that it is using a page, and can trust guest_memfd to keep
> >> the page around for the VM?
> >
> > Minor correction here. Yan was concerned about *bugs* happening when freeing
> > pages that are accidentally still mapped in the S-EPT. My opinion is that this
> > is not especially risky to happen here vs other similar places, but it could be
> > helpful if there was a way to catch such bugs. The page flag, or page_ext
> > direction came out of a discussion with Dave and Kirill. If it could run all the
> > time that would be great, but if not a debug config could be sufficient. For
> > example like CONFIG_PAGE_TABLE_CHECK. It doesn't need to support vmemmap
> > optimizations because the debug checking doesn't need to run all the time.
> > Overhead for debug settings is very normal.
> >
> 
> I see, let's call debug checking Topic 3 then, to separate it from Topic
> 1, which is TDX indicating that it is using a page for production
> kernels.
> 
> Topic 3: How should TDX indicate use of a page for debugging?
> 
> I'm okay if for debugging, TDX uses anything other than refcounts for
> checking, because refcounts will interfere with conversions.
> 
> Rick's other email is correct. The correct link should be
> https://lore.kernel.org/all/aFJjZFFhrMWEPjQG@yzhao56-desk.sh.intel.com/.
> 
> [INTERFERE WITH CONVERSIONS]
> 
> To summarize, if TDX uses refcounts to indicate that it is using a page,
> or to indicate anything else, then we cannot easily split a page on
> private to shared conversions.
> 
> Specifically, consider the case where only the x-th subpage of a huge
> folio is mapped into Secure-EPTs. When the guest requests to convert
> some subpage to shared, the huge folio has to be split for
> core-mm. Core-mm, which will use the shared page, must have split folios
> to be able to accurately and separately track refcounts for subpages.
> 
> During splitting, guest_memfd would see refcount of 512 (for 2M page
> being in the filemap) + 1 (if TDX indicates that the x-th subpage is
> mapped using a refcount), but would not be able to tell that the 513th
> refcount belongs to the x-th subpage. guest_memfd can't split the huge
> folio unless it knows how to distribute the 513th refcount.
In my POC, https://lore.kernel.org/all/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com/
kvm_gmem_private_has_safe_refcount() was introduce to check the folio ref count.
It rejects private-to-shared conversion after splitting and unmapping KVM's
secondary page tables if the refcount exceeds a valid threshold.

Though in
https://lore.kernel.org/all/aFJjZFFhrMWEPjQG@yzhao56-desk.sh.intel.com/, we
agreed that "EAGAIN is not the right code in case of "extra" refcounts held by
TDX", this does not imply that rejecting the conversion itself is incorrect.

This is why we are exploring alternative solutions instead of having TDX hold
the page refcount.

So, either a per-page flag, per-folio flag or solutions e,f,g should be good.

IMO, regardless of the final choice, guest_memfd needs to identify problematic
folios to:
- reject the private-to-shared conversion
- prevent further recycling after kvm_gmem_free_folio()

> One might say guest_memfd could clear all the refcounts that TDX is
> holding on the huge folio by unmapping the entire huge folio from the
> Secure-EPTs, but unmapping the entire huge folio for TDX means zeroing
> the contents and requiring guest re-acceptance. Both of these would mess
> up guest operation.
> 
> Hence, guest_memfd's solution is to require that users of guest_memfd
> for private memory trust guest_memfd to maintain the pages around and
> not take any refcounts.
> 
> So back to Topic 1, for production kernels, is it okay that TDX does not
> need to indicate that it is using a page, and can trust guest_memfd to
> keep the page around for the VM?
If the "TDX does not need to indicate that it is using a page" means "do not
take page refcount", I'm ok.

> >> 
> >> Topic 2: How to handle unmapping/splitting errors arising from TDX?
> >> 
> >> Previously I was in favor of having unmap() return an error (Rick
> >> suggested doing a POC, and in a more recent email Rick asked for a
> >> diffstat), but Vishal and I talked about this and now I agree having
> >> unmapping return an error is not a good approach for these reasons.
> >
> > Ok, let's close this option then.
> >
> >> 
> >> 1. Unmapping takes a range, and within the range there could be more
> >>    than one unmapping error. I was previously thinking that unmap()
> >>    could return 0 for success and the failed PFN on error. Returning a
> >>    single PFN on error is okay-ish but if there are more errors it could
> >>    get complicated.
> >> 
> >>    Another error return option could be to return the folio where the
> >>    unmapping/splitting issue happened, but that would not be
> >>    sufficiently precise, since a folio could be larger than 4K and we
> >>    want to track errors as precisely as we can to reduce memory loss due
> >>    to errors.
> >> 
> >> 2. What I think Yan has been trying to say: unmap() returning an error
> >>    is non-standard in the kernel.
> >> 
> >> I think (1) is the dealbreaker here and there's no need to do the
> >> plumbing POC and diffstat.
> >> 
> >> So I think we're all in support of indicating unmapping/splitting issues
> >> without returning anything from unmap(), and the discussed options are
> >> 
> >> a. Refcounts: won't work - mostly discussed in this (sub-)thread
> >>    [3]. Using refcounts makes it impossible to distinguish between
> >>    transient refcounts and refcounts due to errors.
> >> 
> >> b. Page flags: won't work with/can't benefit from HVO.
> >
> > As above, this was for the purpose of catching bugs, not for guestmemfd to
> > logically depend on it.
> >
> >> 
> >> Suggestions still in the running:
> >> 
> >> c. Folio flags are not precise enough to indicate which page actually
> >>    had an error, but this could be sufficient if we're willing to just
> >>    waste the rest of the huge page on unmapping error.
> >
> > For a scenario of TDX module bug, it seems ok to me.
> >
> >> 
> >> d. Folio flags with folio splitting on error. This means that on
> >>    unmapping/Secure EPT PTE splitting error, we have to split the
> >>    (larger than 4K) folio to 4K, and then set a flag on the split folio.
> >> 
> >>    The issue I see with this is that splitting pages with HVO applied
> >>    means doing allocations, and in an error scenario there may not be
> >>    memory left to split the pages.
> >> 
> >> e. Some other data structure in guest_memfd, say, a linked list, and a
> >>    function like kvm_gmem_add_error_pfn(struct page *page) that would
> >>    look up the guest_memfd inode from the page and add the page's pfn to
> >>    the linked list.
> >> 
> >>    Everywhere in guest_memfd that does unmapping/splitting would then
> >>    check this linked list to see if the unmapping/splitting
> >>    succeeded.
> >> 
> >>    Everywhere in guest_memfd that allocates pages will also check this
> >>    linked list to make sure the pages are functional.
> >> 
> >>    When guest_memfd truncates, if the page being truncated is on the
> >>    list, retain the refcount on the page and leak that page.
> >
> > I think this is a fine option.
> >
> >> 
> >> f. Combination of c and e, something similar to HugeTLB's
> >>    folio_set_hugetlb_hwpoison(), which sets a flag AND adds the pages in
> >>    trouble to a linked list on the folio.
> >> 
> >> g. Like f, but basically treat an unmapping error as hardware poisoning.
> >> 
> >> I'm kind of inclined towards g, to just treat unmapping errors as
> >> HWPOISON and buying into all the HWPOISON handling requirements. What do
> >> yall think? Can a TDX unmapping error be considered as memory poisoning?
> >
> > What does HWPOISON bring over refcounting the page/folio so that it never
> > returns to the page allocator?
> 
> For Topic 2 (handling TDX unmapping errors), HWPOISON is better than
> refcounting because refcounting interferes with conversions (see
> [INTERFERE WITH CONVERSIONS] above).
> 
> > We are bugging the TD in these cases.
> 
> Bugging the TD does not help to prevent future conversions from being
> interfered with.
> 
> 1. Conversions involves unmapping, so we could actually be in a
>    conversion, the unmapping is performed and fails, and then we try to
>    split and enter an infinite loop since private to shared conversions
>    assumes guest_memfd holds the only refcounts on guest_memfd memory.
We should bail out conversion even with the HWPOISON.
e.g.,
1. user triggers private-to-shared ioctl to convert 4K page A within a 2MB folio
   B to shared.
2. kvm_gmem_convert_should_proceed() executes kvm_gmem_split_private() and
   kvm_gmem_zap().
3. kvm_gmem_convert_should_proceed() checks kvm_gmem_has_invalid_folio()
   (Suppose TDX sets HWPOISON to page A or folio B after kvm_gmem_zap(), then
     kvm_gmem_has_invalid_folio() should return true).
4. Return -EFAULT.

If we allow the actual conversion to proceed after step 3, folio B will be split
into 4KB folios, with page A being converted to a shared 4KB folio, which
becomes accessible by userspace.

This could cause a machine check (#MC) on certain platforms. We should avoid
this scenario when possible.


> 2. The conversion ioctl is a guest_memfd ioctl, not a VM ioctl, and so
>    there is no check that the VM is not dead. There shouldn't be any
>    check on the VM, because shareability is a property of the memory and
>    should be changeable independent of the associated VM.
> 
> > Ohhh... Is
> > this about the code to allow gmem fds to be handed to new VMs?
> 
> Nope, it's not related to linking. The proposed KVM_LINK_GUEST_MEMFD
> ioctl [4] also doesn't check if the source VM is dead. There shouldn't
> be any check on the source VM, since the memory is from guest_memfd and
> should be independently transferable to a new VM.
> 
> [4] https://lore.kernel.org/lkml/cover.1747368092.git.afranji@google.com/T/