Message-ID: <a3cace55ee878fefc50c68bb2b1fa38851a67dd8.camel@intel.com>
Date: Wed, 25 Jun 2025 00:01:41 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "ackerleytng@...gle.com" <ackerleytng@...gle.com>, "Zhao, Yan Y"
	<yan.y.zhao@...el.com>
CC: "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao"
	<xiaoyao.li@...el.com>, "Shutemov, Kirill" <kirill.shutemov@...el.com>,
	"Hansen, Dave" <dave.hansen@...el.com>, "david@...hat.com"
	<david@...hat.com>, "thomas.lendacky@....com" <thomas.lendacky@....com>,
	"tabba@...gle.com" <tabba@...gle.com>, "vbabka@...e.cz" <vbabka@...e.cz>,
	"Du, Fan" <fan.du@...el.com>, "michael.roth@....com" <michael.roth@....com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"seanjc@...gle.com" <seanjc@...gle.com>, "Peng, Chao P"
	<chao.p.peng@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
	"Yamahata, Isaku" <isaku.yamahata@...el.com>, "binbin.wu@...ux.intel.com"
	<binbin.wu@...ux.intel.com>, "Weiny, Ira" <ira.weiny@...el.com>,
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Annapurve, Vishal"
	<vannapurve@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun"
	<jun.miao@...el.com>, "Li, Zhiquan1" <zhiquan1.li@...el.com>,
	"pgonda@...gle.com" <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge
 pages

On Tue, 2025-06-24 at 16:30 -0700, Ackerley Tng wrote:
> I see, let's call debug checking Topic 3 then, to separate it from Topic
> 1, which is TDX indicating that it is using a page for production
> kernels.
> 
> Topic 3: How should TDX indicate use of a page for debugging?
> 
> I'm okay if for debugging, TDX uses anything other than refcounts for
> checking, because refcounts will interfere with conversions.

Ok. It can be follow-on work, I think.

> 
> Rick's other email is correct. The correct link should be
> https://lore.kernel.org/all/aFJjZFFhrMWEPjQG@yzhao56-desk.sh.intel.com/.
> 
> [INTERFERE WITH CONVERSIONS]
> 
> To summarize, if TDX uses refcounts to indicate that it is using a page,
> or to indicate anything else, then we cannot easily split a page on
> private to shared conversions.
> 
> Specifically, consider the case where only the x-th subpage of a huge
> folio is mapped into Secure-EPTs. When the guest requests to convert
> some subpage to shared, the huge folio has to be split for
> core-mm. Core-mm, which will use the shared page, must have split folios
> to be able to accurately and separately track refcounts for subpages.
> 
> During splitting, guest_memfd would see refcount of 512 (for 2M page
> being in the filemap) + 1 (if TDX indicates that the x-th subpage is
> mapped using a refcount), but would not be able to tell that the 513th
> refcount belongs to the x-th subpage. guest_memfd can't split the huge
> folio unless it knows how to distribute the 513th refcount.
> 
> One might say guest_memfd could clear all the refcounts that TDX is
> holding on the huge folio by unmapping the entire huge folio from the
> Secure-EPTs, but unmapping the entire huge folio for TDX means zeroing
> the contents and requiring guest re-acceptance. Both of these would mess
> up guest operation.
> 
> Hence, guest_memfd's solution is to require that users of guest_memfd
> for private memory trust guest_memfd to maintain the pages around and
> not take any refcounts.
> 
> So back to Topic 1, for production kernels, is it okay that TDX does not
> need to indicate that it is using a page, and can trust guest_memfd to
> keep the page around for the VM?

I think Yan's concern is not totally invalid. But I don't see a problem if we
have a line of sight to adding debug checking as follow-on work. That is kind of
the path I was trying to nudge us toward.

> 
> > > 
> > > Topic 2: How to handle unmapping/splitting errors arising from TDX?
> > > 
> > > Previously I was in favor of having unmap() return an error (Rick
> > > suggested doing a POC, and in a more recent email Rick asked for a
> > > diffstat), but Vishal and I talked about this and now I agree having
> > > unmapping return an error is not a good approach for these reasons.
> > 
> > Ok, let's close this option then.
> > 
> > > 
> > > 1. Unmapping takes a range, and within the range there could be more
> > >    than one unmapping error. I was previously thinking that unmap()
> > >    could return 0 for success and the failed PFN on error. Returning a
> > >    single PFN on error is okay-ish but if there are more errors it could
> > >    get complicated.
> > > 
> > >    Another error return option could be to return the folio where the
> > >    unmapping/splitting issue happened, but that would not be
> > >    sufficiently precise, since a folio could be larger than 4K and we
> > >    want to track errors as precisely as we can to reduce memory loss due
> > >    to errors.
> > > 
> > > 2. What I think Yan has been trying to say: unmap() returning an error
> > >    is non-standard in the kernel.
> > > 
> > > I think (1) is the dealbreaker here and there's no need to do the
> > > plumbing POC and diffstat.
> > > 
> > > So I think we're all in support of indicating unmapping/splitting issues
> > > without returning anything from unmap(), and the discussed options are
> > > 
> > > a. Refcounts: won't work - mostly discussed in this (sub-)thread
> > >    [3]. Using refcounts makes it impossible to distinguish between
> > >    transient refcounts and refcounts due to errors.
> > > 
> > > b. Page flags: won't work with/can't benefit from HVO.
> > 
> > As above, this was for the purpose of catching bugs, not for guest_memfd to
> > logically depend on it.
> > 
> > > 
> > > Suggestions still in the running:
> > > 
> > > c. Folio flags are not precise enough to indicate which page actually
> > >    had an error, but this could be sufficient if we're willing to just
> > >    waste the rest of the huge page on unmapping error.
> > 
> > For a scenario of TDX module bug, it seems ok to me.
> > 
> > > 
> > > d. Folio flags with folio splitting on error. This means that on
> > >    unmapping/Secure EPT PTE splitting error, we have to split the
> > >    (larger than 4K) folio to 4K, and then set a flag on the split folio.
> > > 
> > >    The issue I see with this is that splitting pages with HVO applied
> > >    means doing allocations, and in an error scenario there may not be
> > >    memory left to split the pages.
> > > 
> > > e. Some other data structure in guest_memfd, say, a linked list, and a
> > >    function like kvm_gmem_add_error_pfn(struct page *page) that would
> > >    look up the guest_memfd inode from the page and add the page's pfn to
> > >    the linked list.
> > > 
> > >    Everywhere in guest_memfd that does unmapping/splitting would then
> > >    check this linked list to see if the unmapping/splitting
> > >    succeeded.
> > > 
> > >    Everywhere in guest_memfd that allocates pages will also check this
> > >    linked list to make sure the pages are functional.
> > > 
> > >    When guest_memfd truncates, if the page being truncated is on the
> > >    list, retain the refcount on the page and leak that page.
> > 
> > I think this is a fine option.
> > 
> > > 
> > > f. Combination of c and e, something similar to HugeTLB's
> > >    folio_set_hugetlb_hwpoison(), which sets a flag AND adds the pages in
> > >    trouble to a linked list on the folio.
> > > 
> > > g. Like f, but basically treat an unmapping error as hardware poisoning.
> > > 
> > > I'm kind of inclined towards g, to just treat unmapping errors as
> > > HWPOISON and buy into all the HWPOISON handling requirements. What do
> > > y'all think? Can a TDX unmapping error be considered memory poisoning?
> > 
> > What does HWPOISON bring over refcounting the page/folio so that it never
> > returns to the page allocator?
> 
> For Topic 2 (handling TDX unmapping errors), HWPOISON is better than
> refcounting because refcounting interferes with conversions (see
> [INTERFERE WITH CONVERSIONS] above).

I don't know if it quite fits. I think it would be better to not pollute the
concept if possible.

> 
> > We are bugging the TD in these cases.
> 
> Bugging the TD does not help to prevent future conversions from being
> interfered with.
> 
> 1. Conversions involve unmapping, so we could actually be in a
>    conversion, the unmapping is performed and fails, and then we try to
>    split and enter an infinite loop, since private-to-shared conversions
>    assume guest_memfd holds the only refcounts on guest_memfd memory.
> 
> 2. The conversion ioctl is a guest_memfd ioctl, not a VM ioctl, and so
>    there is no check that the VM is not dead. There shouldn't be any
>    check on the VM, because shareability is a property of the memory and
>    should be changeable independent of the associated VM.

Hmm, they are both about unlinking guest_memfd from a VM lifecycle then. Is that
a better way to put it?

> 
> > Ohhh... Is
> > this about the code to allow gmem fds to be handed to new VMs?
> 
> Nope, it's not related to linking. The proposed KVM_LINK_GUEST_MEMFD
> ioctl [4] also doesn't check if the source VM is dead. There shouldn't
> be any check on the source VM, since the memory is from guest_memfd and
> should be independently transferable to a new VM.

If a page is mapped in the old TD, and a new TD is started, re-mapping the same
page should be prevented somehow, right?

It really does seem like guest_memfd is the right place to keep the "stuck
page" state. If guest_memfd is not tied to a VM and can be re-used, it should be
the one to decide whether such pages can be mapped again. Refcounting on error is
about preventing return to the page allocator, but that is not the only problem.

I do think that these threads have gone on far too long. It's probably about
time to move forward with something, even if it's just to have something to
discuss that doesn't require footnoting so many lore links. So how about we move
forward with option e as a next step? Does that sound good, Yan?

Ackerley, thank you very much for pulling together this summary.
