linux-kernel - Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <diqz34bolnta.fsf@ackerleytng-ctop.c.googlers.com>
Date: Tue, 24 Jun 2025 16:30:57 -0700
From: Ackerley Tng <ackerleytng@...gle.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>, "Zhao, Yan Y" <yan.y.zhao@...el.com>
Cc: "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>, 
	"Du, Fan" <fan.du@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>, 
	"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com" <thomas.lendacky@....com>, 
	"tabba@...gle.com" <tabba@...gle.com>, "vbabka@...e.cz" <vbabka@...e.cz>, 
	"Shutemov, Kirill" <kirill.shutemov@...el.com>, "michael.roth@....com" <michael.roth@....com>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "seanjc@...gle.com" <seanjc@...gle.com>, 
	"pbonzini@...hat.com" <pbonzini@...hat.com>, 
	"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>, 
	"Peng, Chao P" <chao.p.peng@...el.com>, "Weiny, Ira" <ira.weiny@...el.com>, 
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Annapurve, Vishal" <vannapurve@...gle.com>, 
	"jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun" <jun.miao@...el.com>, 
	"Li, Zhiquan1" <zhiquan1.li@...el.com>, "pgonda@...gle.com" <pgonda@...gle.com>, 
	"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

"Edgecombe, Rick P" <rick.p.edgecombe@...el.com> writes:

> On Mon, 2025-06-23 at 15:48 -0700, Ackerley Tng wrote:
>> Let me try and summarize the current state of this discussion:
>> 
>> Topic 1: Does TDX need to somehow indicate that it is using a page?
>> 
>> This patch series uses refcounts to indicate that TDX is using a page,
>> but that complicates private-to-shared conversions.
>> 
>> During a private-to-shared conversion, guest_memfd assumes that
>> guest_memfd is trusted to manage private memory. TDX and other users
>> should trust guest_memfd to keep the memory around.
>> 
>> Yan's position is that holding a refcount is in line with how IOMMU
>> takes a refcount when a page is mapped into the IOMMU [1].
>> 
>> Yan had another suggestion, which is to indicate using a page flag [2].
>> 
>> I think we're in agreement that we don't want to have TDX hold a
>> refcount while the page is mapped into the Secure EPTs, but taking a
>> step back, do we really need to indicate (at all) that TDX is using a
>> page?
>> 
>> In [3] Yan said
>> 
>> > If TDX does not hold any refcount, guest_memfd has to know that which
>> > private
>> > page is still mapped. Otherwise, the page may be re-assigned to other
>> > kernel
>> > components while it may still be mapped in the S-EPT.
>> 
>> If the private page is mapped for regular VM use as private memory,
>> guest_memfd is managing that, and the same page will not be re-assigned
>> to any other kernel component. guest_memfd does hold refcounts in
>> guest_memfd's filemap.
>> 
>> If the private page is still mapped because there was an unmapping
>> failure, we can discuss that separately under error handling in Topic 2.
>> 
>> With this, can I confirm that we are in agreement that TDX does not need
>> to indicate that it is using a page, and can trust guest_memfd to keep
>> the page around for the VM?
>
> Minor correction here. Yan was concerned about *bugs* happening when freeing
> pages that are accidentally still mapped in the S-EPT. My opinion is that this
> is not especially risky to happen here vs other similar places, but it could be
> helpful if there was a way to catch such bugs. The page flag, or page_ext
> direction came out of a discussion with Dave and Kirill. If it could run all the
> time that would be great, but if not a debug config could be sufficient. For
> example like CONFIG_PAGE_TABLE_CHECK. It doesn't need to support vmemmap
> optimizations because the debug checking doesn't need to run all the time.
> Overhead for debug settings is very normal.
>

I see, let's call debug checking Topic 3 then, to separate it from Topic
1, which is TDX indicating that it is using a page for production
kernels.

Topic 3: How should TDX indicate use of a page for debugging?

I'm okay if for debugging, TDX uses anything other than refcounts for
checking, because refcounts will interfere with conversions.

Rick's other email is correct. The correct link should be
https://lore.kernel.org/all/aFJjZFFhrMWEPjQG@yzhao56-desk.sh.intel.com/.

[INTERFERE WITH CONVERSIONS]

To summarize, if TDX uses refcounts to indicate that it is using a page,
or to indicate anything else, then we cannot easily split a page on
private to shared conversions.

Specifically, consider the case where only the x-th subpage of a huge
folio is mapped into Secure-EPTs. When the guest requests to convert
some subpage to shared, the huge folio has to be split for
core-mm. Core-mm, which will use the shared page, must have split folios
to be able to accurately and separately track refcounts for subpages.

During splitting, guest_memfd would see refcount of 512 (for 2M page
being in the filemap) + 1 (if TDX indicates that the x-th subpage is
mapped using a refcount), but would not be able to tell that the 513th
refcount belongs to the x-th subpage. guest_memfd can't split the huge
folio unless it knows how to distribute the 513th refcount.

One might say guest_memfd could clear all the refcounts that TDX is
holding on the huge folio by unmapping the entire huge folio from the
Secure-EPTs, but unmapping the entire huge folio for TDX means zeroing
the contents and requiring guest re-acceptance. Both of these would mess
up guest operation.

Hence, guest_memfd's solution is to require that users of guest_memfd
for private memory trust guest_memfd to maintain the pages around and
not take any refcounts.

So back to Topic 1, for production kernels, is it okay that TDX does not
need to indicate that it is using a page, and can trust guest_memfd to
keep the page around for the VM?

>> 
>> Topic 2: How to handle unmapping/splitting errors arising from TDX?
>> 
>> Previously I was in favor of having unmap() return an error (Rick
>> suggested doing a POC, and in a more recent email Rick asked for a
>> diffstat), but Vishal and I talked about this and now I agree having
>> unmapping return an error is not a good approach for these reasons.
>
> Ok, let's close this option then.
>
>> 
>> 1. Unmapping takes a range, and within the range there could be more
>>    than one unmapping error. I was previously thinking that unmap()
>>    could return 0 for success and the failed PFN on error. Returning a
>>    single PFN on error is okay-ish but if there are more errors it could
>>    get complicated.
>> 
>>    Another error return option could be to return the folio where the
>>    unmapping/splitting issue happened, but that would not be
>>    sufficiently precise, since a folio could be larger than 4K and we
>>    want to track errors as precisely as we can to reduce memory loss due
>>    to errors.
>> 
>> 2. What I think Yan has been trying to say: unmap() returning an error
>>    is non-standard in the kernel.
>> 
>> I think (1) is the dealbreaker here and there's no need to do the
>> plumbing POC and diffstat.
>> 
>> So I think we're all in support of indicating unmapping/splitting issues
>> without returning anything from unmap(), and the discussed options are
>> 
>> a. Refcounts: won't work - mostly discussed in this (sub-)thread
>>    [3]. Using refcounts makes it impossible to distinguish between
>>    transient refcounts and refcounts due to errors.
>> 
>> b. Page flags: won't work with/can't benefit from HVO.
>
> As above, this was for the purpose of catching bugs, not for guestmemfd to
> logically depend on it.
>
>> 
>> Suggestions still in the running:
>> 
>> c. Folio flags are not precise enough to indicate which page actually
>>    had an error, but this could be sufficient if we're willing to just
>>    waste the rest of the huge page on unmapping error.
>
> For a scenario of TDX module bug, it seems ok to me.
>
>> 
>> d. Folio flags with folio splitting on error. This means that on
>>    unmapping/Secure EPT PTE splitting error, we have to split the
>>    (larger than 4K) folio to 4K, and then set a flag on the split folio.
>> 
>>    The issue I see with this is that splitting pages with HVO applied
>>    means doing allocations, and in an error scenario there may not be
>>    memory left to split the pages.
>> 
>> e. Some other data structure in guest_memfd, say, a linked list, and a
>>    function like kvm_gmem_add_error_pfn(struct page *page) that would
>>    look up the guest_memfd inode from the page and add the page's pfn to
>>    the linked list.
>> 
>>    Everywhere in guest_memfd that does unmapping/splitting would then
>>    check this linked list to see if the unmapping/splitting
>>    succeeded.
>> 
>>    Everywhere in guest_memfd that allocates pages will also check this
>>    linked list to make sure the pages are functional.
>> 
>>    When guest_memfd truncates, if the page being truncated is on the
>>    list, retain the refcount on the page and leak that page.
>
> I think this is a fine option.
>
>> 
>> f. Combination of c and e, something similar to HugeTLB's
>>    folio_set_hugetlb_hwpoison(), which sets a flag AND adds the pages in
>>    trouble to a linked list on the folio.
>> 
>> g. Like f, but basically treat an unmapping error as hardware poisoning.
>> 
>> I'm kind of inclined towards g, to just treat unmapping errors as
>> HWPOISON and buying into all the HWPOISON handling requirements. What do
>> yall think? Can a TDX unmapping error be considered as memory poisoning?
>
> What does HWPOISON bring over refcounting the page/folio so that it never
> returns to the page allocator?

For Topic 2 (handling TDX unmapping errors), HWPOISON is better than
refcounting because refcounting interferes with conversions (see
[INTERFERE WITH CONVERSIONS] above).

> We are bugging the TD in these cases.

Bugging the TD does not help to prevent future conversions from being
interfered with.

1. Conversions involves unmapping, so we could actually be in a
   conversion, the unmapping is performed and fails, and then we try to
   split and enter an infinite loop since private to shared conversions
   assumes guest_memfd holds the only refcounts on guest_memfd memory.

2. The conversion ioctl is a guest_memfd ioctl, not a VM ioctl, and so
   there is no check that the VM is not dead. There shouldn't be any
   check on the VM, because shareability is a property of the memory and
   should be changeable independent of the associated VM.

> Ohhh... Is
> this about the code to allow gmem fds to be handed to new VMs?

Nope, it's not related to linking. The proposed KVM_LINK_GUEST_MEMFD
ioctl [4] also doesn't check if the source VM is dead. There shouldn't
be any check on the source VM, since the memory is from guest_memfd and
should be independently transferable to a new VM.

[4] https://lore.kernel.org/lkml/cover.1747368092.git.afranji@google.com/T/