Message-ID: <diqzplerjutn.fsf@ackerleytng-ctop.c.googlers.com>
Date: Wed, 25 Jun 2025 15:54:44 -0700
From: Ackerley Tng <ackerleytng@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
"quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>,
"Du, Fan" <fan.du@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com" <thomas.lendacky@....com>,
"tabba@...gle.com" <tabba@...gle.com>, "vbabka@...e.cz" <vbabka@...e.cz>,
"Shutemov, Kirill" <kirill.shutemov@...el.com>, "michael.roth@....com" <michael.roth@....com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "seanjc@...gle.com" <seanjc@...gle.com>,
"pbonzini@...hat.com" <pbonzini@...hat.com>,
"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
"Peng, Chao P" <chao.p.peng@...el.com>, "Weiny, Ira" <ira.weiny@...el.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Annapurve, Vishal" <vannapurve@...gle.com>,
"jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun" <jun.miao@...el.com>,
"Li, Zhiquan1" <zhiquan1.li@...el.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Yan Zhao <yan.y.zhao@...el.com> writes:
> On Tue, Jun 24, 2025 at 04:30:57PM -0700, Ackerley Tng wrote:
>> "Edgecombe, Rick P" <rick.p.edgecombe@...el.com> writes:
>>
>> > On Mon, 2025-06-23 at 15:48 -0700, Ackerley Tng wrote:
>> >> Let me try and summarize the current state of this discussion:
>> >>
>> >> Topic 1: Does TDX need to somehow indicate that it is using a page?
>> >>
>> >> This patch series uses refcounts to indicate that TDX is using a page,
>> >> but that complicates private-to-shared conversions.
>> >>
>> >> During a private-to-shared conversion, guest_memfd assumes that
>> >> guest_memfd is trusted to manage private memory. TDX and other users
>> >> should trust guest_memfd to keep the memory around.
>> >>
>> >> Yan's position is that holding a refcount is in line with how IOMMU
>> >> takes a refcount when a page is mapped into the IOMMU [1].
>> >>
>> >> Yan had another suggestion, which is to indicate using a page flag [2].
>> >>
>> >> I think we're in agreement that we don't want to have TDX hold a
>> >> refcount while the page is mapped into the Secure EPTs, but taking a
>> >> step back, do we really need to indicate (at all) that TDX is using a
>> >> page?
>> >>
>> >> In [3] Yan said
>> >>
>> >> > If TDX does not hold any refcount, guest_memfd has to know which
>> >> > private page is still mapped. Otherwise, the page may be
>> >> > re-assigned to other kernel components while it may still be
>> >> > mapped in the S-EPT.
>> >>
>> >> If the private page is mapped for regular VM use as private memory,
>> >> guest_memfd is managing that, and the same page will not be re-assigned
>> >> to any other kernel component. guest_memfd does hold refcounts in
>> >> guest_memfd's filemap.
>> >>
>> >> If the private page is still mapped because there was an unmapping
>> >> failure, we can discuss that separately under error handling in Topic 2.
>> >>
>> >> With this, can I confirm that we are in agreement that TDX does not need
>> >> to indicate that it is using a page, and can trust guest_memfd to keep
>> >> the page around for the VM?
>> >
>> > Minor correction here. Yan was concerned about *bugs* happening when freeing
>> > pages that are accidentally still mapped in the S-EPT. My opinion is that this
>> > is not especially risky to happen here vs other similar places, but it could be
>> > helpful if there was a way to catch such bugs. The page flag, or page_ext
>> > direction came out of a discussion with Dave and Kirill. If it could run all the
>> > time that would be great, but if not a debug config could be sufficient. For
>> > example like CONFIG_PAGE_TABLE_CHECK. It doesn't need to support vmemmap
>> > optimizations because the debug checking doesn't need to run all the time.
>> > Overhead for debug settings is very normal.
>> >
>>
>> I see, let's call debug checking Topic 3 then, to separate it from Topic
>> 1, which is TDX indicating that it is using a page for production
>> kernels.
>>
>> Topic 3: How should TDX indicate use of a page for debugging?
>>
>> I'm okay if for debugging, TDX uses anything other than refcounts for
>> checking, because refcounts will interfere with conversions.
>>
>> Rick's other email is correct. The correct link should be
>> https://lore.kernel.org/all/aFJjZFFhrMWEPjQG@yzhao56-desk.sh.intel.com/.
>>
>> [INTERFERE WITH CONVERSIONS]
>>
>> To summarize, if TDX uses refcounts to indicate that it is using a page,
>> or to indicate anything else, then we cannot easily split a page on
>> private-to-shared conversions.
>>
>> Specifically, consider the case where only the x-th subpage of a huge
>> folio is mapped into Secure-EPTs. When the guest requests to convert
>> some subpage to shared, the huge folio has to be split for
>> core-mm. Core-mm, which will use the shared page, must have split folios
>> to be able to accurately and separately track refcounts for subpages.
>>
>> During splitting, guest_memfd would see refcount of 512 (for 2M page
>> being in the filemap) + 1 (if TDX indicates that the x-th subpage is
>> mapped using a refcount), but would not be able to tell that the 513th
>> refcount belongs to the x-th subpage. guest_memfd can't split the huge
>> folio unless it knows how to distribute the 513th refcount.
> In my POC, https://lore.kernel.org/all/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com/
> kvm_gmem_private_has_safe_refcount() was introduced to check the folio refcount.
> It rejects private-to-shared conversion after splitting and unmapping KVM's
> secondary page tables if the refcount exceeds a valid threshold.
>
> Though in
> https://lore.kernel.org/all/aFJjZFFhrMWEPjQG@yzhao56-desk.sh.intel.com/, we
> agreed that "EAGAIN is not the right code in case of "extra" refcounts held by
> TDX", this does not imply that rejecting the conversion itself is incorrect.
>
> This is why we are exploring alternative solutions instead of having TDX hold
> the page refcount.
>
> So, either a per-page flag, a per-folio flag, or solutions e, f, or g should be
> good.
>
> IMO, regardless of the final choice, guest_memfd needs to identify problematic
> folios to:
> - reject the private-to-shared conversion
> - prevent further recycling after kvm_gmem_free_folio()
>
Agreed!
>> One might say guest_memfd could clear all the refcounts that TDX is
>> holding on the huge folio by unmapping the entire huge folio from the
>> Secure-EPTs, but unmapping the entire huge folio for TDX means zeroing
>> the contents and requiring guest re-acceptance. Both of these would mess
>> up guest operation.
>>
>> Hence, guest_memfd's solution is to require that users of guest_memfd
>> for private memory trust guest_memfd to keep the pages around and not
>> take any refcounts.
>>
>> So back to Topic 1, for production kernels, is it okay that TDX does not
>> need to indicate that it is using a page, and can trust guest_memfd to
>> keep the page around for the VM?
> If the "TDX does not need to indicate that it is using a page" means "do not
> take page refcount", I'm ok.
>
Yes, I was trying to generalize "do not take page refcount" to "TDX does
not need to indicate that it is using a page", but I guess TDX can
indicate that it is using a page for debugging as long as it doesn't
use refcounts or otherwise interfere with conversions. So I believe we
are in agreement on Topic 1 :)
>> >>
>> >> Topic 2: How to handle unmapping/splitting errors arising from TDX?
>> >>
>> >> Previously I was in favor of having unmap() return an error (Rick
>> >> suggested doing a POC, and in a more recent email Rick asked for a
>> >> diffstat), but Vishal and I talked about this and now I agree having
>> >> unmapping return an error is not a good approach for these reasons.
>> >
>> > Ok, let's close this option then.
>> >
>> >>
>> >> 1. Unmapping takes a range, and within the range there could be more
>> >> than one unmapping error. I was previously thinking that unmap()
>> >> could return 0 for success and the failed PFN on error. Returning a
>> >> single PFN on error is okay-ish but if there are more errors it could
>> >> get complicated.
>> >>
>> >> Another error return option could be to return the folio where the
>> >> unmapping/splitting issue happened, but that would not be
>> >> sufficiently precise, since a folio could be larger than 4K and we
>> >> want to track errors as precisely as we can to reduce memory loss due
>> >> to errors.
>> >>
>> >> 2. What I think Yan has been trying to say: unmap() returning an error
>> >> is non-standard in the kernel.
>> >>
>> >> I think (1) is the dealbreaker here and there's no need to do the
>> >> plumbing POC and diffstat.
>> >>
>> >> So I think we're all in support of indicating unmapping/splitting issues
>> >> without returning anything from unmap(), and the discussed options are
>> >>
>> >> a. Refcounts: won't work - mostly discussed in this (sub-)thread
>> >> [3]. Using refcounts makes it impossible to distinguish between
>> >> transient refcounts and refcounts due to errors.
>> >>
>> >> b. Page flags: won't work with/can't benefit from HVO.
>> >
>> > As above, this was for the purpose of catching bugs, not for guestmemfd to
>> > logically depend on it.
>> >
>> >>
>> >> Suggestions still in the running:
>> >>
>> >> c. Folio flags are not precise enough to indicate which page actually
>> >> had an error, but this could be sufficient if we're willing to just
>> >> waste the rest of the huge page on unmapping error.
>> >
>> > For a scenario of TDX module bug, it seems ok to me.
>> >
>> >>
>> >> d. Folio flags with folio splitting on error. This means that on
>> >> unmapping/Secure EPT PTE splitting error, we have to split the
>> >> (larger than 4K) folio to 4K, and then set a flag on the split folio.
>> >>
>> >> The issue I see with this is that splitting pages with HVO applied
>> >> means doing allocations, and in an error scenario there may not be
>> >> memory left to split the pages.
>> >>
>> >> e. Some other data structure in guest_memfd, say, a linked list, and a
>> >> function like kvm_gmem_add_error_pfn(struct page *page) that would
>> >> look up the guest_memfd inode from the page and add the page's pfn to
>> >> the linked list.
>> >>
>> >> Everywhere in guest_memfd that does unmapping/splitting would then
>> >> check this linked list to see if the unmapping/splitting
>> >> succeeded.
>> >>
>> >> Everywhere in guest_memfd that allocates pages will also check this
>> >> linked list to make sure the pages are functional.
>> >>
>> >> When guest_memfd truncates, if the page being truncated is on the
>> >> list, retain the refcount on the page and leak that page.
>> >
>> > I think this is a fine option.
>> >
>> >>
>> >> f. Combination of c and e, something similar to HugeTLB's
>> >> folio_set_hugetlb_hwpoison(), which sets a flag AND adds the pages in
>> >> trouble to a linked list on the folio.
>> >>
>> >> g. Like f, but basically treat an unmapping error as hardware poisoning.
>> >>
>> >> I'm kind of inclined towards g, to just treat unmapping errors as
>> >> HWPOISON and buying into all the HWPOISON handling requirements. What do
>> >> y'all think? Can a TDX unmapping error be considered memory poisoning?
>> >
>> > What does HWPOISON bring over refcounting the page/folio so that it never
>> > returns to the page allocator?
>>
>> For Topic 2 (handling TDX unmapping errors), HWPOISON is better than
>> refcounting because refcounting interferes with conversions (see
>> [INTERFERE WITH CONVERSIONS] above).
>>
>> > We are bugging the TD in these cases.
>>
>> Bugging the TD does not help to prevent future conversions from being
>> interfered with.
>>
>> 1. Conversions involve unmapping, so we could actually be in a
>>    conversion: the unmapping is performed and fails, and then we try to
>>    split and enter an infinite loop, since private-to-shared conversion
>>    assumes guest_memfd holds the only refcounts on guest_memfd memory.
> We should bail out of the conversion even with HWPOISON.
> e.g.,
> 1. user triggers private-to-shared ioctl to convert 4K page A within a 2MB folio
> B to shared.
> 2. kvm_gmem_convert_should_proceed() executes kvm_gmem_split_private() and
> kvm_gmem_zap().
> 3. kvm_gmem_convert_should_proceed() checks kvm_gmem_has_invalid_folio()
> (Suppose TDX sets HWPOISON to page A or folio B after kvm_gmem_zap(), then
> kvm_gmem_has_invalid_folio() should return true).
> 4. Return -EFAULT.
>
> If we allow the actual conversion to proceed after step 3, folio B will be split
> into 4KB folios, with page A being converted to a shared 4KB folio, which
> becomes accessible by userspace.
>
> This could cause a machine check (#MC) on certain platforms. We should avoid
> this scenario when possible.
>
>
Thanks for pointing this out. This is a good point and will definitely
have to be handled separately under "what to do when there was some
issue on a page (possibly caused by unmapping)", or, as Yan put it
above, when "handling problematic folios".
Regarding handling errors, or recording errors, or communicating errors
to guest_memfd, Yan seems in favor of some kind of page flag. I know
Rick is suggesting option e. Can we do something between f and g? I'm
thinking that the easiest page flag to use is the HWpoison flag, because
1. HWpoison checking is already part of guest_memfd, or should
be.
guest_memfd already checks HWpoison for kvm_gmem_get_pfn(), and
__do_fault() checks HWpoison for guest_memfd [5]. As Yan pointed out
above, it should definitely check and deal with HWpoison on
conversion. Perhaps on truncation it should look at HWpoison, or very
likely memory_failure() will need special handling for guest_memfd
folios. I'll look into this separately as part of the HugeTLB support
patch series.
2. HWpoison support (especially tracking of sub-folio HWpoison) in
folio_set_hugetlb_hwpoison() can hopefully be reused for guest_memfd.
3. For now, no need to invent a new tracking mechanism or data structure
to support option e.
4. HWpoison is kind of "part of guest_memfd" if you consider that
guest_memfd folios are pretty much always owned by some guest_memfd
inode, and that the HWpoison flag is checked at the appropriate places.
Could we (at least for the next RFC of this TDX huge page support for
private memory series, or as a first cut) use HWpoison, and if we later
find that the concept of HWpoison is being overloaded/polluted, try
another flag/mechanism for tracking?
I plan to work on HWpoison/kvm_gmem_error_folio() handling for HugeTLB
soon, and I'll keep the community updated if I find anything new or
incompatible.
>> 2. The conversion ioctl is a guest_memfd ioctl, not a VM ioctl, and so
>> there is no check that the VM is not dead. There shouldn't be any
>> check on the VM, because shareability is a property of the memory and
>> should be changeable independent of the associated VM.
>>
>> > Ohhh... Is
>> > this about the code to allow gmem fds to be handed to new VMs?
>>
>> Nope, it's not related to linking. The proposed KVM_LINK_GUEST_MEMFD
>> ioctl [4] also doesn't check if the source VM is dead. There shouldn't
>> be any check on the source VM, since the memory is from guest_memfd and
>> should be independently transferable to a new VM.
>>
>> [4] https://lore.kernel.org/lkml/cover.1747368092.git.afranji@google.com/T/
[5] https://lore.kernel.org/all/diqzv7ojjxyd.fsf@ackerleytng-ctop.c.googlers.com/