linux-kernel - Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

Message-ID: <diqzms9vju5j.fsf@ackerleytng-ctop.c.googlers.com>
Date: Wed, 25 Jun 2025 16:09:12 -0700
From: Ackerley Tng <ackerleytng@...gle.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>, "Zhao, Yan Y" <yan.y.zhao@...el.com>
Cc: "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>, 
	"Shutemov, Kirill" <kirill.shutemov@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>, 
	"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com" <thomas.lendacky@....com>, 
	"tabba@...gle.com" <tabba@...gle.com>, "vbabka@...e.cz" <vbabka@...e.cz>, "Du, Fan" <fan.du@...el.com>, 
	"michael.roth@....com" <michael.roth@....com>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "seanjc@...gle.com" <seanjc@...gle.com>, 
	"Peng, Chao P" <chao.p.peng@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>, 
	"Yamahata, Isaku" <isaku.yamahata@...el.com>, 
	"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>, "Weiny, Ira" <ira.weiny@...el.com>, 
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Annapurve, Vishal" <vannapurve@...gle.com>, 
	"jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun" <jun.miao@...el.com>, 
	"Li, Zhiquan1" <zhiquan1.li@...el.com>, "pgonda@...gle.com" <pgonda@...gle.com>, 
	"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

"Edgecombe, Rick P" <rick.p.edgecombe@...el.com> writes:

> On Tue, 2025-06-24 at 16:30 -0700, Ackerley Tng wrote:
>> I see, let's call debug checking Topic 3 then, to separate it from Topic
>> 1, which is TDX indicating that it is using a page for production
>> kernels.
>> 
>> Topic 3: How should TDX indicate use of a page for debugging?
>> 
>> I'm okay if for debugging, TDX uses anything other than refcounts for
>> checking, because refcounts will interfere with conversions.
>
> Ok. It can be follow on work I think.
>

Yup I agree.

>> 
>> Rick's other email is correct. The correct link should be
>> https://lore.kernel.org/all/aFJjZFFhrMWEPjQG@yzhao56-desk.sh.intel.com/.
>> 
>> [INTERFERE WITH CONVERSIONS]
>> 
>> To summarize, if TDX uses refcounts to indicate that it is using a page,
>> or to indicate anything else, then we cannot easily split a page on
>> private to shared conversions.
>> 
>> Specifically, consider the case where only the x-th subpage of a huge
>> folio is mapped into Secure-EPTs. When the guest requests to convert
>> some subpage to shared, the huge folio has to be split for
>> core-mm. Core-mm, which will use the shared page, must have split folios
>> to be able to accurately and separately track refcounts for subpages.
>> 
>> During splitting, guest_memfd would see refcount of 512 (for 2M page
>> being in the filemap) + 1 (if TDX indicates that the x-th subpage is
>> mapped using a refcount), but would not be able to tell that the 513th
>> refcount belongs to the x-th subpage. guest_memfd can't split the huge
>> folio unless it knows how to distribute the 513th refcount.
>> 
>> One might say guest_memfd could clear all the refcounts that TDX is
>> holding on the huge folio by unmapping the entire huge folio from the
>> Secure-EPTs, but unmapping the entire huge folio for TDX means zeroing
>> the contents and requiring guest re-acceptance. Both of these would mess
>> up guest operation.
>> 
>> Hence, guest_memfd's solution is to require that users of guest_memfd
>> for private memory trust guest_memfd to maintain the pages around and
>> not take any refcounts.
>> 
>> So back to Topic 1, for production kernels, is it okay that TDX does not
>> need to indicate that it is using a page, and can trust guest_memfd to
>> keep the page around for the VM?
>
> I think Yan's concern is not totally invalid. But I don't see a problem if we
> have a line of sight to adding debug checking as follow on work. That is kind of
> the path I was trying to nudge.
>
>> 
>> > > 
>> > > Topic 2: How to handle unmapping/splitting errors arising from TDX?
>> > > 
>> > > Previously I was in favor of having unmap() return an error (Rick
>> > > suggested doing a POC, and in a more recent email Rick asked for a
>> > > diffstat), but Vishal and I talked about this and now I agree having
>> > > unmapping return an error is not a good approach for these reasons.
>> > 
>> > Ok, let's close this option then.
>> > 
>> > > 
>> > > 1. Unmapping takes a range, and within the range there could be more
>> > >    than one unmapping error. I was previously thinking that unmap()
>> > >    could return 0 for success and the failed PFN on error. Returning a
>> > >    single PFN on error is okay-ish but if there are more errors it could
>> > >    get complicated.
>> > > 
>> > >    Another error return option could be to return the folio where the
>> > >    unmapping/splitting issue happened, but that would not be
>> > >    sufficiently precise, since a folio could be larger than 4K and we
>> > >    want to track errors as precisely as we can to reduce memory loss due
>> > >    to errors.
>> > > 
>> > > 2. What I think Yan has been trying to say: unmap() returning an error
>> > >    is non-standard in the kernel.
>> > > 
>> > > I think (1) is the dealbreaker here and there's no need to do the
>> > > plumbing POC and diffstat.
>> > > 
>> > > So I think we're all in support of indicating unmapping/splitting issues
>> > > without returning anything from unmap(), and the discussed options are
>> > > 
>> > > a. Refcounts: won't work - mostly discussed in this (sub-)thread
>> > >    [3]. Using refcounts makes it impossible to distinguish between
>> > >    transient refcounts and refcounts due to errors.
>> > > 
>> > > b. Page flags: won't work with/can't benefit from HVO.
>> > 
>> > As above, this was for the purpose of catching bugs, not for guestmemfd to
>> > logically depend on it.
>> > 
>> > > 
>> > > Suggestions still in the running:
>> > > 
>> > > c. Folio flags are not precise enough to indicate which page actually
>> > >    had an error, but this could be sufficient if we're willing to just
>> > >    waste the rest of the huge page on unmapping error.
>> > 
>> > For a scenario of TDX module bug, it seems ok to me.
>> > 
>> > > 
>> > > d. Folio flags with folio splitting on error. This means that on
>> > >    unmapping/Secure EPT PTE splitting error, we have to split the
>> > >    (larger than 4K) folio to 4K, and then set a flag on the split folio.
>> > > 
>> > >    The issue I see with this is that splitting pages with HVO applied
>> > >    means doing allocations, and in an error scenario there may not be
>> > >    memory left to split the pages.
>> > > 
>> > > e. Some other data structure in guest_memfd, say, a linked list, and a
>> > >    function like kvm_gmem_add_error_pfn(struct page *page) that would
>> > >    look up the guest_memfd inode from the page and add the page's pfn to
>> > >    the linked list.
>> > > 
>> > >    Everywhere in guest_memfd that does unmapping/splitting would then
>> > >    check this linked list to see if the unmapping/splitting
>> > >    succeeded.
>> > > 
>> > >    Everywhere in guest_memfd that allocates pages will also check this
>> > >    linked list to make sure the pages are functional.
>> > > 
>> > >    When guest_memfd truncates, if the page being truncated is on the
>> > >    list, retain the refcount on the page and leak that page.
>> > 
>> > I think this is a fine option.
>> > 
>> > > 
>> > > f. Combination of c and e, something similar to HugeTLB's
>> > >    folio_set_hugetlb_hwpoison(), which sets a flag AND adds the pages in
>> > >    trouble to a linked list on the folio.
>> > > 
>> > > g. Like f, but basically treat an unmapping error as hardware poisoning.
>> > > 
>> > > I'm kind of inclined towards g, to just treat unmapping errors as
>> > > HWPOISON and buying into all the HWPOISON handling requirements. What do
>> > > yall think? Can a TDX unmapping error be considered as memory poisoning?
>> > 
>> > What does HWPOISON bring over refcounting the page/folio so that it never
>> > returns to the page allocator?
>> 
>> For Topic 2 (handling TDX unmapping errors), HWPOISON is better than
>> refcounting because refcounting interferes with conversions (see
>> [INTERFERE WITH CONVERSIONS] above).
>
> I don't know if it quite fits. I think it would be better to not pollute the
> concept if possible.
>

If we know for sure that something doesn't fit, and that we would be
overloading/polluting the concept of HWpoison, then we shouldn't
proceed. Otherwise, is it okay to go with HWpoison as a first cut? I
replied to Yan's email with reasons why I think we should give HWpoison
a try, at least for the next RFC.
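
As a toy illustration of the [INTERFERE WITH CONVERSIONS] point quoted
above, here is a userspace C model (not kernel code). All toy_* names
are made up for this sketch. It only shows that a single shared
refcount cannot record which subpage an extra reference was taken for.

```c
#include <assert.h>
#include <stdbool.h>

#define SUBPAGES_PER_2M 512	/* 2M folio / 4K pages */

/*
 * Toy model: a huge folio has ONE shared refcount. The filemap holds
 * SUBPAGES_PER_2M references; anything beyond that is "extra" and
 * carries no record of which subpage it was taken for.
 */
struct toy_folio {
	int refcount;
};

/* Hypothetical helper: references not owned by the filemap. */
static int toy_extra_refs(const struct toy_folio *f)
{
	return f->refcount - SUBPAGES_PER_2M;
}

/*
 * Splitting must give each 4K page its own refcount. In this model that
 * is only well-defined when there are no extra references to distribute.
 */
static bool toy_can_split(const struct toy_folio *f)
{
	return toy_extra_refs(f) == 0;
}

/* Model TDX taking a ref for subpage idx: idx is not stored anywhere. */
static void toy_tdx_get(struct toy_folio *f, int idx)
{
	assert(idx >= 0 && idx < SUBPAGES_PER_2M);
	f->refcount++;
}
```

Taking a reference for subpage 3 or for subpage 400 leaves the folio in
the same state (refcount 513), so the splitter cannot tell which 4K page
should inherit the extra reference.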

>> 
>> > We are bugging the TD in these cases.
>> 
>> Bugging the TD does not help to prevent future conversions from being
>> interfered with.
>> 
>> 1. Conversions involve unmapping, so we could actually be in a
>>    conversion, the unmapping is performed and fails, and then we try to
>>    split and enter an infinite loop since private to shared conversions
>>    assume guest_memfd holds the only refcounts on guest_memfd memory.
>> 
>> 2. The conversion ioctl is a guest_memfd ioctl, not a VM ioctl, and so
>>    there is no check that the VM is not dead. There shouldn't be any
>>    check on the VM, because shareability is a property of the memory and
>>    should be changeable independent of the associated VM.
>
> Hmm, they are both about unlinking guestmemfd from a VM lifecycle then. Is that
> a better way to put it?
>

Unmapping during conversions doesn't take memory away from a VM; it just
forces the memory to be re-faulted as shared. So "unlinking memory from
a VM lifecycle" isn't quite accurate, if I understand you correctly.

>> 
>> > Ohhh... Is
>> > this about the code to allow gmem fds to be handed to new VMs?
>> 
>> Nope, it's not related to linking. The proposed KVM_LINK_GUEST_MEMFD
>> ioctl [4] also doesn't check if the source VM is dead. There shouldn't
>> be any check on the source VM, since the memory is from guest_memfd and
>> should be independently transferable to a new VM.
>
> If a page is mapped in the old TD, and a new TD is started, re-mapping the same
> page should be prevented somehow, right?
>

Currently I'm thinking that if we go with HWpoison, the new TD will
still get the HWpoison-ed page. The new TD will get the SIGBUS when it
next faults the HWpoison-ed page.

Are you thinking that the HWpoison-ed page should be replaced with a
non-poisoned page for the new TD to run?

Or are you thinking that

* the handing over should be blocked, or
* mapping itself should be blocked, or
* faulting should be blocked?

If handing over should be blocked, could we perhaps scan for HWpoison
when doing the handover and block it there?

I guess I'm trying to do as little as possible during error discovery
(hoping to just mark HWpoison), error handling (just unmap from guest
page tables, like guest_memfd does now), and defer handling to
fault/conversion/perhaps truncation time.
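
A minimal userspace C sketch of that split of responsibilities follows.
It is a toy model, not guest_memfd code: struct toy_gmem and every
toy_* function are hypothetical. Discovery only sets a marker, handling
only unmaps, and the fault path (or an optional scan at handover) is
where the marker is acted on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PAGES 1024

enum toy_fault_result { TOY_FAULT_OK, TOY_FAULT_SIGBUS };

/* Toy stand-in for per-page state tracked on behalf of guest_memfd. */
struct toy_gmem {
	bool poisoned[TOY_NR_PAGES];
	bool mapped[TOY_NR_PAGES];
};

/* Error discovery: only record that the page is in trouble. */
static void toy_mark_poison(struct toy_gmem *g, size_t idx)
{
	assert(idx < TOY_NR_PAGES);
	g->poisoned[idx] = true;
}

/* Error handling: only drop the page from the guest page tables. */
static void toy_unmap(struct toy_gmem *g, size_t idx)
{
	assert(idx < TOY_NR_PAGES);
	g->mapped[idx] = false;
}

/* Deferred handling: the next fault on a poisoned page is refused. */
static enum toy_fault_result toy_fault(struct toy_gmem *g, size_t idx)
{
	assert(idx < TOY_NR_PAGES);
	if (g->poisoned[idx])
		return TOY_FAULT_SIGBUS;
	g->mapped[idx] = true;
	return TOY_FAULT_OK;
}

/* Optional check at handover time: is any page poisoned? */
static bool toy_has_poison(const struct toy_gmem *g)
{
	for (size_t i = 0; i < TOY_NR_PAGES; i++)
		if (g->poisoned[i])
			return true;
	return false;
}
```

In this model, blocking the handover would mean calling toy_has_poison()
before the transfer; not blocking it means the new owner only notices
the problem when toy_fault() hits the poisoned index.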

> It really does seem like guestmemfd is the right place to keep the "stuck
> page" state. If guestmemfd is not tied to a VM and can be re-used, it should be
> the one to decide whether they can be mapped again.

Yup, guest_memfd should get to decide.

> Refcounting on error is
> about preventing return to the page allocator but that is not the only problem.
>

guest_memfd, or perhaps the memory_failure() handler for guest_memfd,
should prevent this return.

> I do think that these threads have gone on far too long. It's probably about
> time to move forward with something even if it's just to have something to
> discuss that doesn't require footnoting so many lore links. So how about we move
> forward with option e as a next step. Does that sound good Yan?
>

Please see my reply to Yan. I'm hoping y'all will agree to something
between options f and g instead.
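
To make the shape of option f concrete, here is a rough userspace C
model of "a flag on the folio AND a list of the pages in trouble". This
is only a sketch under my own assumptions: the struct layout, the
toy_* names and the "imprecise" fallback are invented for illustration
and are not taken from HugeTLB or guest_memfd.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* One entry per 4K page that had an unmapping/splitting error. */
struct toy_bad_page {
	unsigned long pfn;
	struct toy_bad_page *next;
};

/* Per-folio error state: a coarse flag plus a precise list. */
struct toy_folio_err {
	bool has_error;			/* option c: folio-level flag */
	bool imprecise;			/* list incomplete, distrust whole folio */
	struct toy_bad_page *bad;	/* option e: which pfns are bad */
};

/* Record pfn as bad. Returns 0 on success, -1 if the list entry
 * could not be allocated (the flag is still set in that case). */
static int toy_folio_add_error_pfn(struct toy_folio_err *f, unsigned long pfn)
{
	struct toy_bad_page *p;

	assert(f);
	f->has_error = true;
	for (p = f->bad; p; p = p->next)
		if (p->pfn == pfn)
			return 0;	/* already recorded */

	p = malloc(sizeof(*p));
	if (!p) {
		/* No memory in the error path: fall back to the flag alone. */
		f->imprecise = true;
		return -1;
	}
	p->pfn = pfn;
	p->next = f->bad;
	f->bad = p;
	return 0;
}

/* Query used by fault/allocation/truncation paths in this model. */
static bool toy_pfn_is_bad(const struct toy_folio_err *f, unsigned long pfn)
{
	const struct toy_bad_page *p;

	if (!f->has_error)
		return false;
	if (f->imprecise)
		return true;
	for (p = f->bad; p; p = p->next)
		if (p->pfn == pfn)
			return true;
	return false;
}

/* Release the list and clear the state (e.g. when the model is torn down). */
static void toy_folio_err_free(struct toy_folio_err *f)
{
	while (f->bad) {
		struct toy_bad_page *p = f->bad;

		f->bad = p->next;
		free(p);
	}
	f->has_error = false;
	f->imprecise = false;
}
```

The flag keeps the common check cheap, and the list keeps the loss down
to the individual 4K pages when the entry can be allocated. When it
cannot, the model degrades to option c and treats the whole folio as
bad.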

> Ackerley, thank you very much for pulling together this summary.

Thank you for your reviews and suggestions!