Message-ID: <CAEvNRgG40xtobd=ocReuFydJ-4iFwAQrdTPcjsVQPugMaaLi_A@mail.gmail.com>
Date: Wed, 14 Jan 2026 10:45:32 -0800
From: Ackerley Tng <ackerleytng@...gle.com>
To: Sean Christopherson <seanjc@...gle.com>, Yan Zhao <yan.y.zhao@...el.com>
Cc: Vishal Annapurve <vannapurve@...gle.com>, pbonzini@...hat.com, 
	linux-kernel@...r.kernel.org, kvm@...r.kernel.org, x86@...nel.org, 
	rick.p.edgecombe@...el.com, dave.hansen@...el.com, kas@...nel.org, 
	tabba@...gle.com, michael.roth@....com, david@...nel.org, sagis@...gle.com, 
	vbabka@...e.cz, thomas.lendacky@....com, nik.borisov@...e.com, 
	pgonda@...gle.com, fan.du@...el.com, jun.miao@...el.com, 
	francescolavra.fl@...il.com, jgross@...e.com, ira.weiny@...el.com, 
	isaku.yamahata@...el.com, xiaoyao.li@...el.com, kai.huang@...el.com, 
	binbin.wu@...ux.intel.com, chao.p.peng@...el.com, chao.gao@...el.com
Subject: Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory

Sean Christopherson <seanjc@...gle.com> writes:

>>
>> [...snip...]
>>
>> > +100 to being careful, but at the same time I don't think we should get _too_
>> > fixated on the guest_memfd folio size.  E.g. similar to VM_PFNMAP, where there
>> > might not be a folio, if guest_memfd stopped using folios, then the entire
>> > discussion becomes moot.

+1. IMO the usage of folios on the guest_memfd <-> KVM boundary
(kvm_gmem_get_pfn()) is transitional; hopefully we get to a point where
guest_memfd passes KVM a pfn and an order, and no folios.

>> > And as above, the long-standing rule isn't about the implementation details so
>> > much as it is about KVM's behavior.  If the simplest solution to support huge
>> > guest_memfd pages is to decouple the max order from the folio, then so be it.
>> >
>> > That said, I'd very much like to get a sense of the alternatives, because at the
>> > end of the day, guest_memfd needs to track the max mapping sizes _somewhere_,
>> > and naively, tying that to the folio seems like an easy solution.

The upcoming attributes maple tree allows a lookup from a guest_memfd
index to a contiguous range, so the max mapping size (at least
guest_memfd's contribution to the max mapping level, to be augmented by
contributions from lpage_info etc.) would be the contiguous range in the
maple tree containing the index, clamped to guest_memfd page size bounds
(both for huge pages and regular PAGE_SIZE pages).
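
To make the clamping concrete, here's a minimal userspace sketch (all
names are made up for illustration, this isn't actual guest_memfd code):
given a contiguous attribute range around an index, pick the largest
aligned order that fits inside the range, capped by the guest_memfd page
size bound.

```c
#include <stddef.h>

/*
 * Hypothetical sketch: the attributes lookup says indices
 * [range_start, range_end) share the same attributes; return the
 * largest mapping order usable at @index, capped at @max_order
 * (guest_memfd's page size bound).
 */
static int gmem_max_mapping_order(size_t index, size_t range_start,
				  size_t range_end, int max_order)
{
	int order;

	for (order = max_order; order > 0; order--) {
		size_t nr = (size_t)1 << order;
		size_t base = index & ~(nr - 1);

		/* The whole aligned block must fit inside the range. */
		if (base >= range_start && base + nr <= range_end)
			return order;
	}
	return 0;
}
```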

The lookup complexity is mainly the maple tree lookup complexity. This
lookup happens on mapping and on trying to recover the largest mapping
level, neither of which should happen very often, so I think this should
be pretty good for now.

This max mapping size is currently memoized as the folio size with all
the folio-splitting work, but memoizing it in a folio is expensive
(struct pages/folios are big). Hopefully guest_memfd gets to a point
where it also supports non-struct-page-backed memory, which would save
us a bunch more memory.

>>
>> [...snip...]
>>
>> So, out of curiosity, do you know why linux kernel needs to unmap mappings from
>> both primary and secondary MMUs, and check folio refcount before performing
>> folio splitting?
>
> Because it's a straightforward rule for the primary MMU.  Similar to guest_memfd,
> if something is going through the effort of splitting a folio, then odds are very,
> very good that the new folios can't be safely mapped as a contiguous hugepage.
> Limiting mapping sizes to folios makes the rules/behavior straightfoward for core
> MM to implement, and for drivers/users to understand.
>
> Again like guest_memfd, there needs to be _some_ way for a driver/filesystem to
> communicate the maximum mapping size; folios are the "currency" for doing so.
>
> And then for edge cases that want to map a split folio as a hugepage (if any such
> edge cases exist), thus take on the responsibility of managing the lifecycle of
> the mappings, VM_PFNMAP and vmf_insert_pfn() provide the necessary functionality.
>

Here's my understanding, hope it helps: there might also be a
practical/simpler reason for first unmapping, then checking refcounts,
and then splitting folios, and guest_memfd does roughly the same thing.

Folio splitting races with lots of other things in the kernel, and the
folio lock isn't super useful because the lock itself is going to be
split up.

Folio splitting wants all users to stop using this folio, so one big
source of users is mappings. Hence, get those mappers (both primary and
secondary MMUs) to unmap.

Core-mm-managed mappings take a refcount, so those refcounts go away on
unmap. Of the secondary MMU notifier users, KVM doesn't take a refcount,
but KVM does unmap as requested, so it still falls in line with "stop
using this folio".

I think the refcount check wouldn't actually be necessary if all users
of folios stopped using the folio on request (via mmu notifiers or
otherwise). Unfortunately, there are users other than mappers, and the
best way to find them is to check the refcount. The refcount check asks
"how many other users are left?", and if the number of users is as
expected (just the filemap, or whatever else is expected), then
splitting can go ahead, since the splitting code is then confident the
remaining users won't try to use the folio metadata while the split is
happening.
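
As a toy model of that refcount question (names made up for
illustration): the caller accounts for the references it knows about,
and any extra reference means an unknown user that might touch folio
metadata mid-split.

```c
#include <stdbool.h>

/*
 * Hypothetical model: @expected is what the caller accounts for
 * (e.g. the filemap reference plus its own temporary reference).
 */
struct folio_model {
	int refcount;
};

static bool safe_to_split(const struct folio_model *folio, int expected)
{
	/*
	 * Any extra reference is an unknown user (e.g. a pin) that
	 * might use the folio metadata mid-split, so refuse.
	 */
	return folio->refcount == expected;
}
```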


guest_memfd does a modified version of that on shared-to-private
conversions. guest_memfd unmaps from host userspace page tables for the
same reason, mainly to tell all the host users to stop using the folio.
The unmapping also triggers mmu notifiers, so the stage 2 mappings go
away too (TBD whether this should be skipped), and that's okay because
these are shared pages: guest accesses will just fault them back in, so
it doesn't break guests.

At this point all the mappers are gone, and guest_memfd checks refcounts
to make sure that guest_memfd itself is the only remaining user of the
folio. If the refcount is as expected, guest_memfd can confidently
proceed with splitting folios, since other folio accesses are locked out
by the filemap invalidate lock.

The one main guest_memfd folio user that won't go away on an unmap
request is a folio pinned for IOMMU access. In that case, guest_memfd
fails the conversion and returns an error to userspace so userspace can
sort out the IOMMU unpinning.
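
The whole sequence might be sketched like this (every name here is
hypothetical; this models only the ordering described above, not real
guest_memfd code):

```c
#include <errno.h>

#define GMEM_EXPECTED_REFS 1	/* guest_memfd's own (filemap) reference */

struct gmem_folio {
	int refcount;
	int mapped;	/* nonzero while host userspace mappings exist */
	int order;
};

static void gmem_unmap(struct gmem_folio *f)
{
	/*
	 * Unmapping from host page tables also fires mmu notifiers,
	 * so secondary-MMU (stage 2) mappings go away as well.
	 */
	if (f->mapped) {
		f->mapped = 0;
		f->refcount--;	/* the mapping's reference is dropped */
	}
}

static int gmem_convert_to_private(struct gmem_folio *f)
{
	gmem_unmap(f);

	/*
	 * A leftover reference is a user that didn't go away on unmap,
	 * e.g. a pin for IOMMU access: fail the conversion and let
	 * userspace sort out the unpinning.
	 */
	if (f->refcount != GMEM_EXPECTED_REFS)
		return -EBUSY;

	/*
	 * Safe to split now; in the real code, concurrent folio
	 * accesses are locked out by the filemap invalidate lock.
	 */
	f->order = 0;
	return 0;
}
```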


As for private-to-shared conversions, folio merging would require the
same guarantee that nobody else is using the folios (the folio
metadata). guest_memfd skips that check because for private memory, KVM
is the only other user, and guest_memfd knows KVM doesn't use folio
metadata once the memory is mapped for the guest.

>>
>> [...snip...]
>>
