Message-ID: <aWRW51ckW2pxmAlK@yzhao56-desk.sh.intel.com>
Date: Mon, 12 Jan 2026 10:12:11 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Ackerley Tng <ackerleytng@...gle.com>, Vishal Annapurve
<vannapurve@...gle.com>, Sean Christopherson <seanjc@...gle.com>,
<pbonzini@...hat.com>, <linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>,
<x86@...nel.org>, <rick.p.edgecombe@...el.com>, <dave.hansen@...el.com>,
<kas@...nel.org>, <tabba@...gle.com>, <michael.roth@....com>,
<david@...nel.org>, <sagis@...gle.com>, <vbabka@...e.cz>,
<thomas.lendacky@....com>, <nik.borisov@...e.com>, <pgonda@...gle.com>,
<fan.du@...el.com>, <jun.miao@...el.com>, <francescolavra.fl@...il.com>,
<jgross@...e.com>, <ira.weiny@...el.com>, <isaku.yamahata@...el.com>,
<xiaoyao.li@...el.com>, <kai.huang@...el.com>, <binbin.wu@...ux.intel.com>,
<chao.p.peng@...el.com>, <chao.gao@...el.com>
Subject: Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
On Mon, Jan 12, 2026 at 09:39:39AM +0800, Yan Zhao wrote:
> On Fri, Jan 09, 2026 at 10:07:00AM -0800, Ackerley Tng wrote:
> > Vishal Annapurve <vannapurve@...gle.com> writes:
> >
> > > On Fri, Jan 9, 2026 at 1:21 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > >>
> > >> On Thu, Jan 08, 2026 at 12:11:14PM -0800, Ackerley Tng wrote:
> > >> > Yan Zhao <yan.y.zhao@...el.com> writes:
> > >> >
> > >> > > On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote:
> > >> > >> On Tue, Jan 06, 2026, Ackerley Tng wrote:
> > >> > >> > Sean Christopherson <seanjc@...gle.com> writes:
> > >> > >> >
> > >> > >> > > On Tue, Jan 06, 2026, Ackerley Tng wrote:
> > >> > >> > >> Vishal Annapurve <vannapurve@...gle.com> writes:
> > >> > >> > >>
> > >> > >> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > >> > >> > >> >>
> > >> > >> > >> >> - EPT mapping size and folio size
> > >> > >> > >> >>
> > >> > >> > >> >> This series is built upon the rule in KVM that the mapping size in the
> > >> > >> > >> >> KVM-managed secondary MMU is no larger than the backend folio size.
> > >> > >> > >> >>
> > >> > >> > >>
> > >> > >> > >> I'm not familiar with this rule and would like to find out more. Why is
> > >> > >> > >> this rule imposed?
> > >> > >> > >
> > >> > >> > > Because it's the only sane way to safely map memory into the guest? :-D
> > >> > >> > >
> > >> > >> > >> Is this rule there just because traditionally folio sizes also define the
> > >> > >> > >> limit of contiguity, and so the mapping size must not be greater than folio
> > >> > >> > >> size in case the block of memory represented by the folio is not contiguous?
> > >> > >> > >
> > >> > >> > > Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still
> > >> > >> > > is) strictly bound by the host mapping size. That handles contiguous addresses,
> > >> > >> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
> > >> > >> > >
> > >> > >> > >> In guest_memfd's case, even if the folio is split (just for refcount
> > >> > >> tracking purposes on private-to-shared conversion), the memory is still
> > >> > >> > >> contiguous up to the original folio's size. Will the contiguity address
> > >> > >> > >> the concerns?
> > >> > >> > >
> > >> > >> > > Not really? Why would the folio be split if the memory _and its attributes_ are
> > >> > >> > > fully contiguous? If the attributes are mixed, KVM must not create a mapping
> > >> > >> > > spanning mixed ranges, i.e. with multiple folios.
> > >> > >> >
> > >> > >> > The folio can be split if any (or all) of the pages in a huge page range
> > >> > >> > are shared (in the CoCo sense). So in a 1G block of memory, even if the
> > >> > >> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio
> > >> > >> > would be split, and the split folios are necessary for tracking users of
> > >> > >> > shared pages using struct page refcounts.
> > >> > >>
> > >> > >> Ahh, that's what the refcounting was referring to. Gotcha.
> > >> > >>
> > >> > >> > However the split folios in that 1G range are still fully contiguous.
> > >> > >> >
> > >> > >> > The process of conversion will split the EPT entries soon after the
> > >> > >> > folios are split so the rule remains upheld.
> > >> >
> > >> > Correction here: If we go with splitting from 1G to 4K uniformly on
> > >> > sharing, only the EPT entries around the shared 4K folio will be split,
> > >> > so many of the EPT entries will remain at 2M level even though the
> > >> > folios are 4K sized. This would last beyond the conversion
> > >> > process.
> > >> >
> > >> > > Overall, I don't think allowing folios smaller than the mappings while
> > >> > > conversion is in progress brings enough benefit.
> > >> > >
> > >> >
> > >> > I'll look into making the restructuring process always succeed, but off
> > >> > the top of my head that's hard because
> > >> >
> > >> > 1. HugeTLB Vmemmap Optimization code would have to be refactored to
> > >> > use pre-allocated pages, which is refactoring deep in HugeTLB code
> > >> >
> > >> > 2. If we want to split non-uniformly such that only the folios that are
> > >> > shared are 4K, and the remaining folios are as large as possible (PMD
> > >> > sized as much as possible), it gets complex to figure out how many
> > >> > pages to allocate ahead of time.
> > >> >
> > >> > So it's complex and will probably delay HugeTLB+conversion support even
> > >> > more!
> > >> >
> > >> > > Cons:
> > >> > > (1) TDX's zapping callback has no idea whether the zapping is caused by an
> > >> > > in-progress private-to-shared conversion or other reasons. It also has no
> > >> > > idea if the attributes of the underlying folios remain unchanged during an
> > >> > > in-progress private-to-shared conversion. Even if the assertion Ackerley
> > >> > > mentioned is true, it's not easy to drop the sanity checks in TDX's zapping
> > >> > > callback for in-progress private-to-shared conversion alone (which would
> > >> > > increase TDX's dependency on guest_memfd's specific implementation even if
> > >> > > it's feasible).
> > >> > >
> > >> > > Removing the sanity checks entirely in TDX's zapping callback is confusing
> > >> > > and would reflect a bad/false expectation on KVM's part -- what if a huge folio is
> > >> > > incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or
> > >> > > others) in other conditions? And then do we still need the check in TDX's
> > >> > > mapping callback? If not, does it mean TDX huge pages can stop relying on
> > >> > > guest_memfd's ability to allocate huge folios, as KVM could still create
> > >> > > huge mappings as long as small folios are physically contiguous with
> > >> > > homogeneous memory attributes?
> > >> > >
> > >> > > (2) Allowing folios smaller than the mapping would require splitting S-EPT in
> > >> > > kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the
> > >> > > invalidate lock held in __kvm_gmem_set_attributes() could guard against
> > >> > > concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks
> > >> > > error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially).
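
To make (2) above concrete, the memory-failure path would need something
like the untested sketch below if folios may be smaller than the mapping.
The split helper and the argument lists are invented for illustration;
only the ordering (split before zap) is the point here, not the real
guest_memfd code.

static int kvm_gmem_error_folio(struct address_space *mapping,
                                struct folio *folio)
{
        pgoff_t start = folio->index;
        pgoff_t end = start + folio_nr_pages(folio);
        int ret;

        /*
         * With folios possibly smaller than the S-EPT mapping, the huge
         * mapping must be split first; otherwise the zap below would tear
         * down more than the failed folio's range.
         */
        ret = kvm_split_private_mappings(mapping, start, end);  /* hypothetical */
        if (ret)
                return ret;     /* the error path itself can now fail */

        kvm_gmem_zap(mapping, start, end);      /* argument list assumed */
        return 0;
}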
> > >> > >
> > >> >
> > >> > I think the central question I have among all the above is what TDX
> > >> > needs to actually care about (putting aside KVM's folio size/memory
> > >> > contiguity vs mapping level rule for a while).
> > >> >
> > >> > I think TDX code can check what it cares about (if required to aid
> > >> > debugging, as Dave suggested). Does TDX actually care about folio sizes,
> > >> > or does it actually care about memory contiguity and alignment?
> > >> TDX cares about memory contiguity. A single folio ensures memory contiguity.
> > >
> > > In this slightly unusual case, I think the guarantee needed here is
> > > that as long as a range is mapped into SEPT entries, guest_memfd
> > > ensures that the complete range stays private.
> > >
> > > i.e. I think it should be safe to rely on guest_memfd here,
> > > irrespective of the folio sizes:
> > > 1) KVM TDX stack should be able to reclaim the complete range when unmapping.
> > > 2) KVM TDX stack can assume that as long as memory is mapped in SEPT
> > > entries, guest_memfd will not let host userspace mappings access
> > > guest private memory.
> > >
> > >>
> > >> Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer
> > >> reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a
> > >> contiguous range larger than the page's folio range.
> > >
> > > What's the issue with passing the (struct page*, unsigned long nr_pages) pair?
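
For reference, the pair-based variant would presumably boil down to a
per-page loop like the sketch below. The wrapper name and error handling
are made up, and tdh_phymem_page_wbinvd_hkid() is assumed to keep its
current one-page-at-a-time form:

static int tdx_wbinvd_contig_range(u64 hkid, struct page *page,
                                   unsigned long nr_pages)
{
        unsigned long i;
        u64 err;

        /*
         * The pages stay physically contiguous while guest_memfd owns
         * them, even if they now belong to several split folios.
         */
        for (i = 0; i < nr_pages; i++) {
                err = tdh_phymem_page_wbinvd_hkid(hkid, page + i);
                if (err)
                        return -EIO;
        }
        return 0;
}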
> > >
> > >>
> > >> Additionally, we don't split private mappings in kvm_gmem_error_folio().
> > >> If smaller folios are allowed, splitting private mappings is required there.
> >
> > It was discussed before that for memory failure handling, we will want
> > to split huge pages; we will get to it! The trouble is that guest_memfd
> > took the page from HugeTLB (unlike the buddy allocator or HugeTLB itself,
> > which manage memory from the ground up), so we'll still need to figure
> > out whether it's okay to let
> > HugeTLB deal with it when freeing, and when I last looked, HugeTLB
> > doesn't actually deal with poisoned folios on freeing, so there's more
> > work to do on the HugeTLB side.
> >
> > This is a good point, although IIUC it is a separate issue. The need to
> > split private mappings on memory failure is not for confidentiality in
> > the TDX sense but to ensure that the guest doesn't use the failed
> > memory. In that case, contiguity is broken by the failed memory. The
> > folio is split, the private EPTs are split. The folio size should still
> > not be checked in TDX code. guest_memfd knows contiguity got broken, so
> > guest_memfd calls TDX code to split the EPTs.
>
> Hmm, maybe the key is that we need to split S-EPT first before allowing
> guest_memfd to split the backend folio. If splitting S-EPT fails, don't do the
> folio splitting.
>
> This is better than performing folio splitting while it's mapped as huge in
> S-EPT, since in the latter case, kvm_gmem_error_folio() needs to try to split
> S-EPT. If the S-EPT splitting fails, falling back to zapping the huge mapping in
> kvm_gmem_error_folio() would still trigger the over-zapping issue.
>
> The primary MMU follows the rule of unmapping a folio before splitting,
> truncating, or migrating it. For S-EPT, considering the cost of zapping
> more ranges than necessary, maybe a trade-off is to always split S-EPT before
> allowing backend folio splitting.
>
> Does this look good to you?
So, the flow of converting 0-4KB from private to shared in a 1GB folio in
guest_memfd is:
a. If guest_memfd splits 1GB to 2MB first:
1. split S-EPT to 4KB for the 0-2MB range, split S-EPT to 2MB for the remaining range.
2. split folio
3. zap the 0-4KB mapping.
b. If guest_memfd splits 1GB to 4KB directly:
1. split S-EPT to 4KB for the 0-2MB range, split S-EPT to 4KB for the remaining range.
2. split folio
3. zap the 0-4KB mapping.
The flow of converting 0-2MB from private to shared in a 1GB folio in
guest_memfd is:
a. If guest_memfd splits 1GB to 2MB first:
1. split S-EPT to 4KB for the 0-2MB range, split S-EPT to 2MB for the remaining range.
2. split folio
3. zap the 0-2MB mapping.
b. If guest_memfd splits 1GB to 4KB directly:
1. split S-EPT to 4KB for the 0-2MB range, split S-EPT to 4KB for the remaining range.
2. split folio
3. zap the 0-2MB mapping.
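
In pseudo-C, the ordering proposed above is roughly as below. All helper
names (kvm_split_private_mappings(), gmem_split_folio(),
kvm_gmem_zap_range()) are made up for illustration; this is only a sketch
of the control flow, not real guest_memfd/KVM code.

/* Sketch only: split S-EPT first, then the folio, then zap the subrange. */
static int gmem_convert_to_shared(gfn_t gfn, unsigned long nr_pages)
{
        int ret;

        /* 1. Split S-EPT down to the conversion granularity first. */
        ret = kvm_split_private_mappings(gfn, nr_pages);
        if (ret)
                return ret;             /* backend folio left untouched */

        /* 2. Only then split the backing folio. */
        ret = gmem_split_folio(gfn, nr_pages);
        if (ret)
                return ret;             /* leftover split S-EPT is harmless */

        /* 3. Zap only the subrange actually converted to shared. */
        kvm_gmem_zap_range(gfn, nr_pages);
        return 0;
}
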
> So, to convert a 2MB range from private to shared, even though guest_memfd will
> eventually zap the entire 2MB range, do the S-EPT splitting first! If it fails,
> don't split the backend folio.
>
> Even if folio splitting may fail later, it just leaves split S-EPT mappings,
> which matters little, especially after we support S-EPT promotion later.
>
> The benefit is that we don't need to worry even in the case where guest_memfd
> splits a 1GB folio directly to 4KB granularity, which could otherwise introduce
> the over-zapping issue later.
>
> > > Yes, I believe splitting private mappings will be invoked to ensure
> > > that the whole huge folio is not unmapped from KVM due to an error on
> > > just a 4K page. Is that a problem?
> > >
> > > If splitting fails, the implementation can fall back to completely
> > > zapping the folio range.
> > >
> > >> (e.g., after splitting a 1GB folio into 4KB folios while 2MB mappings remain.
> > >> Also, is it possible for splitting a huge folio to fail partially, without
> > >> merging the huge folio back or further zapping?).
> >
> > The current stance is to allow splitting failures and not undo them, so
> > there's no merging back to fix a splitting failure. (Not set in stone
> > yet; I think merging back could turn out to
> > be a requirement from the mm side, which comes with more complexity in
> > restructuring logic.)
> >
> > If it is not merged back on a split failure, the pages are still
> > contiguous; the pages are guaranteed contiguous while they are owned by
> > guest_memfd (even in the case of memory failure, if I get my way :P), so
> > TDX can still trust that.
> >
> > I think you're worried that on split failure some folios are split while
> > the private EPTs for them are not. But the memory behind those unsplit
> > private EPTs is still contiguous, and on split failure we quit early, so
> > guest_memfd still tracks the ranges as private.
> >
> > Privateness and contiguity are preserved so I think TDX should be good
> > with that? The TD can still run. IIUC it is part of the plan that on
> > splitting failure, the conversion ioctl returns failure and the guest is
> > informed of the conversion failure so that it can do whatever it needs to
> > clean up.
> As above, what about the idea of always requesting KVM to split S-EPT before
> guest_memfd splits a folio?
>
> I think splitting S-EPT first is already required for all cases anyway, except
> for the private-to-shared conversion of a full 2MB or 1GB range.
>
> Requesting S-EPT splitting when it's about to do folio splitting is better than
> leaving huge mappings with split folios and having to patch things up here and
> there, just to make the single case of private-to-shared conversion easier.
>
> > > Yes, splitting can fail partially, but guest_memfd will not make the
> > > ranges available to host userspace and derivatives until:
> > > 1) The complete range to be converted is split to 4K granularity.
> > > 2) The complete range to be converted is zapped from KVM EPT mappings.
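
IIUC, in code that gate would look roughly like the sketch below before
guest_memfd exposes the range to host userspace (all names are invented
for illustration, just to confirm my understanding):

static bool gmem_range_ready_for_shared(struct gmem_range *range)
{
        /* 1) every folio in the to-be-converted range is split to 4K ... */
        if (!gmem_range_split_to_4k(range))
                return false;

        /* 2) ... and no EPT mapping covers any part of the range anymore. */
        if (kvm_range_still_mapped(range))
                return false;

        return true;
}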
> > >
> > >> Not sure if there're other edge cases we're still missing.
> > >>
> >
> > As you said, at the core TDX is concerned about contiguity of the memory
> > ranges (start_addr, length) that it was given. Contiguity is guaranteed
> > by guest_memfd while the folio is in guest_memfd ownership up to the
> > boundaries of the original folio, before any restructuring. So if we're
> > looking for edge cases, I think they would be around
> > truncation. Can't think of anything now.
> Potentially, folio migration, if we support it in the future.
>
> > (guest_memfd will also ensure that truncation of anything less than the
> > original, pre-restructuring size of the folio is blocked, regardless
> > of the current size of the folio)
> > >> > Separately, KVM could also enforce the folio size/memory contiguity vs
> > >> > mapping level rule, but TDX code shouldn't enforce KVM's rules. So if
> > >> > the check is deemed necessary, it still shouldn't be in TDX code, I
> > >> > think.
> > >> >
> > >> > > Pro: Preventing the zapping of private memory until conversion succeeds is good.
> > >> > >
> > >> > > However, could we achieve this benefit in other ways? For example, is it
> > >> > > possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring
> > >> > > split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap
> > >> > > optimization? (hugetlb_vmemmap conversion is super slow according to my
> > >> > > observation and I always disable it).
> > >> >
> > >> > HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a
> > >> > huge VM, multiplied by a large number of hosts, this is not a trivial
> > >> > amount of memory. It's one of the key reasons why we are using HugeTLB
> > >> > in guest_memfd in the first place, other than to be able to get high
> > >> > level page table mappings. We want this in production.
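
(For reference, and if I calculate correctly, that figure matches the
struct page overhead: 64 bytes of struct page per 4KiB page is 64/4096,
i.e. ~1.6% of memory, and HVO frees all but one of the vmemmap pages
backing a huge folio, so nearly all of that overhead is reclaimed.)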
> > >> >
> > >> > > Or pre-allocation for
> > >> > > vmemmap_remap_alloc()?
> > >> > >
> > >> >
> > >> > Will investigate if this is possible as mentioned above. Thanks for the
> > >> > suggestion again!
> > >> >
> > >> > > Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping
> > >> > > private memory before conversion succeeds is still better than introducing the
> > >> > > mess between folio size and mapping size.
> > >> > >
> > >> > >> > I guess perhaps the question is, is it okay if the folios are smaller
> > >> > >> > than the mapping while conversion is in progress? Does the order matter
> > >> > >> > (split page table entries first vs split folios first)?
> > >> > >>
> > >> > >> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogeneous is
> > >> > >> conceptually totally fine, i.e. I'm not totally opposed to adding support for
> > >> > >> mapping multiple guest_memfd folios with a single hugepage. As to whether we
> > >> > >> do (a) nothing, (b) change the refcounting, or (c) add support for mapping
> > >> > >> multiple folios in one hugepage, probably comes down to which option provides "good
> > >> > >> enough" performance without incurring too much complexity.
> > >> >
> >