Message-ID: <aWRQ2xyc9coA6aCg@yzhao56-desk.sh.intel.com>
Date: Mon, 12 Jan 2026 09:39:39 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Ackerley Tng <ackerleytng@...gle.com>
CC: Vishal Annapurve <vannapurve@...gle.com>, Sean Christopherson
<seanjc@...gle.com>, <pbonzini@...hat.com>, <linux-kernel@...r.kernel.org>,
<kvm@...r.kernel.org>, <x86@...nel.org>, <rick.p.edgecombe@...el.com>,
<dave.hansen@...el.com>, <kas@...nel.org>, <tabba@...gle.com>,
<michael.roth@....com>, <david@...nel.org>, <sagis@...gle.com>,
<vbabka@...e.cz>, <thomas.lendacky@....com>, <nik.borisov@...e.com>,
<pgonda@...gle.com>, <fan.du@...el.com>, <jun.miao@...el.com>,
<francescolavra.fl@...il.com>, <jgross@...e.com>, <ira.weiny@...el.com>,
<isaku.yamahata@...el.com>, <xiaoyao.li@...el.com>, <kai.huang@...el.com>,
<binbin.wu@...ux.intel.com>, <chao.p.peng@...el.com>, <chao.gao@...el.com>
Subject: Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
On Fri, Jan 09, 2026 at 10:07:00AM -0800, Ackerley Tng wrote:
> Vishal Annapurve <vannapurve@...gle.com> writes:
>
> > On Fri, Jan 9, 2026 at 1:21 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
> >>
> >> On Thu, Jan 08, 2026 at 12:11:14PM -0800, Ackerley Tng wrote:
> >> > Yan Zhao <yan.y.zhao@...el.com> writes:
> >> >
> >> > > On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote:
> >> > >> On Tue, Jan 06, 2026, Ackerley Tng wrote:
> >> > >> > Sean Christopherson <seanjc@...gle.com> writes:
> >> > >> >
> >> > >> > > On Tue, Jan 06, 2026, Ackerley Tng wrote:
> >> > >> > >> Vishal Annapurve <vannapurve@...gle.com> writes:
> >> > >> > >>
> >> > >> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
> >> > >> > >> >>
> >> > >> > >> >> - EPT mapping size and folio size
> >> > >> > >> >>
> >> > >> > >> >> This series is built upon the rule in KVM that the mapping size in the
> >> > >> > >> >> KVM-managed secondary MMU is no larger than the backend folio size.
> >> > >> > >> >>
> >> > >> > >>
> >> > >> > >> I'm not familiar with this rule and would like to find out more. Why is
> >> > >> > >> this rule imposed?
> >> > >> > >
> >> > >> > > Because it's the only sane way to safely map memory into the guest? :-D
> >> > >> > >
> >> > >> > >> Is this rule there just because traditionally folio sizes also define the
> >> > >> > >> limit of contiguity, and so the mapping size must not be greater than folio
> >> > >> > >> size in case the block of memory represented by the folio is not contiguous?
> >> > >> > >
> >> > >> > > Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still
> >> > >> > > is) strictly bound by the host mapping size. That handles contiguous addresses,
> >> > >> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
> >> > >> > >
> >> > >> > >> In guest_memfd's case, even if the folio is split (just for refcount
> >> > >> > >> tracking purposes on private to shared conversion), the memory is still
> >> > >> > >> contiguous up to the original folio's size. Will the contiguity address
> >> > >> > >> the concerns?
> >> > >> > >
> >> > >> > > Not really? Why would the folio be split if the memory _and its attributes_ are
> >> > >> > > fully contiguous? If the attributes are mixed, KVM must not create a mapping
> >> > >> > > spanning mixed ranges, i.e. with multiple folios.
> >> > >> >
> >> > >> > The folio can be split if any (or all) of the pages in a huge page range
> >> > >> > are shared (in the CoCo sense). So in a 1G block of memory, even if the
> >> > >> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio
> >> > >> > would be split, and the split folios are necessary for tracking users of
> >> > >> > shared pages using struct page refcounts.
> >> > >>
> >> > >> Ahh, that's what the refcounting was referring to. Gotcha.
> >> > >>
> >> > >> > However the split folios in that 1G range are still fully contiguous.
> >> > >> >
> >> > >> > The process of conversion will split the EPT entries soon after the
> >> > >> > folios are split so the rule remains upheld.
> >> >
> >> > Correction here: If we go with splitting from 1G to 4K uniformly on
> >> > sharing, only the EPT entries around the shared 4K folio will have their
> >> > page table entries split, so many of the EPT entries will be at 2M level
> >> > though the folios are 4K sized. This would last beyond the conversion
> >> > process.
> >> >
> >> > > Overall, I don't think allowing folios smaller than the mappings while
> >> > > conversion is in progress brings enough benefit.
> >> > >
> >> >
> >> > I'll look into making the restructuring process always succeed, but off
> >> > the top of my head that's hard because
> >> >
> >> > 1. HugeTLB Vmemmap Optimization code would have to be refactored to
> >> > use pre-allocated pages, which is refactoring deep in HugeTLB code
> >> >
> >> > 2. If we want to split non-uniformly such that only the folios that are
> >> > shared are 4K, and the remaining folios are as large as possible (PMD
> >> > sized as much as possible), it gets complex to figure out how many
> >> > pages to allocate ahead of time.
> >> >
> >> > So it's complex and will probably delay HugeTLB+conversion support even
> >> > more!
> >> >
> >> > > Cons:
> >> > > (1) TDX's zapping callback has no idea whether the zapping is caused by an
> >> > > in-progress private-to-shared conversion or other reasons. It also has no
> >> > > idea if the attributes of the underlying folios remain unchanged during an
> >> > > in-progress private-to-shared conversion. Even if the assertion Ackerley
> >> > > mentioned is true, it's not easy to drop the sanity checks in TDX's zapping
> >> > > callback for in-progress private-to-shared conversion alone (which would
> >> > > increase TDX's dependency on guest_memfd's specific implementation even if
> >> > > it's feasible).
> >> > >
> >> > > Removing the sanity checks entirely in TDX's zapping callback is confusing
> >> > > and would show a bad/false expectation from KVM -- what if a huge folio is
> >> > > incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or
> >> > > others) in other conditions? And then do we still need the check in TDX's
> >> > > mapping callback? If not, does it mean TDX huge pages can stop relying on
> >> > > guest_memfd's ability to allocate huge folios, as KVM could still create
> >> > > huge mappings as long as small folios are physically contiguous with
> >> > > homogeneous memory attributes?
> >> > >
> >> > > (2) Allowing folios smaller than the mapping would require splitting S-EPT in
> >> > > kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the
> >> > > invalidate lock held in __kvm_gmem_set_attributes() could guard against
> >> > > concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks
> >> > > error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially).
> >> > >
> >> >
> >> > I think the central question I have among all the above is what TDX
> >> > needs to actually care about (putting aside KVM's folio size/memory
> >> > contiguity vs mapping level rule for a while).
> >> >
> >> > I think TDX code can check what it cares about (if required to aid
> >> > debugging, as Dave suggested). Does TDX actually care about folio sizes,
> >> > or does it actually care about memory contiguity and alignment?
> >> TDX cares about memory contiguity. A single folio ensures memory contiguity.
> >
> > In this slightly unusual case, I think the guarantee needed here is
> > that as long as a range is mapped into SEPT entries, guest_memfd
> > ensures that the complete range stays private.
> >
> > i.e. I think it should be safe to rely on guest_memfd here,
> > irrespective of the folio sizes:
> > 1) KVM TDX stack should be able to reclaim the complete range when unmapping.
> > 2) KVM TDX stack can assume that as long as memory is mapped in SEPT
> > entries, guest_memfd will not let host userspace mappings access
> > guest private memory.
> >
> >>
> >> Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer
> >> reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a
> >> contiguous range larger than the page's folio range.
> >
> > What's the issue with passing the (struct page*, unsigned long nr_pages) pair?
> >
> >>
> >> Additionally, we don't split private mappings in kvm_gmem_error_folio().
> >> If smaller folios are allowed, splitting private mapping is required there.
>
> It was discussed before that for memory failure handling, we will want
> to split huge pages, we will get to it! The trouble is that guest_memfd
> took the page from HugeTLB (unlike buddy or HugeTLB which manages memory
> from the ground up), so we'll still need to figure out whether it's okay to let
> HugeTLB deal with it when freeing, and when I last looked, HugeTLB
> doesn't actually deal with poisoned folios on freeing, so there's more
> work to do on the HugeTLB side.
>
> This is a good point, although IIUC it is a separate issue. The need to
> split private mappings on memory failure is not for confidentiality in
> the TDX sense but to ensure that the guest doesn't use the failed
> memory. In that case, contiguity is broken by the failed memory. The
> folio is split, the private EPTs are split. The folio size should still
> not be checked in TDX code. guest_memfd knows contiguity got broken, so
> guest_memfd calls TDX code to split the EPTs.

Hmm, maybe the key is that we need to split the S-EPT before allowing
guest_memfd to split the backend folio. If the S-EPT splitting fails, don't do
the folio splitting.

This is better than performing the folio splitting while the folio is still
mapped as huge in the S-EPT, since in the latter case kvm_gmem_error_folio()
needs to try to split the S-EPT. If that S-EPT splitting fails, falling back
to zapping the huge mapping in kvm_gmem_error_folio() would still trigger the
over-zapping issue.

The primary MMU follows the rule of unmapping a folio before splitting,
truncating, or migrating it. For the S-EPT, considering the cost of zapping
more ranges than necessary, maybe a trade-off is to always split the S-EPT
before allowing the backend folio splitting.

Does this look good to you?

So, to convert a 2MB range from private to shared, even though guest_memfd
will eventually zap the entire 2MB range, do the S-EPT splitting first! If it
fails, don't split the backend folio.

Even if the folio splitting fails later, that just leaves split S-EPT mappings
behind, which matters little, especially once we support S-EPT promotion.

The benefit is that we don't need to worry about the case where guest_memfd
splits a 1GB folio directly to 4KB granularity and potentially introduces the
over-zapping issue later.
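
Just to illustrate the ordering I have in mind, a rough sketch below (the
function names gmem_split_sept_range() and gmem_split_folio() are made-up
placeholders for this discussion, not the actual APIs in this series or in
guest_memfd):

/*
 * Illustrative ordering only: split the S-EPT down to the target
 * granularity first, and only touch the backend folio once that succeeded.
 */
static int gmem_convert_to_shared(struct folio *folio, gfn_t gfn,
                                  unsigned int target_order)
{
        int ret;

        /* 1. Placeholder: ask KVM to split the S-EPT covering the range. */
        ret = gmem_split_sept_range(gfn, folio_nr_pages(folio));
        if (ret)
                return ret;     /* Huge folio and huge mapping stay intact. */

        /* 2. Placeholder: split the backend folio. */
        ret = gmem_split_folio(folio, target_order);
        if (ret)
                return ret;     /* Only split S-EPT mappings are left behind. */

        /*
         * 3. Zap the already-split private mappings and make the range
         *    shared; no huge mapping can be over-zapped at this point.
         */
        return 0;
}
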
> > Yes, I believe splitting private mappings will be invoked to ensure
> > that the whole huge folio is not unmapped from KVM due to an error on
> > just a 4K page. Is that a problem?
> >
> > If splitting fails, the implementation can fall back to completely
> > zapping the folio range.
> >
> >> (e.g., after splitting a 1GB folio to 4KB folios with 2MB mappings. Also, is it
> >> possible for splitting a huge folio to fail partially, without merging the huge
> >> folio back or further zapping?).
>
> The current stance is to allow splitting failures and not undo that
> splitting failure, so there's no merge back to fix the splitting
> failure. (Not set in stone yet, I think merging back could turn out to
> be a requirement from the mm side, which comes with more complexity in
> restructuring logic.)
>
> If it is not merged back on a split failure, the pages are still
> contiguous, the pages are guaranteed contiguous while they are owned by
> guest_memfd (even in the case of memory failure, if I get my way :P) so
> TDX can still trust that.
>
> I think you're worried that on split failure some folios are split, but
> the private EPTs for those are not split, but the memory for those
> unsplit private EPTs is still contiguous, and on split failure we quit
> early so guest_memfd still tracks the ranges as private.
>
> Privateness and contiguity are preserved so I think TDX should be good
> with that? The TD can still run. IIUC it is part of the plan that on
> splitting failure, conversion ioctl returns failure, guest is informed
> of conversion failure so that it can do whatever it should do to clean
> up.

As above, what about the idea of always requesting KVM to split the S-EPT
before guest_memfd splits a folio?

I think splitting the S-EPT first is already required for all cases anyway,
except for the private-to-shared conversion of a full 2MB or 1GB range.

Requesting an S-EPT split whenever guest_memfd is about to split a folio is
better than leaving huge mappings on top of split folios and having to patch
things up here and there, just to make the single case of private-to-shared
conversion easier.
> > Yes, splitting can fail partially, but guest_memfd will not make the
> > ranges available to host userspace and derivatives until:
> > 1) The complete range to be converted is split to 4K granularity.
> > 2) The complete range to be converted is zapped from KVM EPT mappings.
> >
> >> Not sure if there're other edge cases we're still missing.
> >>
>
> As you said, at the core TDX is concerned about contiguity of the memory
> ranges (start_addr, length) that it was given. Contiguity is guaranteed
> by guest_memfd while the folio is in guest_memfd ownership up to the
> boundaries of the original folio, before any restructuring. So if we're
> looking for edge cases, I think they would be around
> truncation. Can't think of anything now.
Potentially, folio migration, if we support it in the future.
> (guest_memfd will also ensure truncation of anything less than the
> original size of the folio before restructuring is blocked, regardless
> of the current size of the folio)
> >> > Separately, KVM could also enforce the folio size/memory contiguity vs
> >> > mapping level rule, but TDX code shouldn't enforce KVM's rules. So if
> >> > the check is deemed necessary, it still shouldn't be in TDX code, I
> >> > think.
> >> >
> >> > > Pro: Preventing zapping private memory until conversion is successful is good.
> >> > >
> >> > > However, could we achieve this benefit in other ways? For example, is it
> >> > > possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring
> >> > > split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap
> >> > > optimization? (hugetlb_vmemmap conversion is super slow according to my
> >> > > observation and I always disable it).
> >> >
> >> > HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a
> >> > huge VM, multiplied by a large number of hosts, this is not a trivial
> >> > amount of memory. It's one of the key reasons why we are using HugeTLB
> >> > in guest_memfd in the first place, other than to be able to get high
> >> > level page table mappings. We want this in production.
> >> >
> >> > > Or pre-allocation for
> >> > > vmemmap_remap_alloc()?
> >> > >
> >> >
> >> > Will investigate if this is possible as mentioned above. Thanks for the
> >> > suggestion again!
> >> >
> >> > > Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping
> >> > > private memory before conversion succeeds is still better than introducing the
> >> > > mess between folio size and mapping size.
> >> > >
> >> > >> > I guess perhaps the question is, is it okay if the folios are smaller
> >> > >> > than the mapping while conversion is in progress? Does the order matter
> >> > >> > (split page table entries first vs split folios first)?
> >> > >>
> >> > >> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
> >> > >> conceptually totally fine, i.e. I'm not totally opposed to adding support for
> >> > >> mapping multiple guest_memfd folios with a single hugepage. As to whether we
> >> > >> do (a) nothing, (b) change the refcounting, or (c) add support for mapping
> >> > >> multiple folios in one page, probably comes down to which option provides "good
> >> > >> enough" performance without incurring too much complexity.
> >> >
>