Message-ID: <aWDH3Z/bjA9unACB@yzhao56-desk.sh.intel.com>
Date: Fri, 9 Jan 2026 17:18:21 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Ackerley Tng <ackerleytng@...gle.com>
CC: Sean Christopherson <seanjc@...gle.com>, Vishal Annapurve
<vannapurve@...gle.com>, <pbonzini@...hat.com>,
<linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>, <x86@...nel.org>,
<rick.p.edgecombe@...el.com>, <dave.hansen@...el.com>, <kas@...nel.org>,
<tabba@...gle.com>, <michael.roth@....com>, <david@...nel.org>,
<sagis@...gle.com>, <vbabka@...e.cz>, <thomas.lendacky@....com>,
<nik.borisov@...e.com>, <pgonda@...gle.com>, <fan.du@...el.com>,
<jun.miao@...el.com>, <francescolavra.fl@...il.com>, <jgross@...e.com>,
<ira.weiny@...el.com>, <isaku.yamahata@...el.com>, <xiaoyao.li@...el.com>,
<kai.huang@...el.com>, <binbin.wu@...ux.intel.com>, <chao.p.peng@...el.com>,
<chao.gao@...el.com>
Subject: Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
On Thu, Jan 08, 2026 at 12:11:14PM -0800, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@...el.com> writes:
>
> > On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote:
> >> On Tue, Jan 06, 2026, Ackerley Tng wrote:
> >> > Sean Christopherson <seanjc@...gle.com> writes:
> >> >
> >> > > On Tue, Jan 06, 2026, Ackerley Tng wrote:
> >> > >> Vishal Annapurve <vannapurve@...gle.com> writes:
> >> > >>
> >> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
> >> > >> >>
> >> > >> >> - EPT mapping size and folio size
> >> > >> >>
> >> > >> >> This series is built upon the rule in KVM that the mapping size in the
> >> > >> >> KVM-managed secondary MMU is no larger than the backend folio size.
> >> > >> >>
> >> > >>
> >> > >> I'm not familiar with this rule and would like to find out more. Why is
> >> > >> this rule imposed?
> >> > >
> >> > > Because it's the only sane way to safely map memory into the guest? :-D
> >> > >
> >> > >> Is this rule there just because traditionally folio sizes also define the
> >> > >> limit of contiguity, and so the mapping size must not be greater than folio
> >> > >> size in case the block of memory represented by the folio is not contiguous?
> >> > >
> >> > > Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still
> >> > > is) strictly bound by the host mapping size. That handles contiguous addresses,
> >> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
> >> > >
> >> > >> In guest_memfd's case, even if the folio is split (just for refcount
> >> > >> tracking purposes on private-to-shared conversion), the memory is still
> >> > >> contiguous up to the original folio's size. Will the contiguity address
> >> > >> the concerns?
> >> > >
> >> > > Not really? Why would the folio be split if the memory _and its attributes_ are
> >> > > fully contiguous? If the attributes are mixed, KVM must not create a mapping
> >> > > spanning mixed ranges, i.e. with multiple folios.
> >> >
> >> > The folio can be split if any (or all) of the pages in a huge page range
> >> > are shared (in the CoCo sense). So in a 1G block of memory, even if the
> >> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio
> >> > would be split, and the split folios are necessary for tracking users of
> >> > shared pages using struct page refcounts.
> >>
> >> Ahh, that's what the refcounting was referring to. Gotcha.
> >>
> >> > However the split folios in that 1G range are still fully contiguous.
> >> >
> >> > The process of conversion will split the EPT entries soon after the
> >> > folios are split so the rule remains upheld.
>
> Correction here: If we go with splitting from 1G to 4K uniformly on
> sharing, only the EPT entries around the shared 4K folio will be split,
> so many of the EPT entries will remain at 2M level even though the
> folios are 4K sized. This would last beyond the conversion process.
>
> > Overall, I don't think allowing folios smaller than the mappings while
> > conversion is in progress brings enough benefit.
> >
>
> I'll look into making the restructuring process always succeed, but off
> the top of my head that's hard because
>
> 1. HugeTLB Vmemmap Optimization code would have to be refactored to
> use pre-allocated pages, which is refactoring deep in HugeTLB code
>
> 2. If we want to split non-uniformly such that only the folios that are
> shared are 4K, and the remaining folios are as large as possible (PMD
> sized as much as possible), it gets complex to figure out how many
> pages to allocate ahead of time.
>
> So it's complex and will probably delay HugeTLB+conversion support even
> more!
>
> > Cons:
> > (1) TDX's zapping callback has no idea whether the zapping is caused by an
> > in-progress private-to-shared conversion or other reasons. It also has no
> > idea if the attributes of the underlying folios remain unchanged during an
> > in-progress private-to-shared conversion. Even if the assertion Ackerley
> > mentioned is true, it's not easy to drop the sanity checks in TDX's zapping
> > callback for in-progress private-to-shared conversion alone (which would
> > increase TDX's dependency on guest_memfd's specific implementation even if
> > it's feasible).
> >
> > Removing the sanity checks entirely in TDX's zapping callback is confusing
> > and would show a bad/false expectation from KVM -- what if a huge folio is
> > incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or
> > others) in other conditions? And then do we still need the check in TDX's
> > mapping callback? If not, does it mean TDX huge pages can stop relying on
> > guest_memfd's ability to allocate huge folios, as KVM could still create
> > huge mappings as long as small folios are physically contiguous with
> > homogeneous memory attributes?
> >
> > (2) Allowing folios smaller than the mapping would require splitting S-EPT in
> > kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the
> > invalidate lock held in __kvm_gmem_set_attributes() could guard against
> > concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks
> > error-prone. (This may also apply to kvm_gmem_migrate_folio().)
> >
>
> I think the central question I have among all the above is what TDX
> needs to actually care about (putting aside KVM's folio size/memory
> contiguity vs. mapping level rule for a while).
>
> I think TDX code can check what it cares about (if required to aid
> debugging, as Dave suggested). Does TDX actually care about folio sizes,
> or does it actually care about memory contiguity and alignment?

TDX cares about memory contiguity. A single folio guarantees that contiguity.
Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer
reasonable to pass a "struct page" to tdh_phymem_page_wbinvd_hkid() for a
contiguous range larger than that page's folio, roughly as in the sketch below.
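
A rough, hypothetical sketch of that concern (a fragment, not code from this
series; the helper and the per-page loop are made up for illustration, and the
signature assumed for tdh_phymem_page_wbinvd_hkid() follows my reading of the
base TDX SEAMCALL wrappers):

/*
 * Hypothetical sketch only: how a teardown path might have to flush a
 * mapped range if one S-EPT mapping were allowed to span multiple
 * folios.  tdx_wbinvd_mapped_range() is a made-up helper.
 */
static int tdx_wbinvd_mapped_range(struct page *page, int level, u64 hkid)
{
	unsigned long npages = KVM_PAGES_PER_HPAGE(level);
	unsigned long i;
	u64 err;

	/*
	 * Today the whole mapping is backed by a single folio, so the
	 * range [page, page + npages) is contiguous by construction.  If
	 * the mapping could span several smaller folios, each step of
	 * this loop might cross a folio boundary, and contiguity would
	 * have to be guaranteed by guest_memfd rather than by the folio.
	 */
	for (i = 0; i < npages; i++) {
		err = tdh_phymem_page_wbinvd_hkid(hkid, page + i);
		if (err)
			return -EIO;
	}
	return 0;
}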
Additionally, we don't split private mappings in kvm_gmem_error_folio().
If smaller folios are allowed, splitting private mappings would be required
there (e.g., after a 1GB folio has been split into 4KB folios while the
mappings are still 2MB). Also, is it possible for splitting a huge folio to
fail partially, without merging the huge folio back or further zapping? A
rough sketch of the ordering this would impose is below.
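
(Hypothetical sketch only: "struct gmem_ctx" and kvm_gmem_split_private() are
stand-ins for whatever context kvm_gmem_error_folio() actually derives from the
address_space, and the signature used for kvm_gmem_zap() is illustrative.)

/*
 * If a 2MB private mapping may cover several 4KB folios, the
 * memory-failure path would have to split the mapping down to the
 * poisoned folio before zapping just that folio's range.
 */
static void gmem_error_folio_sketch(struct gmem_ctx *ctx, struct folio *folio)
{
	pgoff_t start = folio->index;
	pgoff_t end = start + folio_nr_pages(folio);

	/* First split any private S-EPT mapping larger than the folio ... */
	kvm_gmem_split_private(ctx, start, end);

	/* ... then zap only the poisoned folio's range. */
	kvm_gmem_zap(ctx, start, end);
}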
Not sure if there are other edge cases we're still missing.
> Separately, KVM could also enforce the folio size/memory contiguity vs
> mapping level rule, but TDX code shouldn't enforce KVM's rules. So if
> the check is deemed necessary, it still shouldn't be in TDX code, I
> think.
>
> > Pro: Preventing zapping private memory until conversion is successful is good.
> >
> > However, could we achieve this benefit in other ways? For example, is it
> > possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring
> > split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap
> > optimization? (hugetlb_vmemmap conversion is super slow according to my
> > observation and I always disable it).
>
> HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a
> huge VM, multiplied by a large number of hosts, this is not a trivial
> amount of memory. It's one of the key reasons why we are using HugeTLB
> in guest_memfd in the first place, other than being able to get
> higher-level page table mappings. We want this in production.
>
> > Or pre-allocation for
> > vmemmap_remap_alloc()?
> >
>
> Will investigate if this is possible as mentioned above. Thanks for the
> suggestion again!
>
> > Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping
> > private memory before conversion succeeds is still better than introducing the
> > mess between folio size and mapping size.
> >
> >> > I guess perhaps the question is, is it okay if the folios are smaller
> >> > than the mapping while conversion is in progress? Does the order matter
> >> > (split page table entries first vs split folios first)?
> >>
> >> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogeneous is
> >> conceptually totally fine, i.e. I'm not totally opposed to adding support for
> >> mapping multiple guest_memfd folios with a single hugepage. As to whether we
> >> do (a) nothing, (b) change the refcounting, or (c) add support for mapping
> >> multiple folios in one page, probably comes down to which option provides "good
> >> enough" performance without incurring too much complexity.
>