Message-ID: <CAGtprH-E1iizdDE5PD9E3UHXJHNiiu2H4du9NkVt6vNAhV=O4g@mail.gmail.com>
Date: Fri, 9 Jan 2026 08:12:46 -0800
From: Vishal Annapurve <vannapurve@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: Ackerley Tng <ackerleytng@...gle.com>, Sean Christopherson <seanjc@...gle.com>, pbonzini@...hat.com, 
	linux-kernel@...r.kernel.org, kvm@...r.kernel.org, x86@...nel.org, 
	rick.p.edgecombe@...el.com, dave.hansen@...el.com, kas@...nel.org, 
	tabba@...gle.com, michael.roth@....com, david@...nel.org, sagis@...gle.com, 
	vbabka@...e.cz, thomas.lendacky@....com, nik.borisov@...e.com, 
	pgonda@...gle.com, fan.du@...el.com, jun.miao@...el.com, 
	francescolavra.fl@...il.com, jgross@...e.com, ira.weiny@...el.com, 
	isaku.yamahata@...el.com, xiaoyao.li@...el.com, kai.huang@...el.com, 
	binbin.wu@...ux.intel.com, chao.p.peng@...el.com, chao.gao@...el.com
Subject: Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory

On Fri, Jan 9, 2026 at 1:21 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
>
> On Thu, Jan 08, 2026 at 12:11:14PM -0800, Ackerley Tng wrote:
> > Yan Zhao <yan.y.zhao@...el.com> writes:
> >
> > > On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote:
> > >> On Tue, Jan 06, 2026, Ackerley Tng wrote:
> > >> > Sean Christopherson <seanjc@...gle.com> writes:
> > >> >
> > >> > > On Tue, Jan 06, 2026, Ackerley Tng wrote:
> > >> > >> Vishal Annapurve <vannapurve@...gle.com> writes:
> > >> > >>
> > >> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > >> > >> >>
> > >> > >> >> - EPT mapping size and folio size
> > >> > >> >>
> > >> > >> >>   This series is built upon the rule in KVM that the mapping size in the
> > >> > >> >>   KVM-managed secondary MMU is no larger than the backend folio size.
> > >> > >> >>
> > >> > >>
> > >> > >> I'm not familiar with this rule and would like to find out more. Why is
> > >> > >> this rule imposed?
> > >> > >
> > >> > > Because it's the only sane way to safely map memory into the guest? :-D
> > >> > >
> > >> > >> Is this rule there just because traditionally folio sizes also define the
> > >> > >> limit of contiguity, and so the mapping size must not be greater than folio
> > >> > >> size in case the block of memory represented by the folio is not contiguous?
> > >> > >
> > >> > > Pre-guest_memfd, KVM didn't care about folios.  KVM's mapping size was (and still
> > >> > > is) strictly bound by the host mapping size.  That handles contiguous addresses,
> > >> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
> > >> > >
> > >> > >> In guest_memfd's case, even if the folio is split (just for refcount
> > >> >> tracking purposes on private-to-shared conversion), the memory is still
> > >> > >> contiguous up to the original folio's size. Will the contiguity address
> > >> > >> the concerns?
> > >> > >
> > >> > > Not really?  Why would the folio be split if the memory _and its attributes_ are
> > >> > > fully contiguous?  If the attributes are mixed, KVM must not create a mapping
> > >> > > spanning mixed ranges, i.e. with multiple folios.
> > >> >
> > >> > The folio can be split if any (or all) of the pages in a huge page range
> > >> > are shared (in the CoCo sense). So in a 1G block of memory, even if the
> > >> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio
> > >> > would be split, and the split folios are necessary for tracking users of
> > >> > shared pages using struct page refcounts.
> > >>
> > >> Ahh, that's what the refcounting was referring to.  Gotcha.
> > >>
> > >> > However the split folios in that 1G range are still fully contiguous.
> > >> >
> > >> > The process of conversion will split the EPT entries soon after the
> > >> > folios are split so the rule remains upheld.
> >
> > Correction here: If we go with splitting from 1G to 4K uniformly on
> > sharing, only the EPT entries around the shared 4K folio will have their
> > page table entries split, so many of the EPT entries will be at 2M level
> > though the folios are 4K sized. This would be last beyond the conversion
> > process.
> >
> > > Overall, I don't think allowing folios smaller than the mappings while
> > > conversion is in progress brings enough benefit.
> > >
> >
> > I'll look into making the restructuring process always succeed, but off
> > the top of my head that's hard because
> >
> > 1. HugeTLB Vmemmap Optimization code would have to be refactored to
> >    use pre-allocated pages, which is refactoring deep in HugeTLB code
> >
> > 2. If we want to split non-uniformly such that only the folios that are
> >    shared are 4K, and the remaining folios are as large as possible (PMD
> >    sized as much as possible), it gets complex to figure out how many
> >    pages to allocate ahead of time.
> >
> > So it's complex and will probably delay HugeTLB+conversion support even
> > more!
> >
> > > Cons:
> > > (1) TDX's zapping callback has no idea whether the zapping is caused by an
> > >     in-progress private-to-shared conversion or other reasons. It also has no
> > >     idea if the attributes of the underlying folios remain unchanged during an
> > >     in-progress private-to-shared conversion. Even if the assertion Ackerley
> > >     mentioned is true, it's not easy to drop the sanity checks in TDX's zapping
> > >     callback for in-progress private-to-shared conversion alone (which would
> > >     increase TDX's dependency on guest_memfd's specific implementation even if
> > >     it's feasible).
> > >
> > >     Removing the sanity checks entirely in TDX's zapping callback is confusing
> > >     and would show a bad/false expectation from KVM -- what if a huge folio is
> > >     incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or
> > >     others) in other conditions? And then do we still need the check in TDX's
> > >     mapping callback? If not, does it mean TDX huge pages can stop relying on
> > >     guest_memfd's ability to allocate huge folios, as KVM could still create
> > >     huge mappings as long as small folios are physically contiguous with
> > >     homogeneous memory attributes?
> > >
> > > (2) Allowing folios smaller than the mapping would require splitting S-EPT in
> > >     kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the
> > >     invalidate lock held in __kvm_gmem_set_attributes() could guard against
> > >     concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks
> > >     error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially).
> > >
> >
> > I think the central question I have among all the above is what TDX
> > needs to actually care about (putting aside KVM's folio size/memory
> > contiguity vs mapping level rule for a while).
> >
> > I think TDX code can check what it cares about (if required to aid
> > debugging, as Dave suggested). Does TDX actually care about folio sizes,
> > or does it actually care about memory contiguity and alignment?
> TDX cares about memory contiguity. A single folio ensures memory contiguity.

In this slightly unusual case, I think the guarantee needed here is
that as long as a range is mapped into SEPT entries, guest_memfd
ensures that the complete range stays private.

i.e. I think it should be safe to rely on guest_memfd here,
irrespective of the folio sizes:
1) KVM TDX stack should be able to reclaim the complete range when unmapping.
2) KVM TDX stack can assume that as long as memory is mapped in SEPT
entries, guest_memfd will not let host userspace mappings access
guest private memory.
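
To make the invariant concrete, here is a small userspace model of the
guarantee I mean (names like range_mappable_private() are illustrative,
not kernel API): one SEPT mapping may cover a range only if every 4K
page in it is private, and folio boundaries play no role in that check.

```c
/* Userspace model of the guest_memfd invariant: a range may back one
 * SEPT mapping only while every 4K page in it is private. Illustrative
 * names only; this is not kernel code. */
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 512 /* model a 2M range as 512 x 4K pages */

enum attr { SHARED = 0, PRIVATE = 1 };

static enum attr attrs[NPAGES];

/* Homogeneity check over 4K pages; folio sizes are irrelevant here. */
static bool range_mappable_private(size_t start, size_t npages)
{
	for (size_t i = start; i < start + npages; i++)
		if (attrs[i] != PRIVATE)
			return false;
	return true;
}
```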

>
> Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer
> reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a
> contiguous range larger than the page's folio range.

What's the issue with passing the (struct page*, unsigned long nr_pages) pair?
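
As a sketch of what I have in mind (wbinvd_hkid_page()/_range() are
hypothetical stand-ins, not the real tdh_phymem_page_wbinvd_hkid()
interface): a flush over a physically contiguous range only needs a
base pfn and a count, independent of how many folios cover the range.

```c
/* Userspace model: flushing a physically contiguous range needs only
 * (base pfn, nr_pages); folio boundaries do not enter the loop.
 * Hypothetical names, not the TDX module ABI. */
#include <stddef.h>

static size_t flushed; /* count of 4K pages "flushed" by the model */

static void wbinvd_hkid_page(unsigned long pfn)
{
	(void)pfn; /* a real implementation would flush this page */
	flushed++;
}

static void wbinvd_hkid_range(unsigned long base_pfn,
			      unsigned long nr_pages)
{
	for (unsigned long i = 0; i < nr_pages; i++)
		wbinvd_hkid_page(base_pfn + i);
}
```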

>
> Additionally, we don't split private mappings in kvm_gmem_error_folio().
> If smaller folios are allowed, splitting private mapping is required there.

Yes, I believe splitting private mappings will be invoked to ensure
that the whole huge folio is not unmapped from KVM due to an error on
just a 4K page. Is that a problem?

If splitting fails, the implementation can fall back to completely
zapping the folio range.
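
The fallback I'm describing could look like this model (the states and
helper names are illustrative, assuming a split that can fail on e.g.
page-table allocation failure):

```c
/* Userspace model of the error path: try to split the private mapping
 * so only the bad 4K page is lost; if the split fails, zap the whole
 * folio range. Illustrative names only. */
#include <stdbool.h>

enum mapping_state { MAPPED_HUGE, MAPPED_SPLIT, ZAPPED };

static enum mapping_state state = MAPPED_HUGE;

/* Splitting can fail, e.g. if page-table pages cannot be allocated. */
static bool try_split_private_mapping(bool alloc_ok)
{
	if (!alloc_ok)
		return false;
	state = MAPPED_SPLIT;
	return true;
}

static void handle_folio_error(bool alloc_ok)
{
	if (!try_split_private_mapping(alloc_ok))
		state = ZAPPED; /* fall back: zap the whole folio range */
}
```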

> (e.g., after splitting a 1GB folio to 4KB folios with 2MB mappings. Also, is it
> possible for splitting a huge folio to fail partially, without merging the huge
> folio back or further zapping?).

Yes, splitting can fail partially, but guest_memfd will not make the
ranges available to host userspace and derivatives until:
1) The complete range to be converted is split to 4K granularity.
2) The complete range to be converted is zapped from KVM EPT mappings.
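
The two conditions above gate host access; as a model (struct and
function names are made up for illustration):

```c
/* Userspace model of the conversion gate: guest_memfd exposes a range
 * to host userspace only after both conditions above hold.
 * Illustrative names only. */
#include <stdbool.h>

struct conv_range {
	bool split_to_4k;     /* (1) folios split to 4K granularity */
	bool zapped_from_ept; /* (2) range zapped from KVM EPT mappings */
};

static bool host_access_allowed(const struct conv_range *r)
{
	return r->split_to_4k && r->zapped_from_ept;
}
```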

> Not sure if there're other edge cases we're still missing.
>
> > Separately, KVM could also enforce the folio size/memory contiguity vs
> > mapping level rule, but TDX code shouldn't enforce KVM's rules. So if
> > the check is deemed necessary, it still shouldn't be in TDX code, I
> > think.
> >
> > > Pro: Preventing zapping private memory until conversion is successful is good.
> > >
> > > However, could we achieve this benefit in other ways? For example, is it
> > > possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring
> > > split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap
> > > optimization? (hugetlb_vmemmap conversion is super slow according to my
> > > observation and I always disable it).
> >
> > HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a
> > huge VM, multiplied by a large number of hosts, this is not a trivial
> > amount of memory. It's one of the key reasons why we are using HugeTLB
> > in guest_memfd in the first place, other than to be able to get high
> > level page table mappings. We want this in production.
> >
> > > Or pre-allocation for
> > > vmemmap_remap_alloc()?
> > >
> >
> > Will investigate if this is possible as mentioned above. Thanks for the
> > suggestion again!
> >
> > > Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping
> > > private memory before conversion succeeds is still better than introducing the
> > > mess between folio size and mapping size.
> > >
> > >> > I guess perhaps the question is, is it okay if the folios are smaller
> > >> > than the mapping while conversion is in progress? Does the order matter
> > >> > (split page table entries first vs split folios first)?
> > >>
> > >> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogeneous is
> > >> conceptually totally fine, i.e. I'm not totally opposed to adding support for
> > >> mapping multiple guest_memfd folios with a single hugepage.  As to whether we
> > >> do (a) nothing, (b) change the refcounting, or (c) add support for mapping
> > >> multiple folios in one page, probably comes down to which option provides "good
> > >> enough" performance without incurring too much complexity.
> >
