Message-ID: <CAEvNRgGk73cNFSTBB2p4Jbc-KS6YhU0WSd0pv9JVDArvRd=v4g@mail.gmail.com>
Date: Fri, 9 Jan 2026 10:07:00 -0800
From: Ackerley Tng <ackerleytng@...gle.com>
To: Vishal Annapurve <vannapurve@...gle.com>, Yan Zhao <yan.y.zhao@...el.com>
Cc: Sean Christopherson <seanjc@...gle.com>, pbonzini@...hat.com, linux-kernel@...r.kernel.org, 
	kvm@...r.kernel.org, x86@...nel.org, rick.p.edgecombe@...el.com, 
	dave.hansen@...el.com, kas@...nel.org, tabba@...gle.com, michael.roth@....com, 
	david@...nel.org, sagis@...gle.com, vbabka@...e.cz, thomas.lendacky@....com, 
	nik.borisov@...e.com, pgonda@...gle.com, fan.du@...el.com, jun.miao@...el.com, 
	francescolavra.fl@...il.com, jgross@...e.com, ira.weiny@...el.com, 
	isaku.yamahata@...el.com, xiaoyao.li@...el.com, kai.huang@...el.com, 
	binbin.wu@...ux.intel.com, chao.p.peng@...el.com, chao.gao@...el.com
Subject: Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory

Vishal Annapurve <vannapurve@...gle.com> writes:

> On Fri, Jan 9, 2026 at 1:21 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
>>
>> On Thu, Jan 08, 2026 at 12:11:14PM -0800, Ackerley Tng wrote:
>> > Yan Zhao <yan.y.zhao@...el.com> writes:
>> >
>> > > On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote:
>> > >> On Tue, Jan 06, 2026, Ackerley Tng wrote:
>> > >> > Sean Christopherson <seanjc@...gle.com> writes:
>> > >> >
>> > >> > > On Tue, Jan 06, 2026, Ackerley Tng wrote:
>> > >> > >> Vishal Annapurve <vannapurve@...gle.com> writes:
>> > >> > >>
>> > >> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
>> > >> > >> >>
>> > >> > >> >> - EPT mapping size and folio size
>> > >> > >> >>
>> > >> > >> >>   This series is built upon the rule in KVM that the mapping size in the
>> > >> > >> >>   KVM-managed secondary MMU is no larger than the backend folio size.
>> > >> > >> >>
>> > >> > >>
>> > >> > >> I'm not familiar with this rule and would like to find out more. Why is
>> > >> > >> this rule imposed?
>> > >> > >
>> > >> > > Because it's the only sane way to safely map memory into the guest? :-D
>> > >> > >
>> > >> > >> Is this rule there just because traditionally folio sizes also define the
>> > >> > >> limit of contiguity, and so the mapping size must not be greater than folio
>> > >> > >> size in case the block of memory represented by the folio is not contiguous?
>> > >> > >
>> > >> > > Pre-guest_memfd, KVM didn't care about folios.  KVM's mapping size was (and still
>> > >> > > is) strictly bound by the host mapping size.  That handles contiguous addresses,
>> > >> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
>> > >> > >
>> > >> > >> In guest_memfd's case, even if the folio is split (just for refcount
>> > >> > >> tracking purposes on private-to-shared conversion), the memory is still
>> > >> > >> contiguous up to the original folio's size. Will the contiguity address
>> > >> > >> the concerns?
>> > >> > >
>> > >> > > Not really?  Why would the folio be split if the memory _and its attributes_ are
>> > >> > > fully contiguous?  If the attributes are mixed, KVM must not create a mapping
>> > >> > > spanning mixed ranges, i.e. with multiple folios.
>> > >> >
>> > >> > The folio can be split if any (or all) of the pages in a huge page range
>> > >> > are shared (in the CoCo sense). So in a 1G block of memory, even if the
>> > >> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio
>> > >> > would be split, and the split folios are necessary for tracking users of
>> > >> > shared pages using struct page refcounts.
>> > >>
>> > >> Ahh, that's what the refcounting was referring to.  Gotcha.
>> > >>
>> > >> > However the split folios in that 1G range are still fully contiguous.
>> > >> >
>> > >> > The process of conversion will split the EPT entries soon after the
>> > >> > folios are split so the rule remains upheld.
>> >
>> > Correction here: If we go with splitting from 1G to 4K uniformly on
>> > sharing, only the EPT entries around the shared 4K folio will have their
>> > page table entries split, so many of the EPT entries will be at 2M level
>> > though the folios are 4K sized. This would last beyond the conversion
>> > process.
>> >
>> > > Overall, I don't think allowing folios smaller than the mappings while
>> > > conversion is in progress brings enough benefit.
>> > >
>> >
>> > I'll look into making the restructuring process always succeed, but off
>> > the top of my head that's hard because
>> >
>> > 1. HugeTLB Vmemmap Optimization code would have to be refactored to
>> >    use pre-allocated pages, which is refactoring deep in HugeTLB code
>> >
>> > 2. If we want to split non-uniformly such that only the folios that are
>> >    shared are 4K, and the remaining folios are as large as possible (PMD
>> >    sized as much as possible), it gets complex to figure out how many
>> >    pages to allocate ahead of time.
>> >
>> > So it's complex and will probably delay HugeTLB+conversion support even
>> > more!
>> >
>> > > Cons:
>> > > (1) TDX's zapping callback has no idea whether the zapping is caused by an
>> > >     in-progress private-to-shared conversion or other reasons. It also has no
>> > >     idea if the attributes of the underlying folios remain unchanged during an
>> > >     in-progress private-to-shared conversion. Even if the assertion Ackerley
>> > >     mentioned is true, it's not easy to drop the sanity checks in TDX's zapping
>> > >     callback for in-progress private-to-shared conversion alone (which would
>> > >     increase TDX's dependency on guest_memfd's specific implementation even if
>> > >     it's feasible).
>> > >
>> > >     Removing the sanity checks entirely in TDX's zapping callback is confusing
>> > >     and would show a bad/false expectation from KVM -- what if a huge folio is
>> > >     incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or
>> > >     others) in other conditions? And then do we still need the check in TDX's
>> > >     mapping callback? If not, does it mean TDX huge pages can stop relying on
>> > >     guest_memfd's ability to allocate huge folios, as KVM could still create
>> > >     huge mappings as long as small folios are physically contiguous with
>> > >     homogeneous memory attributes?
>> > >
>> > > (2) Allowing folios smaller than the mapping would require splitting S-EPT in
>> > >     kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the
>> > >     invalidate lock held in __kvm_gmem_set_attributes() could guard against
>> > >     concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks
>> > >     error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially).
>> > >
>> >
>> > I think the central question I have among all the above is what TDX
>> > needs to actually care about (putting aside what KVM's folio size/memory
>> > contiguity vs mapping level rule for a while).
>> >
>> > I think TDX code can check what it cares about (if required to aid
>> > debugging, as Dave suggested). Does TDX actually care about folio sizes,
>> > or does it actually care about memory contiguity and alignment?
>> TDX cares about memory contiguity. A single folio ensures memory contiguity.
>
> In this slightly unusual case, I think the guarantee needed here is
> that as long as a range is mapped into SEPT entries, guest_memfd
> ensures that the complete range stays private.
>
> i.e. I think it should be safe to rely on guest_memfd here,
> irrespective of the folio sizes:
> 1) KVM TDX stack should be able to reclaim the complete range when unmapping.
> 2) KVM TDX stack can assume that as long as memory is mapped in SEPT
> entries, guest_memfd will not let host userspace mappings access
> guest private memory.
>
>>
>> Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer
>> reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a
>> contiguous range larger than the page's folio range.
>
> What's the issue with passing the (struct page*, unsigned long nr_pages) pair?
>
>>
>> Additionally, we don't split private mappings in kvm_gmem_error_folio().
>> If smaller folios are allowed, splitting private mapping is required there.

It was discussed before that for memory failure handling we will want
to split huge pages, and we will get to it! The trouble is that
guest_memfd took the page from HugeTLB rather than managing the memory
from the ground up the way buddy or HugeTLB does, so we still need to
work out whether it's okay to let HugeTLB deal with the poisoned page
when the folio is freed back to it. When I last looked, HugeTLB doesn't
actually handle poisoned folios on freeing, so there's more work to do
on the HugeTLB side.
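
Roughly what I have in mind, as a sketch only (the gmem_* helper names
here are made up for illustration, not actual guest_memfd/HugeTLB APIs):

/*
 * Sketch with hypothetical helpers: when guest_memfd drops its last
 * reference and the folio would go back to HugeTLB, a poisoned folio
 * can't simply be returned to the free pool, because HugeTLB's freeing
 * path doesn't handle poisoned folios today.
 */
static void gmem_free_folio(struct folio *folio)
{
	if (folio_test_hwpoison(folio)) {
		/* hypothetical: quarantine instead of freeing to HugeTLB */
		gmem_quarantine_poisoned_folio(folio);
		return;
	}

	/* The last put hands the folio back to HugeTLB's free lists. */
	folio_put(folio);
}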

This is a good point, although IIUC it is a separate issue. The need to
split private mappings on memory failure is not about confidentiality in
the TDX sense but about ensuring that the guest doesn't use the failed
memory. In that case contiguity is broken by the failed memory: the
folio is split, and the private EPTs are split. The folio size still
should not be checked in TDX code; guest_memfd knows contiguity was
broken, so guest_memfd calls into TDX code to split the EPTs.
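
Something like the following, purely as a sketch with invented names
(not the series' actual functions), is the flow I'm picturing:
guest_memfd drives the S-EPT split before any zap, so TDX never has to
look at folio sizes itself.

/* Sketch with hypothetical names; not the actual patch series API. */
static int gmem_handle_failed_page(struct gmem_file *f, pgoff_t bad_index)
{
	int r;

	/* Split the backing huge folio so the bad 4K page stands alone. */
	r = gmem_split_folio_at(f, bad_index);			/* hypothetical */
	if (r)
		return r;

	/* guest_memfd asks KVM/TDX to split the private S-EPT to match. */
	r = gmem_split_private_mappings(f, bad_index, 1);	/* hypothetical */
	if (r)
		return r;

	/* Only then zap the single failed page from the guest. */
	gmem_zap_private(f, bad_index, 1);			/* hypothetical */
	return 0;
}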

>
> Yes, I believe splitting private mappings will be invoked to ensure
> that the whole huge folio is not unmapped from KVM due to an error on
> just a 4K page. Is that a problem?
>
> If splitting fails, the implementation can fall back to completely
> zapping the folio range.
>
>> (e.g., after splitting a 1GB folio to 4KB folios with 2MB mappings. Also, is it
>> possible for splitting a huge folio to fail partially, without merging the huge
>> folio back or further zapping?).

The current stance is to allow splitting failures and not undo a
partial split, so there's no merging back to recover from a splitting
failure. (That's not set in stone yet; I think merging back could turn
out to be a requirement from the mm side, which would add more
complexity to the restructuring logic.)

If the folio is not merged back after a split failure, the pages are
still contiguous: they are guaranteed contiguous for as long as they are
owned by guest_memfd (even in the case of memory failure, if I get my
way :P), so TDX can still trust that.

I think you're worried that on a split failure some folios end up split
while the private EPTs covering them are not. But the memory under those
unsplit private EPTs is still contiguous, and on a split failure we quit
early, so guest_memfd still tracks the ranges as private.

Privateness and contiguity are preserved, so I think TDX should be good
with that? The TD can still run. IIUC it is part of the plan that on a
splitting failure the conversion ioctl returns failure and the guest is
informed, so that it can do whatever cleanup it needs to do.
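
To make the ordering concrete, here's a rough sketch of the conversion
path as I understand it (all names invented for illustration): a partial
split failure just fails the ioctl, nothing is merged back, and the
range stays private and contiguous.

/* Sketch with hypothetical names; not the actual conversion code. */
static int gmem_convert_private_to_shared(struct gmem_file *f,
					  pgoff_t start, pgoff_t nr)
{
	int r;

	/* May fail partway through; no merge back on failure. */
	r = gmem_split_folios(f, start, nr);		/* hypothetical */
	if (r)
		return r;	/* ioctl fails, guest is told to clean up */

	/* Zap private S-EPT mappings only after the split succeeded. */
	r = gmem_zap_private(f, start, nr);		/* hypothetical */
	if (r)
		return r;

	/* Only now is the range made available to host userspace. */
	return gmem_mark_shared(f, start, nr);		/* hypothetical */
}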

>
> Yes, splitting can fail partially, but guest_memfd will not make the
> ranges available to host userspace and derivatives until:
> 1) The complete range to be converted is split to 4K granularity.
> 2) The complete range to be converted is zapped from KVM EPT mappings.
>
>> Not sure if there're other edge cases we're still missing.
>>

As you said, at the core TDX is concerned about contiguity of the
memory ranges (start_addr, length) that it was given. Contiguity is
guaranteed by guest_memfd, while the folio is under guest_memfd
ownership, up to the boundaries of the original (pre-restructuring)
folio. So if we're looking for edge cases, I think they would be around
truncation; I can't think of anything else right now.

(guest_memfd will also block truncation at any granularity smaller than
the folio's original, pre-restructuring size, regardless of the folio's
current size.)
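
i.e. something like the following check (hypothetical names again),
keyed off the folio's original size rather than its current size:

/* Sketch with hypothetical helpers; illustrates the intent only. */
static bool gmem_truncate_allowed(struct gmem_file *f, loff_t offset, loff_t len)
{
	/* hypothetical: the size the folio had before any restructuring */
	size_t orig = gmem_original_folio_size(f, offset);

	/* Reject punches smaller than, or misaligned to, the original folio. */
	return IS_ALIGNED(offset, orig) && IS_ALIGNED(len, orig);
}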

>> > Separately, KVM could also enforce the folio size/memory contiguity vs
>> > mapping level rule, but TDX code shouldn't enforce KVM's rules. So if
>> > the check is deemed necessary, it still shouldn't be in TDX code, I
>> > think.
>> >
>> > > Pro: Preventing zapping private memory until conversion is successful is good.
>> > >
>> > > However, could we achieve this benefit in other ways? For example, is it
>> > > possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring
>> > > split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap
>> > > optimization? (hugetlb_vmemmap conversion is super slow according to my
>> > > observation and I always disable it).
>> >
>> > HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a
>> > huge VM, multiplied by a large number of hosts, this is not a trivial
>> > amount of memory. It's one of the key reasons why we are using HugeTLB
>> > in guest_memfd in the first place, other than to be able to get high
>> > level page table mappings. We want this in production.
>> >
>> > > Or pre-allocation for
>> > > vmemmap_remap_alloc()?
>> > >
>> >
>> > Will investigate if this is possible as mentioned above. Thanks for the
>> > suggestion again!
>> >
>> > > Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping
>> > > private memory before conversion succeeds is still better than introducing the
>> > > mess between folio size and mapping size.
>> > >
>> > >> > I guess perhaps the question is, is it okay if the folios are smaller
>> > >> > than the mapping while conversion is in progress? Does the order matter
>> > >> > (split page table entries first vs split folios first)?
>> > >>
>> > >> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogeneous is
>> > >> conceptually totally fine, i.e. I'm not totally opposed to adding support for
>> > >> mapping multiple guest_memfd folios with a single hugepage.   As to whether we
>> > >> do (a) nothing, (b) change the refcounting, or (c) add support for mapping
>> > >> multiple folios in one page, probably comes down to which option provides "good
>> > >> enough" performance without incurring too much complexity.
>> >
