Message-ID: <diqzqzrzdfvh.fsf@google.com>
Date: Thu, 08 Jan 2026 12:11:14 -0800
From: Ackerley Tng <ackerleytng@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>, Sean Christopherson <seanjc@...gle.com>
Cc: Vishal Annapurve <vannapurve@...gle.com>, pbonzini@...hat.com,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org, x86@...nel.org,
rick.p.edgecombe@...el.com, dave.hansen@...el.com, kas@...nel.org,
tabba@...gle.com, michael.roth@....com, david@...nel.org, sagis@...gle.com,
vbabka@...e.cz, thomas.lendacky@....com, nik.borisov@...e.com,
pgonda@...gle.com, fan.du@...el.com, jun.miao@...el.com,
francescolavra.fl@...il.com, jgross@...e.com, ira.weiny@...el.com,
isaku.yamahata@...el.com, xiaoyao.li@...el.com, kai.huang@...el.com,
binbin.wu@...ux.intel.com, chao.p.peng@...el.com, chao.gao@...el.com
Subject: Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
Yan Zhao <yan.y.zhao@...el.com> writes:
> On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote:
>> On Tue, Jan 06, 2026, Ackerley Tng wrote:
>> > Sean Christopherson <seanjc@...gle.com> writes:
>> >
>> > > On Tue, Jan 06, 2026, Ackerley Tng wrote:
>> > >> Vishal Annapurve <vannapurve@...gle.com> writes:
>> > >>
>> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
>> > >> >>
>> > >> >> - EPT mapping size and folio size
>> > >> >>
>> > >> >> This series is built upon the rule in KVM that the mapping size in the
>> > >> >> KVM-managed secondary MMU is no larger than the backend folio size.
>> > >> >>
>> > >>
>> > >> I'm not familiar with this rule and would like to find out more. Why is
>> > >> this rule imposed?
>> > >
>> > > Because it's the only sane way to safely map memory into the guest? :-D
>> > >
>> > >> Is this rule there just because traditionally folio sizes also define the
>> > >> limit of contiguity, and so the mapping size must not be greater than folio
>> > >> size in case the block of memory represented by the folio is not contiguous?
>> > >
>> > > Pre-guest_memfd, KVM didn't care about folios. KVM's mapping size was (and still
>> > > is) strictly bound by the host mapping size. That handles contiguous addresses,
>> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
>> > >
>> > >> In guest_memfd's case, even if the folio is split (just for refcount
>> > >> tracking purposes on private to shared conversion), the memory is still
>> > >> contiguous up to the original folio's size. Will the contiguity address
>> > >> the concerns?
>> > >
>> > > Not really? Why would the folio be split if the memory _and its attributes_ are
>> > > fully contiguous? If the attributes are mixed, KVM must not create a mapping
>> > > spanning mixed ranges, i.e. with multiple folios.
>> >
>> > The folio can be split if any (or all) of the pages in a huge page range
>> > are shared (in the CoCo sense). So in a 1G block of memory, even if the
>> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio
>> > would be split, and the split folios are necessary for tracking users of
>> > shared pages using struct page refcounts.
>>
>> Ahh, that's what the refcounting was referring to. Gotcha.
>>
>> > However the split folios in that 1G range are still fully contiguous.
>> >
>> > The process of conversion will split the EPT entries soon after the
>> > folios are split so the rule remains upheld.
Correction here: If we go with splitting from 1G to 4K uniformly on
sharing, only the EPT entries around the shared 4K folio will have their
page table entries split, so many of the EPT entries will remain at 2M
level even though the folios are 4K sized. This would last beyond the
conversion process.
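
To make the mismatch concrete, here's a small stand-alone sketch (not
kernel code; the constants and names are made up for illustration) that
counts, for a single shared 4K page in a 1G range, how many 2M-level EPT
entries actually need splitting versus how many folios end up 4K sized
after a uniform split:

#include <stdio.h>

#define PAGES_PER_2M   512UL            /* 4K pages per 2M region */
#define ENTRIES_PER_1G 512UL            /* 2M-level entries per 1G region */

int main(void)
{
        /*
         * Uniform 1G -> 4K folio split on sharing: every folio in the
         * 1G range becomes 4K sized.
         */
        unsigned long folios_at_4k = ENTRIES_PER_1G * PAGES_PER_2M;

        /*
         * EPT side: only the 2M-level entry covering the shared 4K page
         * needs to be split down to 4K; the other 511 stay at 2M even
         * though their backing folios are now 4K.
         */
        unsigned long ept_split = 1;
        unsigned long ept_still_2m = ENTRIES_PER_1G - ept_split;

        printf("4K folios: %lu, 2M EPT entries left intact: %lu\n",
               folios_at_4k, ept_still_2m);
        return 0;
}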
> Overall, I don't think allowing folios smaller than the mappings while
> conversion is in progress brings enough benefit.
>
I'll look into making the restructuring process always succeed, but off
the top of my head that's hard because
1. The HugeTLB Vmemmap Optimization code would have to be refactored to
   use pre-allocated pages, which means refactoring deep in HugeTLB code.
2. If we want to split non-uniformly such that only the folios that are
   shared are 4K, and the remaining folios are as large as possible (PMD
   sized as much as possible), it gets complex to figure out how many
   pages to allocate ahead of time (rough arithmetic sketched below).
So it's complex and will probably delay HugeTLB+conversion support even
more!
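
For item 2, a rough feel for the arithmetic (stand-alone sketch, not
kernel code; it only counts resulting folios and ignores the vmemmap
pages a split would also need):

#include <stdio.h>

#define PAGES_PER_2M     512UL
#define BLOCKS_2M_PER_1G 512UL

/*
 * Non-uniform split of one 1G folio: 2M blocks containing at least one
 * shared 4K page are split all the way to 4K folios, the rest stay 2M.
 * Worst case assumed here: each shared page lands in a distinct 2M block.
 */
static unsigned long folios_after_split(unsigned long shared_4k_pages)
{
        unsigned long split_blocks = shared_4k_pages < BLOCKS_2M_PER_1G ?
                                     shared_4k_pages : BLOCKS_2M_PER_1G;

        return (BLOCKS_2M_PER_1G - split_blocks) +   /* remaining 2M folios */
               split_blocks * PAGES_PER_2M;          /* new 4K folios */
}

int main(void)
{
        /* 1 shared page: 511 x 2M folios + 512 x 4K folios = 1023. */
        printf("%lu\n", folios_after_split(1));
        /* 512 scattered shared pages: everything becomes 4K folios. */
        printf("%lu\n", folios_after_split(512));
        return 0;
}

So the amount to pre-allocate depends on how the shared pages are
scattered, which is only known at conversion time.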
> Cons:
> (1) TDX's zapping callback has no idea whether the zapping is caused by an
> in-progress private-to-shared conversion or other reasons. It also has no
> idea if the attributes of the underlying folios remain unchanged during an
> in-progress private-to-shared conversion. Even if the assertion Ackerley
> mentioned is true, it's not easy to drop the sanity checks in TDX's zapping
> callback for in-progress private-to-shared conversion alone (which would
> increase TDX's dependency on guest_memfd's specific implementation even if
> it's feasible).
>
> Removing the sanity checks entirely in TDX's zapping callback is confusing
> and would show a bad/false expectation from KVM -- what if a huge folio is
> incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or
> others) in other conditions? And then do we still need the check in TDX's
> mapping callback? If not, does it mean TDX huge pages can stop relying on
> guest_memfd's ability to allocate huge folios, as KVM could still create
> huge mappings as long as small folios are physically contiguous with
> homogeneous memory attributes?
>
> (2) Allowing folios smaller than the mapping would require splitting S-EPT in
> kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the
> invalidate lock held in __kvm_gmem_set_attributes() could guard against
> concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks
> error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially).
>
I think the central question I have among all the above is what TDX
needs to actually care about (putting aside KVM's folio size/memory
contiguity vs mapping level rule for a while).
I think TDX code can check what it cares about (if required to aid
debugging, as Dave suggested). Does TDX actually care about folio sizes,
or does it actually care about memory contiguity and alignment?
Separately, KVM could also enforce the folio size/memory contiguity vs
mapping level rule, but TDX code shouldn't enforce KVM's rules. So if
the check is deemed necessary, it still shouldn't be in TDX code, I
think.
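
If the debugging check turns out to be about contiguity and alignment
rather than folio size, it could look something like the stand-alone
sketch below (not actual TDX/KVM code; the pfn array is just a stand-in
for however the backing store reports physical frames):

#include <stdbool.h>
#include <stdio.h>

/*
 * A check based on physical contiguity and alignment rather than on the
 * backing folio's size.  @pfns holds the frame numbers backing each 4K
 * page of the mapping, @npages is the mapping size in 4K pages.
 */
static bool range_contig_and_aligned(const unsigned long *pfns,
                                     unsigned long npages)
{
        unsigned long i;

        /* Head frame must be naturally aligned for the mapping size. */
        if (pfns[0] & (npages - 1))
                return false;

        /* Every frame in the range must be physically contiguous. */
        for (i = 1; i < npages; i++) {
                if (pfns[i] != pfns[0] + i)
                        return false;
        }

        return true;
}

int main(void)
{
        /*
         * Four contiguous, aligned frames: passes regardless of how many
         * folios back them.
         */
        unsigned long pfns[] = { 0x1000, 0x1001, 0x1002, 0x1003 };

        printf("%d\n", range_contig_and_aligned(pfns, 4));
        return 0;
}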
> Pro: Preventing zapping private memory until conversion is successful is good.
>
> However, could we achieve this benefit in other ways? For example, is it
> possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring
> split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap
> optimization? (hugetlb_vmemmap conversion is super slow according to my
> observation and I always disable it).
HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a
huge VM, multiplied by a large number of hosts, this is not a trivial
amount of memory. It's one of the key reasons why we are using HugeTLB
in guest_memfd in the first place, besides being able to get higher
level page table mappings. We want this in production.
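(For reference, my own back-of-the-envelope numbers, assuming a 64-byte
struct page, 4 KiB base pages and 1G HugeTLB pages: vmemmap overhead is
64 / 4096 ~= 1.56% of memory, and HVO frees all but one of the ~4096
vmemmap pages backing each 1G hugepage, so nearly the full 1.6% comes
back.)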
> Or pre-allocation for
> vmemmap_remap_alloc()?
>
Will investigate if this is possible as mentioned above. Thanks for the
suggestion again!
> Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping
> private memory before conversion succeeds is still better than introducing the
> mess between folio size and mapping size.
>
>> > I guess perhaps the question is, is it okay if the folios are smaller
>> > than the mapping while conversion is in progress? Does the order matter
>> > (split page table entries first vs split folios first)?
>>
>> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
>> conceptually totally fine, i.e. I'm not totally opposed to adding support for
>> mapping multiple guest_memfd folios with a single hugepage. As to whether we
>> do (a) nothing, (b) change the refcounting, or (c) add support for mapping
>> multiple folios in one page, probably comes down to which option provides "good
>> enough" performance without incurring too much complexity.