Message-ID: <CAEvNRgGG+xYhsz62foOrTeAxUCYxpCKCJnNgTAMYMV=w2eq+6Q@mail.gmail.com>
Date: Tue, 6 Jan 2026 13:26:34 -0800
From: Ackerley Tng <ackerleytng@...gle.com>
To: Vishal Annapurve <vannapurve@...gle.com>, Yan Zhao <yan.y.zhao@...el.com>
Cc: pbonzini@...hat.com, seanjc@...gle.com, linux-kernel@...r.kernel.org,
kvm@...r.kernel.org, x86@...nel.org, rick.p.edgecombe@...el.com,
dave.hansen@...el.com, kas@...nel.org, tabba@...gle.com, michael.roth@....com,
david@...nel.org, sagis@...gle.com, vbabka@...e.cz, thomas.lendacky@....com,
nik.borisov@...e.com, pgonda@...gle.com, fan.du@...el.com, jun.miao@...el.com,
francescolavra.fl@...il.com, jgross@...e.com, ira.weiny@...el.com,
isaku.yamahata@...el.com, xiaoyao.li@...el.com, kai.huang@...el.com,
binbin.wu@...ux.intel.com, chao.p.peng@...el.com, chao.gao@...el.com
Subject: Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
Vishal Annapurve <vannapurve@...gle.com> writes:
> On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
>>
>> - EPT mapping size and folio size
>>
>> This series is built upon the rule in KVM that the mapping size in the
>> KVM-managed secondary MMU is no larger than the backend folio size.
>>
I'm not familiar with this rule and would like to find out more. Why is
this rule imposed? Is it there just because, traditionally, the folio
size also defines the limit of contiguity, so the mapping size must not
exceed the folio size in case the block of memory represented by the
folio is not physically contiguous?
In guest_memfd's case, even if the folio is split (just for refcount
tracking purposes on private-to-shared conversion), the memory is still
contiguous up to the original folio's size. Would that contiguity
address the concerns?
Specifically for TDX, does the folio size itself actually matter for the
mapping, or is contiguity the real requirement?
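To make the question concrete, here is a rough sketch, purely for
illustration, contrasting the two possible rules. The two helpers below
are made up (not from this series), and range_is_physically_contiguous()
is hypothetical:

/* Rule as I understand it today: the mapping must fit in one folio. */
static bool mapping_fits_folio(struct folio *folio, int level)
{
	/* PG_LEVEL_4K == 1, PG_LEVEL_2M == 2, PG_LEVEL_1G == 3 */
	return folio_order(folio) >= (level - 1) * 9;
}

/* Alternative rule: the mapped range must be physically contiguous. */
static bool mapping_is_contiguous(kvm_pfn_t pfn, int level)
{
	unsigned long nr = KVM_PAGES_PER_HPAGE(level);

	/*
	 * With hugetlb-backed guest_memfd, memory stays physically
	 * contiguous up to the original (pre-split) folio size, so a
	 * check like this could still pass after a folio split done
	 * purely for refcount tracking.
	 */
	return IS_ALIGNED(pfn, nr) &&
	       range_is_physically_contiguous(pfn, nr); /* hypothetical */
}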
>> Therefore, there are sanity checks in the SEAMCALL wrappers in patches
>> 1-5 to follow this rule, either in tdh_mem_page_aug() for mapping
>> (patch 1) or in tdh_phymem_page_wbinvd_hkid() (patch 3),
>> tdx_quirk_reset_folio() (patch 4), tdh_phymem_page_reclaim() (patch 5)
>> for unmapping.
>>
>> However, as Vishal pointed out in [7], the new hugetlb-based guest_memfd
>> [1] splits backend folios ahead of notifying KVM for unmapping. So, this
>> series also relies on the fixup patch [8] to notify KVM of unmapping
>> before splitting the backend folio during the memory conversion ioctl.
>
> I think the major issue here is that if splitting fails there is no
> way to undo the unmapping [1]. How should KVM/VMM/guest handle the
> case where a guest requested conversion to shared, the conversion
> failed and the memory is no longer mapped as private?
>
> [1] https://lore.kernel.org/kvm/aN8P87AXlxlEDdpP@google.com/
>
Unmapping was supposed to be the point of no return in the conversion
process. (This might have changed since we last discussed this. The link
[1] from Vishal is where it was discussed.)
The new/current plan is that in the conversion process we'll do
everything that might fail first, and then commit the conversion,
beginning with zapping, so zapping becomes the point of no return.
(I think you also suggested this before, but back then I couldn't see a
way to separate out the steps cleanly.)
Here are the conversion steps we're trying now (leaving out TDX EPT
splitting for the moment; a rough code sketch follows the list):
1. Allocate enough memory for updating the attributes maple tree.
2a. Only for shared->private conversions: unmap from host page tables
and check for safe refcounts.
2b. Only for private->shared conversions: split folios (note: split
only, no merges). Splitting can fail, since HVO needs to be undone and
that requires allocations.
3. Invalidate begin.
4. Zap from stage-2 page tables: this is the point of no return; before
this, we must be sure nothing after it will fail.
5. Update the attributes maple tree using the memory allocated in step 1.
6. Invalidate end.
7. Only for shared->private conversions: merge folios, making sure that
merging does not fail (it should not, since there are no allocations,
only folio, i.e. metadata, updates).
Updating the maple tree before calling the folio merge function allows
the merge function to look up the *updated* maple tree.
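In rough pseudo-C, the flow above would look something like the sketch
below. All helper names are made up for illustration (this is not the
actual guest_memfd code), and error handling is trimmed:

static int gmem_convert_range(struct inode *inode, pgoff_t start,
			      pgoff_t nr, bool to_private)
{
	void *prealloc;
	int ret;

	/* 1. Preallocate memory for the attributes maple tree update. */
	prealloc = gmem_attributes_prealloc(inode, start, nr);
	if (!prealloc)
		return -ENOMEM;

	if (to_private) {
		/* 2a. Unmap from host page tables, check refcounts are safe. */
		ret = gmem_unmap_shared_and_check_refcounts(inode, start, nr);
	} else {
		/* 2b. Split folios; may fail since undoing HVO allocates. */
		ret = gmem_split_folios(inode, start, nr);
	}
	if (ret)
		goto out_free;

	/* 3. Invalidate begin. */
	gmem_invalidate_begin(inode, start, nr);

	/* 4. Zap stage-2 page tables: the point of no return. */
	gmem_zap(inode, start, nr);

	/* 5. Commit the new attributes using the preallocation from step 1. */
	gmem_attributes_commit(inode, start, nr, to_private, prealloc);

	/* 6. Invalidate end. */
	gmem_invalidate_end(inode, start, nr);

	/*
	 * 7. Merge folios for shared->private.  The merge looks up the
	 * already-updated maple tree and must not fail (no allocations,
	 * only folio metadata updates).
	 */
	if (to_private)
		gmem_merge_folios(inode, start, nr);

	return 0;

out_free:
	gmem_attributes_prealloc_free(prealloc);
	return ret;
}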
I'm thinking of inserting the call to EPT splitting after invalidate
begin (3), since EPT splitting is not undoable. However, that will be
after folio splitting, hence my earlier question on whether the hard
rule is based on folio size or on memory contiguity. Would that work?
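Concretely, in the sketch above, the relevant excerpt would become
something like this (kvm_split_private_mappings() is hypothetical):

	/* 3. Invalidate begin. */
	gmem_invalidate_begin(inode, start, nr);

	/*
	 * Proposed: split the S-EPT mappings here, after invalidate begin
	 * but before the zap.  A successful split cannot be undone, but a
	 * failure here can still abort the conversion, since the zap (the
	 * point of no return) has not happened yet.  For private->shared
	 * this runs after the folio split in step 2b, hence the question
	 * above about folio size vs. contiguity.
	 */
	if (!to_private) {
		ret = kvm_split_private_mappings(inode, start, nr);
		if (ret) {
			gmem_invalidate_end(inode, start, nr);
			goto out_free;
		}
	}

	/* 4. Zap stage-2 page tables: the point of no return. */
	gmem_zap(inode, start, nr);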
>> Four issues are identified in the WIP hugetlb-based guest_memfd [1]:
>>
>> (1) Compilation error due to missing symbol export of
>> hugetlb_restructuring_free_folio().
>>
>> (2) guest_memfd splits backend folios when the folio is still mapped as
>> huge in KVM (which breaks KVM's basic assumption that EPT mapping size
>> should not exceed the backend folio size).
>>
>> (3) guest_memfd is incapable of merging folios to huge for
>> shared-to-private conversions.
>>
>> (4) Unnecessary disabling of huge private mappings when the HVA is not 2M-aligned,
>> given that shared pages can only be mapped at 4KB.
>>
>> So, this series also depends on the four fixup patches included in [4]:
>>
Thank you for these fixes!
>> [FIXUP] KVM: guest_memfd: Allow gmem slot lpage even with non-aligned uaddr
>> [FIXUP] KVM: guest_memfd: Allow merging folios after to-private conversion
Thanks for catching this; Vishal also found this in a very recent
internal review. Our fix is to first apply the new state before doing
the folio merge; see the flow described above.
>> [FIXUP] KVM: guest_memfd: Zap mappings before splitting backend folios
>> [FIXUP] mm: hugetlb_restructuring: Export hugetlb_restructuring_free_folio()
>>
>> (lkp sent me some more gmem compilation errors. I ignored them as I didn't
>> encounter them with my config and env).
>>
>> ...
>>
>> [0] RFC v2: https://lore.kernel.org/all/20250807093950.4395-1-yan.y.zhao@intel.com
>> [1] hugetlb-based gmem: https://github.com/googleprodkernel/linux-cc/tree/wip-gmem-conversions-hugetlb-restructuring-12-08-25
>> [2] gmem-population rework v2: https://lore.kernel.org/all/20251215153411.3613928-1-michael.roth@amd.com
>> [3] DPAMT v4: https://lore.kernel.org/kvm/20251121005125.417831-1-rick.p.edgecombe@intel.com
>> [4] kernel full stack: https://github.com/intel-staging/tdx/tree/huge_page_v3
>> [5] https://lore.kernel.org/all/aF0Kg8FcHVMvsqSo@yzhao56-desk.sh.intel.com
>> [6] https://lore.kernel.org/all/aGSoDnODoG2%2FpbYn@yzhao56-desk.sh.intel.com
>> [7] https://lore.kernel.org/all/CAGtprH9vdpAGDNtzje=7faHBQc9qTSF2fUEGcbCkfJehFuP-rw@mail.gmail.com
>> [8] https://github.com/intel-staging/tdx/commit/a8aedac2df44e29247773db3444bc65f7100daa1
>> [9] https://github.com/intel-staging/tdx/commit/8747667feb0b37daabcaee7132c398f9e62a6edd
>> [10] https://github.com/intel-staging/tdx/commit/ab29a85ec2072393ab268e231c97f07833853d0d
>> [11] https://github.com/intel-staging/tdx/commit/4feb6bf371f3a747b71fc9f4ded25261e66b8895
>>