linux-kernel - Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aWhFwzlqqrnBLLiK@yzhao56-desk.sh.intel.com>
Date: Thu, 15 Jan 2026 09:41:23 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Ackerley Tng <ackerleytng@...gle.com>, Vishal Annapurve
	<vannapurve@...gle.com>, <pbonzini@...hat.com>,
	<linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>, <x86@...nel.org>,
	<rick.p.edgecombe@...el.com>, <dave.hansen@...el.com>, <kas@...nel.org>,
	<tabba@...gle.com>, <michael.roth@....com>, <david@...nel.org>,
	<sagis@...gle.com>, <vbabka@...e.cz>, <thomas.lendacky@....com>,
	<nik.borisov@...e.com>, <pgonda@...gle.com>, <fan.du@...el.com>,
	<jun.miao@...el.com>, <francescolavra.fl@...il.com>, <jgross@...e.com>,
	<ira.weiny@...el.com>, <isaku.yamahata@...el.com>, <xiaoyao.li@...el.com>,
	<kai.huang@...el.com>, <binbin.wu@...ux.intel.com>, <chao.p.peng@...el.com>,
	<chao.gao@...el.com>
Subject: Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory

On Wed, Jan 14, 2026 at 07:26:44AM -0800, Sean Christopherson wrote:
> On Wed, Jan 14, 2026, Yan Zhao wrote:
> > On Tue, Jan 13, 2026 at 05:24:36PM -0800, Sean Christopherson wrote:
> > > On Wed, Jan 14, 2026, Yan Zhao wrote:
> > > > For non-gmem cases, KVM uses the mapping size in the primary MMU as the max
> > > > mapping size in the secondary MMU, while the primary MMU does not create a
> > > > mapping larger than the backend folio size.
> > > 
> > > Super strictly speaking, this might not hold true for VM_PFNMAP memory.  E.g. a
> > > driver _could_ split a folio (no idea why it would) but map the entire thing into
> > > userspace, and then userspace could have off that memory to KVM.
> > > 
> > > So I wouldn't say _KVM's_ rule isn't so much "mapping size <= folio size", it's
> > > that "KVM mapping size <= primary MMU mapping size", at least for x86.  Arm's
> > > VM_PFNMAP code sketches me out a bit, but on the other hand, a driver mapping
> > > discontiguous pages into a single VM_PFNMAP VMA would be even more sketch.
> > > 
> > > But yes, ignoring VM_PFNMAP, AFAIK the primary MMU and thus KVM doesn't map larger
> > > than the folio size.
> > 
> > Oh. I forgot about the VM_PFNMAP case, which allows to provide folios as the
> > backend. Indeed, a driver can create a huge mapping in primary MMU for the
> > VM_PFNMAP range with multiple discontiguous pages, if it wants.
> > 
> > But this occurs before KVM creates the mapping. Per my understanding, pages
> > under VM_PFNMAP are pinned,
> 
> Nope.  Only the driver that owns the VMAs knows what sits behind the PFN and the
> lifecycle rules for that memory.
> 
> That last point is *very* important.  Even if the PFNs shoved into VM_PFNMAP VMAs
> have an associated "struct page", that doesn't mean the "struct page" is refcounted,
> i.e. can be pinned.  That detail was the heart of "KVM: Stop grabbing references to
> PFNMAP'd pages" overhaul[*].
> 
> To _safely_ map VM_PFNMAP into a secondary MMU, i.e. without relying on (priveleged)
> userspace to "do the right thing", the secondary MMU needs to be tied into
> mmu_notifiers, so that modifications to the mappings in the primary MMU are
> reflected into the secondary MMU.

You are right! It maps tail page of a !compound huge page, which is not
refcounted.

> [*] https://lore.kernel.org/all/20240726235234.228822-1-seanjc@google.com
> 
> > so it looks like there're no splits after they are mapped into the primary MMU.
> > 
> > So, out of curiosity, do you know why linux kernel needs to unmap mappings from
> > both primary and secondary MMUs, and check folio refcount before performing
> > folio splitting?
> 
> Because it's a straightforward rule for the primary MMU.  Similar to guest_memfd,
> if something is going through the effort of splitting a folio, then odds are very,
> very good that the new folios can't be safely mapped as a contiguous hugepage.
> Limiting mapping sizes to folios makes the rules/behavior straightfoward for core
> MM to implement, and for drivers/users to understand.
> 
> Again like guest_memfd, there needs to be _some_ way for a driver/filesystem to
> communicate the maximum mapping size; folios are the "currency" for doing so.
> 
> And then for edge cases that want to map a split folio as a hugepage (if any such
> edge cases exist), thus take on the responsibility of managing the lifecycle of
> the mappings, VM_PFNMAP and vmf_insert_pfn() provide the necessary functionality.

Thanks for the explanation.

> > > > When splitting the backend folio, the Linux kernel unmaps the folio from both
> > > > the primary MMU and the KVM-managed secondary MMU (through the MMU notifier).
> > > > 
> > > > On the non-KVM side, though IOMMU stage-2 mappings are allowed to be larger
> > > > than folio sizes, splitting folios while they are still mapped in the IOMMU
> > > > stage-2 page table is not permitted due to the extra folio refcount held by the
> > > > IOMMU.
> > > > 
> > > > For gmem cases, KVM also does not create mappings larger than the folio size
> > > > allocated from gmem. This is why the TDX huge page series relies on gmem's
> > > > ability to allocate huge folios.
> > > > 
> > > > We really need to be careful if we hope to break this long-established rule.
> > > 
> > > +100 to being careful, but at the same time I don't think we should get _too_
> > > fixated on the guest_memfd folio size.  E.g. similar to VM_PFNMAP, where there
> > > might not be a folio, if guest_memfd stopped using folios, then the entire
> > > discussion becomes moot.
> > > 
> > > And as above, the long-standing rule isn't about the implementation details so
> > > much as it is about KVM's behavior.  If the simplest solution to support huge
> > > guest_memfd pages is to decouple the max order from the folio, then so be it.
> > > 
> > > That said, I'd very much like to get a sense of the alternatives, because at the
> > > end of the day, guest_memfd needs to track the max mapping sizes _somewhere_,
> > > and naively, tying that to the folio seems like an easy solution.
> > Thanks for the explanation.
> > 
> > Alternatively, how do you feel about the approach of splitting S-EPT first
> > before splitting folios?
> > If guest_memfd always splits 1GB folios to 2MB first and only splits the
> > converted range to 4KB, splitting S-EPT before splitting folios should not
> > introduce too much overhead. Then, we can defer the folio size problem until
> > guest_memfd stops using folios.
> > 
> > If the decision is to stop relying on folios for unmapping now, do you think
> > the following changes are reasonable for the TDX huge page series?
> > 
> > - Add WARN_ON_ONCE() to assert that pages are in a single folio in
> >   tdh_mem_page_aug().
> > - Do not assert that pages are in a single folio in
> >   tdh_phymem_page_wbinvd_hkid(). (or just assert of pfn_valid() for each page?)
> >   Could you please give me guidance on
> >   https://lore.kernel.org/kvm/aWb16XJuSVuyRu7l@yzhao56-desk.sh.intel.com.
> > - Add S-EPT splitting in kvm_gmem_error_folio() and fail on splitting error.
> 
> Ok, with the disclaimer that I hadn't actually looked at the patches in this
> series before now...
> 
> TDX absolutely should not be doing _anything_ with folios.  I am *very* strongly
> opposed to TDX assuming that memory is backed by refcounted "struct page", and
> thus can use folios to glean the maximum mapping size.
> 
> guest_memfd is _the_ owner of that information.  guest_memfd needs to explicitly
> _tell_ the rest of KVM what the maximum mapping size is; arch code should not
> infer that size from a folio.
> 
> And that code+behavior already exists in the form of kvm_gmem_mapping_order() and
> its users, _and_ is plumbed all the way into tdx_mem_page_aug() as @level.  IIUC,
> the _only_ reason tdx_mem_page_aug() retrieves the page+folio is because
> tdx_clflush_page() ultimately requires a "struct page".  That is absolutely
> ridiculous and not acceptable.  CLFLUSH takes a virtual address, there is *zero*
> reason tdh_mem_page_aug() needs to require/assume a struct page.
Not really.

Per my understanding, tdx_mem_page_aug() requires "struct page" (and checks
folios for huge pages) because the SEAMCALL wrapper APIs are not currently built
into KVM. Since they may have callers other than KVM, some sanity checking in
case the caller does something incorrect seems necessary (e.g., in case the
caller provides an out-of-range struct page or a page with !pfn_valid() PFN).
This is similar to "VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio)" in
__folio_split().

With tdx_mem_page_aug() ensuring pages validity and contiguity, invoking local
static function tdx_clflush_page() page-per-page looks good to me.
Alternatively, we could convert tdx_clflush_page() to tdx_clflush_cache_range(),
which receives VA.

However, I'm not sure if my understanding is correct now, especially since it
seems like everyone thinks the SEAMCALL wrapper APIs should trust the caller,
assuming they are KVM-specific.

> Dave may feel differently, but I am not going to budge on this.  I am not going
> to bake in assumptions throughout KVM about memory being backed by page+folio.
> We _just_ cleaned up that mess in the aformentioned "Stop grabbing references to
> PFNMAP'd pages" series, I am NOT reintroducing such assumptions.
> 
> NAK to any KVM TDX code that pulls a page or folio out of a guest_memfd pfn.