Message-ID: <aWdgfXNdBuzpVE2Z@yzhao56-desk.sh.intel.com>
Date: Wed, 14 Jan 2026 17:23:09 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Ackerley Tng <ackerleytng@...gle.com>, Vishal Annapurve
	<vannapurve@...gle.com>, <pbonzini@...hat.com>,
	<linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>, <x86@...nel.org>,
	<rick.p.edgecombe@...el.com>, <dave.hansen@...el.com>, <kas@...nel.org>,
	<tabba@...gle.com>, <michael.roth@....com>, <david@...nel.org>,
	<sagis@...gle.com>, <vbabka@...e.cz>, <thomas.lendacky@....com>,
	<nik.borisov@...e.com>, <pgonda@...gle.com>, <fan.du@...el.com>,
	<jun.miao@...el.com>, <francescolavra.fl@...il.com>, <jgross@...e.com>,
	<ira.weiny@...el.com>, <isaku.yamahata@...el.com>, <xiaoyao.li@...el.com>,
	<kai.huang@...el.com>, <binbin.wu@...ux.intel.com>, <chao.p.peng@...el.com>,
	<chao.gao@...el.com>
Subject: Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory

On Tue, Jan 13, 2026 at 05:24:36PM -0800, Sean Christopherson wrote:
> On Wed, Jan 14, 2026, Yan Zhao wrote:
> > On Mon, Jan 12, 2026 at 12:15:17PM -0800, Ackerley Tng wrote:
> > > Sean Christopherson <seanjc@...gle.com> writes:
> > > 
> > > > Mapping a hugepage for memory that KVM _knows_ is contiguous and homogeneous is
> > > > conceptually totally fine, i.e. I'm not totally opposed to adding support for
> > > > mapping multiple guest_memfd folios with a single hugepage. As to whether we
> > > 
> > > Sean, I'd like to clarify this.
> > > 
> > > > do (a) nothing,
> > > 
> > > What does "do nothing" mean here?
> 
> Don't support hugepages for shared mappings, at least for now (as Rick pointed
> out, doing nothing now doesn't mean we can't do something in the future).
> 
> > > In this patch series the TDX functions do sanity checks ensuring that
> > > mapping size <= folio size. IIUC the checks at mapping time, like the one
> > > in tdh_mem_page_aug(), would be fine, since at the time of mapping the
> > > mapping size <= folio size, but we'd be in trouble at the time of
> > > zapping, since that's when mapping sizes > folio sizes get discovered.
> > > 
> > > The sanity checks are in principle in direct conflict with allowing
> > > mapping of multiple guest_memfd folios at hugepage level.
> > > 
> > > > (b) change the refcounting, or
> > > 
> > > I think this is pretty hard unless something changes in core MM that
> > > allows refcounting to be customizable by the FS. guest_memfd would love
> > > to have that, but customizable refcounting is going to hurt refcounting
> > > performance throughout the kernel.
> > > 
> > > > (c) add support for mapping multiple folios in one hugepage,
> > > 
> > > Where would the changes need to be made? IIUC there aren't any checks
> > > elsewhere in KVM currently to ensure that mapping size <= folio size,
> > > other than the sanity checks in the TDX code proposed in this series.
> > > 
> > > Does any support need to be added, or is it about amending the
> > > unenforced/unwritten rule from "mapping size <= folio size" to "mapping
> > > size <= contiguous memory size"?
> >
> > The rule is not "unenforced/unwritten". In fact, it's the de facto standard in
> > KVM.
> 
> Ya, more or less.
> 
> The rules aren't formally documented because the overarching rule is very
> simple: KVM must not map memory into the guest that the guest shouldn't have
> access to.  That falls firmly into the "well, duh" category, and so it's not
> written down anywhere :-)
> 
> How exactly KVM has honored that rule has varied over the years, and still varies
> between architectures.  In the past KVM x86 special cased HugeTLB and THP, but
> that proved to be a pain to maintain and wasn't extensible, e.g. didn't play nice
> with DAX, and so KVM x86 pivoted to pulling the mapping size from the primary MMU
> page tables.
> 
> But arm64 still special cases THP and HugeTLB, *and* VM_PFNMAP memory (eww).
> 
> > For non-gmem cases, KVM uses the mapping size in the primary MMU as the max
> > mapping size in the secondary MMU, while the primary MMU does not create a
> > mapping larger than the backend folio size.
> 
> Super strictly speaking, this might not hold true for VM_PFNMAP memory.  E.g. a
> driver _could_ split a folio (no idea why it would) but map the entire thing into
> userspace, and then userspace could hand off that memory to KVM.
> 
> So I'd say _KVM's_ rule isn't so much "mapping size <= folio size" as it is
> "KVM mapping size <= primary MMU mapping size", at least for x86.  Arm's
> VM_PFNMAP code sketches me out a bit, but on the other hand, a driver mapping
> discontiguous pages into a single VM_PFNMAP VMA would be even more sketch.
> 
> But yes, ignoring VM_PFNMAP, AFAIK the primary MMU, and thus KVM, doesn't map larger
> than the folio size.

Oh. I forgot about the VM_PFNMAP case, which allows folios to be provided as
the backing memory. Indeed, a driver can create a huge mapping in the primary
MMU for a VM_PFNMAP range backed by multiple discontiguous pages, if it wants.
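
Just to make that scenario concrete for myself, a toy mmap handler along the
lines below is what I have in mind (pfn_a/pfn_b are made up, and note that
remap_pfn_range() only installs 4KB PTEs; a driver wanting a real huge mapping
would additionally have to install PMD-level entries from its huge_fault
handler, which I've omitted):

/*
 * Hypothetical driver mmap handler: maps two physically discontiguous
 * chunks back-to-back into a single VM_PFNMAP VMA.
 * remap_pfn_range() marks the VMA VM_IO | VM_PFNMAP itself.
 */
static unsigned long pfn_a, pfn_b;      /* made-up pfns the driver owns */

static int demo_mmap(struct file *file, struct vm_area_struct *vma)
{
        unsigned long half = (vma->vm_end - vma->vm_start) / 2;
        int ret;

        ret = remap_pfn_range(vma, vma->vm_start, pfn_a, half,
                              vma->vm_page_prot);
        if (ret)
                return ret;

        /* second half backed by an unrelated physical range */
        return remap_pfn_range(vma, vma->vm_start + half, pfn_b, half,
                               vma->vm_page_prot);
}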

But this occurs before KVM creates the mapping. Per my understanding, pages
under VM_PFNMAP are pinned, so it looks like there are no splits after they are
mapped into the primary MMU.

So, out of curiosity, do you know why the Linux kernel needs to unmap mappings
from both the primary and secondary MMUs, and to check the folio refcount,
before performing a folio split?

> > When splitting the backend folio, the Linux kernel unmaps the folio from both
> > the primary MMU and the KVM-managed secondary MMU (through the MMU notifier).
> > 
> > On the non-KVM side, though IOMMU stage-2 mappings are allowed to be larger
> > than folio sizes, splitting folios while they are still mapped in the IOMMU
> > stage-2 page table is not permitted due to the extra folio refcount held by the
> > IOMMU.
> > 
> > For gmem cases, KVM also does not create mappings larger than the folio size
> > allocated from gmem. This is why the TDX huge page series relies on gmem's
> > ability to allocate huge folios.
> > 
> > We really need to be careful if we hope to break this long-established rule.
> 
> +100 to being careful, but at the same time I don't think we should get _too_
> fixated on the guest_memfd folio size.  E.g. similar to VM_PFNMAP, where there
> might not be a folio, if guest_memfd stopped using folios, then the entire
> discussion becomes moot.
> 
> And as above, the long-standing rule isn't about the implementation details so
> much as it is about KVM's behavior.  If the simplest solution to support huge
> guest_memfd pages is to decouple the max order from the folio, then so be it.
> 
> That said, I'd very much like to get a sense of the alternatives, because at the
> end of the day, guest_memfd needs to track the max mapping sizes _somewhere_,
> and naively, tying that to the folio seems like an easy solution.

Thanks for the explanation.

Alternatively, how do you feel about the approach of splitting the S-EPT first,
before splitting folios?
If guest_memfd always splits 1GB folios to 2MB first and only splits the
converted range down to 4KB, splitting the S-EPT before splitting folios should
not introduce too much overhead. Then we can defer the folio size problem until
guest_memfd stops using folios.
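
Roughly, the ordering I have in mind is the following (the helper names are
hypothetical and only stand in for the actual split primitives):

/*
 * Hypothetical conversion path: demote the S-EPT mapping covering the range
 * first, and only then split the backing folio, so that
 * "mapping size <= folio size" is never violated in between.
 */
static int gmem_convert_range(struct kvm *kvm, struct folio *folio,
                              gfn_t gfn, int target_level)
{
        int ret;

        /* 1. Split the huge S-EPT entry down to target_level. */
        ret = kvm_split_private_spte(kvm, gfn, target_level);
        if (ret)
                return ret;

        /*
         * 2. The folio can now be split without ever leaving an S-EPT mapping
         *    larger than the backing folio.
         */
        return kvm_gmem_split_folio(folio, target_level);
}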

If the decision is to stop relying on folios for unmapping now, do you think
the following changes are reasonable for the TDX huge page series?

- Add a WARN_ON_ONCE() in tdh_mem_page_aug() to assert that the pages being
  mapped are in a single folio (rough sketch below).
- Do not assert that pages are in a single folio in
  tdh_phymem_page_wbinvd_hkid() (or just assert pfn_valid() for each page?).
  Could you please give me guidance on
  https://lore.kernel.org/kvm/aWb16XJuSVuyRu7l@yzhao56-desk.sh.intel.com.
- Add S-EPT splitting in kvm_gmem_error_folio() and fail if the split fails.
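
For the first item, I'm thinking of something like the sketch below
(pages_in_single_folio() is made up, and I'm assuming the SEAMCALL wrapper has
the struct page and the mapping level at hand):

/*
 * Return true iff the @npages pages starting at @page all belong to the same
 * folio, i.e. the range does not cross a folio boundary.
 */
static bool pages_in_single_folio(struct page *page, unsigned long npages)
{
        struct folio *folio = page_folio(page);

        return folio_page_idx(folio, page) + npages <= folio_nr_pages(folio);
}

/* in tdh_mem_page_aug(): */
WARN_ON_ONCE(!pages_in_single_folio(page, KVM_PAGES_PER_HPAGE(level)));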
