Message-ID: <aXHDYWN7AlK/KqoO@yzhao56-desk.sh.intel.com>
Date: Thu, 22 Jan 2026 14:27:45 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Kai Huang <kai.huang@...el.com>, "pbonzini@...hat.com"
<pbonzini@...hat.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>, Fan Du
<fan.du@...el.com>, Xiaoyao Li <xiaoyao.li@...el.com>, Chao Gao
<chao.gao@...el.com>, Dave Hansen <dave.hansen@...el.com>,
"thomas.lendacky@....com" <thomas.lendacky@....com>, "vbabka@...e.cz"
<vbabka@...e.cz>, "tabba@...gle.com" <tabba@...gle.com>, "david@...nel.org"
<david@...nel.org>, "kas@...nel.org" <kas@...nel.org>, "michael.roth@....com"
<michael.roth@....com>, Ira Weiny <ira.weiny@...el.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>,
"ackerleytng@...gle.com" <ackerleytng@...gle.com>, "nik.borisov@...e.com"
<nik.borisov@...e.com>, Isaku Yamahata <isaku.yamahata@...el.com>, "Chao P
Peng" <chao.p.peng@...el.com>, "francescolavra.fl@...il.com"
<francescolavra.fl@...il.com>, "sagis@...gle.com" <sagis@...gle.com>, "Vishal
Annapurve" <vannapurve@...gle.com>, Rick P Edgecombe
<rick.p.edgecombe@...el.com>, Jun Miao <jun.miao@...el.com>,
"jgross@...e.com" <jgross@...e.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
"x86@...nel.org" <x86@...nel.org>
Subject: Re: [PATCH v3 11/24] KVM: x86/mmu: Introduce
kvm_split_cross_boundary_leafs()
On Tue, Jan 20, 2026 at 09:51:06AM -0800, Sean Christopherson wrote:
> On Mon, Jan 19, 2026, Yan Zhao wrote:
> > On Fri, Jan 16, 2026 at 03:39:13PM -0800, Sean Christopherson wrote:
> > > On Thu, Jan 15, 2026, Kai Huang wrote:
> > > > So how about:
> > > >
> > > > Rename kvm_tdp_mmu_try_split_huge_pages() to
> > > > kvm_tdp_mmu_split_huge_pages_log_dirty(), and rename
> > > > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() to
> > > > kvm_tdp_mmu_split_huge_pages_cross_boundary()
> > > >
> > > > ?
> > >
> > > I find the "cross_boundary" terminology extremely confusing. I also dislike
> > > the concept itself, in the sense that it shoves a weird, specific concept into
> > > the guts of the TDP MMU.
> > > The other wart is that it's inefficient when punching a large hole. E.g. say
> > > there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> > > userspace punches a 12TiB hole. Walking all ~12TiB just to _maybe_ split the head
> > > and tail pages is asinine.
> > That's a reasonable concern. I actually thought about it.
> > My consideration was as follows:
> > Currently, we don't have such large areas. Usually, the conversion ranges are
> > less than 1GB.
>
> Nothing guarantees that behavior.
>
> > Though the initial conversion which converts all memory from private to
> > shared may be wide, there are usually no mappings at that stage. So, the
> > traversal should be very fast (since the traversal doesn't even need to go
> > down to the 2MB/1GB level).
> >
> > If the caller of kvm_split_cross_boundary_leafs() finds it needs to convert a
> > very large range at runtime, it can optimize by invoking the API twice:
> > once for range [start, ALIGN(start, 1GB)), and
> > once for range [ALIGN_DOWN(end, 1GB), end).
> >
> > I can also implement this optimization within kvm_split_cross_boundary_leafs()
> > by checking the range size if you think that would be better.
> >
> > > And once kvm_arch_pre_set_memory_attributes() is dropped, I'm pretty sure the
> > > _only_ usage is for guest_memfd PUNCH_HOLE, because unless I'm misreading the
> > > code, the usage in tdx_honor_guest_accept_level() is superfluous and confusing.
> > Sorry for the confusion about the usage of tdx_honor_guest_accept_level(). I
> > should add a better comment.
> >
> > There are 4 use cases for the API kvm_split_cross_boundary_leafs():
> > 1. PUNCH_HOLE
> > 2. KVM_SET_MEMORY_ATTRIBUTES2, which invokes kvm_gmem_set_attributes() for
> > private-to-shared conversions
> > 3. tdx_honor_guest_accept_level()
> > 4. kvm_gmem_error_folio()
> >
> > Use cases 1-3 are already in the current code. Use case 4 is per our discussion,
> > and will be implemented in the next version (because guest_memfd may split
> > folios without first splitting S-EPT).
> >
> > The 4 use cases can be divided into two categories:
> >
> > 1. Category 1: use cases 1, 2, 4
> > We must ensure GFN start - 1 and GFN start are not mapped in a single
> > mapping. However, for GFN start or GFN start - 1 specifically, we don't care
> > about their actual mapping levels, which means they are free to be mapped at
> > 2MB or 1GB. The same applies to GFN end - 1 and GFN end.
> >
> > --|------------------|-----------
> >   ^                  ^
> >   start              end - 1
> >
> > 2. Category 2: use case 3
> > It cares about the mapping level of the GFN, i.e., it must not be mapped
> > above a certain level.
> >
> > -----|-------
> >      ^
> >      GFN
> >
> > So, to unify the two categories, I have tdx_honor_guest_accept_level() check
> > the range of [level-aligned GFN, level-aligned GFN + level size). e.g.,
> > If the accept level is 2MB, only a 1GB mapping can possibly fall outside the
> > range and need splitting.
>
> But that overlooks the fact that Category 2 already fits the existing "category"
> that is supported by the TDP MMU. I.e. Category 1 is (somewhat) new and novel,
> Category 2 is not.
>
> > -----|-------------|---
> >      ^             ^
> >      |             |
> >   level-aligned    level-aligned
> >   GFN              GFN + level size - 1
> >
> >
> > > For the EPT violation case, the guest is accepting a page. Just split to the
> > > guest's accepted level, I don't see any reason to make things more complicated
> > > than that.
> > This use case could reuse the kvm_mmu_try_split_huge_pages() API, except that we
> > need a return value.
>
> Just expose tdp_mmu_split_huge_pages_root(), the fault path only _needs_ to split
> the current root, and in fact shouldn't even try to split other roots (ignoring
> that no other relevant roots exist).
Ok.
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 9c26038f6b77..7d924da75106 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1555,10 +1555,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
> return ret;
> }
>
> -static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> - struct kvm_mmu_page *root,
> - gfn_t start, gfn_t end,
> - int target_level, bool shared)
> +int tdp_mmu_split_huge_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
> + gfn_t start, gfn_t end, int target_level,
> + bool shared)
> {
> struct kvm_mmu_page *sp = NULL;
> struct tdp_iter iter;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index bd62977c9199..ea9a509608fb 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -93,6 +93,9 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
> struct kvm_memory_slot *slot, gfn_t gfn,
> int min_level);
>
> +int tdp_mmu_split_huge_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
> + gfn_t start, gfn_t end, int target_level,
> + bool shared);
> void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
> const struct kvm_memory_slot *slot,
> gfn_t start, gfn_t end,
>
> > > And then for the PUNCH_HOLE case, do the math to determine which, if any, head
> > > and tail pages need to be split, and use the existing APIs to make that happen.
> > This use case cannot reuse kvm_mmu_try_split_huge_pages() without modification.
>
> Modifying existing code is a non-issue, and you're already modifying TDP MMU
> functions, so I don't see that as a reason for choosing X instead of Y.
>
> > Or which existing APIs are you referring to?
>
> See above.
Ok. Do you like the idea of introducing an only_cross_boundary flag (or something
with a different name) to tdp_mmu_split_huge_pages_root()?
If not, could I expose a helper to calculate the head/tail ranges, e.g. along the
lines of the sketch below?
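(Rough sketch of such a helper, with a made-up name and assuming 1GB is the
largest possible mapping size -- illustration only, not the actual patch code:)

static void punch_hole_split_ranges(gfn_t start, gfn_t end,
                                    gfn_t *head_end, gfn_t *tail_start)
{
        gfn_t huge_pages = KVM_PAGES_PER_HPAGE(PG_LEVEL_1G);

        /* Head: from start up to the next 1GB-aligned GFN (capped at end). */
        *head_end = min(end, ALIGN(start, huge_pages));
        /* Tail: from the last 1GB-aligned GFN below end (capped at start). */
        *tail_start = max(start, ALIGN_DOWN(end, huge_pages));
}

The caller would then only walk [start, head_end) and [tail_start, end), since
only mappings straddling the two boundaries can need splitting.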
> > The cross_boundary information is still useful?
> >
> > BTW: Currently, kvm_split_cross_boundary_leafs() internally reuses
> > tdp_mmu_split_huge_pages_root() (as shown below).
> >
> > kvm_split_cross_boundary_leafs
> > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs
> > tdp_mmu_split_huge_pages_root
> >
> > However, tdp_mmu_split_huge_pages_root() is originally used to split huge
> > mappings in a wide range, so it temporarily releases mmu_lock for memory
> > allocation for sp, since it can't predict how many pages to pre-allocate in the
> > KVM mmu cache.
> >
> > For kvm_split_cross_boundary_leafs(), we can actually predict the max number of
> > pages to pre-allocate. If we don't reuse tdp_mmu_split_huge_pages_root(), we can
> > allocate sp, sp->spt, sp->external_spt and DPAMT pages from the KVM mmu cache
> > without releasing mmu_lock and invoking tdp_mmu_alloc_sp_for_split().
>
> That's completely orthogonal to the "only need to maybe split head and tail pages".
> E.g. kvm_tdp_mmu_try_split_huge_pages() can also predict the _max_ number of pages
> to pre-allocate, it's just not worth adding a kvm_mmu_memory_cache for that use
> case because that path can drop mmu_lock at will, unlike the full page fault path.
> I.e. the complexity doesn't justify the benefits, especially since the max number
> of pages is so large.
Right, it's technically feasible, but not practical.
To split a huge range, e.g. 16GB, down to 4KB, the _max_ number of pages to
pre-allocate is too large: it's 16*512=8192 pages even without TDX (one page
table per 2MB mapping in the range).
> AFAICT, the only pre-allocation that is _necessary_ is for the dynamic PAMT,
Yes, patch 20 in this series just pre-allocates DPAMT pages for splitting.
See:
static int tdx_min_split_cache_sz(struct kvm *kvm, int level)
{
        KVM_BUG_ON(level != PG_LEVEL_2M, kvm);

        if (!tdx_supports_dynamic_pamt(tdx_sysinfo))
                return 0;

        return tdx_dpamt_entry_pages() * 2;
}
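(For completeness, a hypothetical example of how that value could be consumed
from ioctl context, before taking mmu_lock -- the cache name is made up:)

        r = kvm_mmu_topup_memory_cache(&kvm_tdx->prealloc_split_cache,
                                       tdx_min_split_cache_sz(kvm, PG_LEVEL_2M));
        if (r)
                return r;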
> because the allocation is done outside of KVM's control. But that's a solvable
> problem, the tricky part is protecting the PAMT cache for PUNCH_HOLE, but that
> too is solvable, e.g. by adding a per-VM mutex that's taken by kvm_gmem_punch_hole()
I don't get why only the PUNCH_HOLE case needs to be protected.
It's not guaranteed that KVM_SET_MEMORY_ATTRIBUTES2 ioctls are issued from vCPU
contexts.
BTW: the split_external_spte() hook is invoked under mmu_lock, so I used a
spinlock, kvm_tdx->prealloc_split_cache_lock, to protect cache enqueuing and
dequeuing.
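(Roughly, the dequeue side under mmu_lock looks like the sketch below -- again
with made-up names, not the actual patch code:)

static void *tdx_split_cache_alloc(struct kvm_tdx *kvm_tdx)
{
        void *p;

        /*
         * split_external_spte() runs under mmu_lock and cannot sleep to
         * allocate, so only dequeue pre-allocated pages here; the spinlock
         * serializes this against enqueuing from other contexts.
         */
        spin_lock(&kvm_tdx->prealloc_split_cache_lock);
        p = kvm_mmu_memory_cache_alloc(&kvm_tdx->prealloc_split_cache);
        spin_unlock(&kvm_tdx->prealloc_split_cache_lock);

        return p;
}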
> to handle the PUNCH_HOLE case, and then using the per-vCPU cache when splitting
> for a mismatched accept.
Yes, I plan to use a per-vCPU cache in the future, e.g., when splitting under
the shared mmu_lock to honor the guest accept level.