Message-ID: <aW2Iwpuwoyod8eQc@yzhao56-desk.sh.intel.com>
Date: Mon, 19 Jan 2026 09:28:34 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Kai Huang <kai.huang@...el.com>, "pbonzini@...hat.com"
	<pbonzini@...hat.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>, Fan Du
	<fan.du@...el.com>, Xiaoyao Li <xiaoyao.li@...el.com>, Chao Gao
	<chao.gao@...el.com>, Dave Hansen <dave.hansen@...el.com>,
	"thomas.lendacky@....com" <thomas.lendacky@....com>, "vbabka@...e.cz"
	<vbabka@...e.cz>, "tabba@...gle.com" <tabba@...gle.com>, "david@...nel.org"
	<david@...nel.org>, "kas@...nel.org" <kas@...nel.org>, "michael.roth@....com"
	<michael.roth@....com>, Ira Weiny <ira.weiny@...el.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>,
	"ackerleytng@...gle.com" <ackerleytng@...gle.com>, "nik.borisov@...e.com"
	<nik.borisov@...e.com>, Isaku Yamahata <isaku.yamahata@...el.com>, "Chao P
 Peng" <chao.p.peng@...el.com>, "francescolavra.fl@...il.com"
	<francescolavra.fl@...il.com>, "sagis@...gle.com" <sagis@...gle.com>, "Vishal
 Annapurve" <vannapurve@...gle.com>, Rick P Edgecombe
	<rick.p.edgecombe@...el.com>, Jun Miao <jun.miao@...el.com>,
	"jgross@...e.com" <jgross@...e.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
	"x86@...nel.org" <x86@...nel.org>
Subject: Re: [PATCH v3 11/24] KVM: x86/mmu: Introduce
 kvm_split_cross_boundary_leafs()

On Fri, Jan 16, 2026 at 03:39:13PM -0800, Sean Christopherson wrote:
> On Thu, Jan 15, 2026, Kai Huang wrote:
> > static int __kvm_tdp_mmu_split_huge_pages(struct kvm *kvm, 
> > 					  struct kvm_gfn_range *range,
> > 					  int target_level,
> > 					  bool shared,
> > 					  bool cross_boundary_only)
> > {
> > 	...
> > }
> > 
> > And by using this helper, I found the names of the two wrapper functions
> > are not ideal:
> > 
> > kvm_tdp_mmu_try_split_huge_pages() is only for log dirty, and it should
> > not be reachable for TD (VM with mirrored PT).  But currently it uses
> > KVM_VALID_ROOTS for root filter thus mirrored PT is also included.  I
> > think it's better to rename it, e.g., at least with "log_dirty" in the
> > name so it's more clear this function is only for dealing log dirty (at
> > least currently).  We can also add a WARN() if it's called for VM with
> > mirrored PT but it's a different topic.
> > 
> > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() doesn't have
> > "huge_pages", which isn't consistent with the other.  And it is a bit
> > long.  If we don't have "gfn_range" in __kvm_tdp_mmu_split_huge_pages(),
> > then I think we can remove "gfn_range" from
> > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() too to make it shorter.
> > 
> > So how about:
> > 
> > Rename kvm_tdp_mmu_try_split_huge_pages() to
> > kvm_tdp_mmu_split_huge_pages_log_dirty(), and rename
> > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() to
> > kvm_tdp_mmu_split_huge_pages_cross_boundary()
> > 
> > ?
> 
> I find the "cross_boundary" terminology extremely confusing.  I also dislike
> the concept itself, in the sense that it shoves a weird, specific concept into
> the guts of the TDP MMU.
> The other wart is that it's inefficient when punching a large hole.  E.g. say
> there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> userspace punches a 12TiB hole.  Walking all ~12TiB just to _maybe_ split the head
> and tail pages is asinine.
That's a reasonable concern, and one I did think about.
My reasoning was as follows:
Currently, we don't have such large areas; conversion ranges are usually less
than 1GB. Though the initial conversion, which converts all memory from private
to shared, may cover a wide range, there are usually no mappings at that stage.
So, the traversal should be very fast (it doesn't even need to descend to the
2MB/1GB level).

If the caller of kvm_split_cross_boundary_leafs() finds it needs to convert a
very large range at runtime, it can optimize by invoking the API twice:
once for range [start, ALIGN(start, 1GB)), and
once for range [ALIGN_DOWN(end, 1GB), end).

I can also implement this optimization within kvm_split_cross_boundary_leafs()
by checking the range size if you think that would be better.

> And once kvm_arch_pre_set_memory_attributes() is dropped, I'm pretty sure the
> _only_ usage is for guest_memfd PUNCH_HOLE, because unless I'm misreading the
> code, the usage in tdx_honor_guest_accept_level() is superfluous and confusing.
Sorry for the confusion about the usage of tdx_honor_guest_accept_level(). I
should add a better comment.

There are 4 use cases for the API kvm_split_cross_boundary_leafs():
1. PUNCH_HOLE
2. KVM_SET_MEMORY_ATTRIBUTES2, which invokes kvm_gmem_set_attributes() for
   private-to-shared conversions
3. tdx_honor_guest_accept_level()
4. kvm_gmem_error_folio()

Use cases 1-3 are already in the current code. Use case 4 is per our discussion,
and will be implemented in the next version (because guest_memfd may split
folios without first splitting S-EPT).

The 4 use cases can be divided into two categories:

1. Category 1: use cases 1, 2, 4
   We must ensure GFN start - 1 and GFN start are not mapped in a single
   mapping. However, for GFN start or GFN start - 1 specifically, we don't care
   about their actual mapping levels, which means they are free to be mapped at
   2MB or 1GB. The same applies to GFN end - 1 and GFN end.

   --|------------------|-----------
     ^                  ^
    start              end - 1 

2. Category 2: use case 3
   It cares about the mapping level of the GFN, i.e., it must not be mapped
   above a certain level.

   -----|-------
        ^
       GFN

   So, to unify the two categories, I have tdx_honor_guest_accept_level() check
   the range [level-aligned GFN, level-aligned GFN + level size). E.g., if the
   accept level is 2MB, only a 1GB mapping can extend beyond that range and
   need splitting.

   -----|-------------|---
        ^             ^
        |             |
   level-aligned     level-aligned
      GFN            GFN + level size - 1


> For the EPT violation case, the guest is accepting a page.  Just split to the
> guest's accepted level, I don't see any reason to make things more complicated
> than that.
This use case could reuse the kvm_mmu_try_split_huge_pages() API, except that we
need a return value.

> And then for the PUNCH_HOLE case, do the math to determine which, if any, head
> and tail pages need to be split, and use the existing APIs to make that happen.
This use case cannot reuse kvm_mmu_try_split_huge_pages() without modification.
Or which existing APIs are you referring to?
Is the cross_boundary information still useful here?

BTW: Currently, kvm_split_cross_boundary_leafs() internally reuses
tdp_mmu_split_huge_pages_root() (as shown below).

kvm_split_cross_boundary_leafs
  kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs
    tdp_mmu_split_huge_pages_root

However, tdp_mmu_split_huge_pages_root() was originally designed to split huge
mappings across a wide range, so it temporarily releases mmu_lock to allocate
memory for the sp, since it cannot predict how many pages to pre-allocate in
the KVM MMU cache.

For kvm_split_cross_boundary_leafs(), we can actually predict the maximum
number of pages to pre-allocate. If we don't reuse
tdp_mmu_split_huge_pages_root(), we can allocate the sp, sp->spt,
sp->external_spt and DPAMT pages from the KVM MMU cache without releasing
mmu_lock or invoking tdp_mmu_alloc_sp_for_split(). Do you think this approach
is better?
