Message-ID: <aW4kd4nCjP+9Akva@yzhao56-desk.sh.intel.com>
Date: Mon, 19 Jan 2026 20:32:55 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Huang, Kai" <kai.huang@...el.com>, "Du, Fan" <fan.du@...el.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Li, Xiaoyao"
<xiaoyao.li@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
"thomas.lendacky@....com" <thomas.lendacky@....com>, "tabba@...gle.com"
<tabba@...gle.com>, "vbabka@...e.cz" <vbabka@...e.cz>, "david@...nel.org"
<david@...nel.org>, "michael.roth@....com" <michael.roth@....com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"seanjc@...gle.com" <seanjc@...gle.com>, "Peng, Chao P"
<chao.p.peng@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
"ackerleytng@...gle.com" <ackerleytng@...gle.com>, "kas@...nel.org"
<kas@...nel.org>, "binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>,
"Weiny, Ira" <ira.weiny@...el.com>, "nik.borisov@...e.com"
<nik.borisov@...e.com>, "francescolavra.fl@...il.com"
<francescolavra.fl@...il.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
"sagis@...gle.com" <sagis@...gle.com>, "Gao, Chao" <chao.gao@...el.com>,
"Edgecombe, Rick P" <rick.p.edgecombe@...el.com>, "Miao, Jun"
<jun.miao@...el.com>, "Annapurve, Vishal" <vannapurve@...gle.com>,
"jgross@...e.com" <jgross@...e.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
"x86@...nel.org" <x86@...nel.org>
Subject: Re: [PATCH v3 11/24] KVM: x86/mmu: Introduce
kvm_split_cross_boundary_leafs()
On Mon, Jan 19, 2026 at 07:06:01PM +0800, Yan Zhao wrote:
> On Mon, Jan 19, 2026 at 06:40:50PM +0800, Huang, Kai wrote:
> > On Mon, 2026-01-19 at 18:11 +0800, Yan Zhao wrote:
> > > On Mon, Jan 19, 2026 at 04:49:58PM +0800, Huang, Kai wrote:
> > > > On Mon, 2026-01-19 at 08:35 +0000, Huang, Kai wrote:
> > > > > On Mon, 2026-01-19 at 09:28 +0800, Zhao, Yan Y wrote:
> > > > > > > I find the "cross_boundary" terminology extremely confusing. I also dislike
> > > > > > > the concept itself, in the sense that it shoves a weird, specific concept into
> > > > > > > the guts of the TDP MMU.
> > > > > > > The other wart is that it's inefficient when punching a large hole. E.g. say
> > > > > > > there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> > > > > > > userspace punches a 12TiB hole. Walking all ~12TiB just to _maybe_ split the head
> > > > > > > and tail pages is asinine.
> > > > > > That's a reasonable concern. I actually thought about it.
> > > > > > My consideration was as follows:
> > > > > > Currently, we don't have such large areas. Usually, the conversion ranges are
> > > > > > less than 1GB. Though the initial conversion which converts all memory from
> > > > > > private to shared may be wide, there are usually no mappings at that stage. So,
> > > > > > the traversal should be very fast (since the traversal doesn't even need to go
> > > > > > down to the 2MB/1GB level).
> > > > > >
> > > > > > If the caller of kvm_split_cross_boundary_leafs() finds it needs to convert a
> > > > > > very large range at runtime, it can optimize by invoking the API twice:
> > > > > > once for range [start, ALIGN(start, 1GB)), and
> > > > > > once for range [ALIGN_DOWN(end, 1GB), end).
> > > > > >
> > > > > > I can also implement this optimization within kvm_split_cross_boundary_leafs()
> > > > > > by checking the range size if you think that would be better.
> > > > >
> > > > > I am not sure why we even need kvm_split_cross_boundary_leafs(), if you
> > > > > want to optimize.
> > > > >
> > > > > I think I've raised this in v2, and asked why not just let the caller
> > > > > figure out the ranges to split for a given range (see at the end of
> > > > > [*]), because the "cross boundary" can only happen at the beginning and
> > > > > end of the given range, if at all.
> > > Hmm, the caller can only figure out when splitting is NOT necessary, e.g., if
> > > start is 1GB-aligned, then there's no need to split for start. However, if start
> > > is not 1GB/2MB-aligned, the caller has no idea if there's a 2MB mapping covering
> > > start - 1 and start.
> >
> > Why does the caller need to know?
> >
> > Let's only talk about 'start' for simplicity:
> >
> > - If start is 1G aligned, then no split is needed.
> >
> > - If start is not 1G-aligned but 2M-aligned, you split the range:
> >
> > [ALIGN_DOWN(start, 1G), ALIGN(start, 1G)) to 2M level.
> >
> > - If start is 4K-aligned only, you firstly split
> >
> > [ALIGN_DOWN(start, 1G), ALIGN(start, 1G))
> >
> > to 2M level, then you split
> >
> > [ALIGN_DOWN(start, 2M), ALIGN(start, 2M))
> >
> > to 4K level.
> >
> > Similar handling to 'end'. An additional thing is if one to-be-split-
> > range calculated from 'start' overlaps one calculated from 'end', the
> > split is only needed once.
> >
> > Wouldn't this work?
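For illustration, the start-side rule above could be sketched like this (a
user-space sketch with hypothetical helper and struct names; the in-kernel code
would operate on gfns with ALIGN()/ALIGN_DOWN()):

```c
#include <stdint.h>

#define SZ_2M (2ULL << 20)
#define SZ_1G (1ULL << 30)

/* Power-of-two alignment helpers, mirroring the kernel's ALIGN()/ALIGN_DOWN(). */
#define ALIGN_DOWN_POW2(x, a) ((x) & ~((uint64_t)(a) - 1))
#define ALIGN_UP_POW2(x, a)   (((x) + (a) - 1) & ~((uint64_t)(a) - 1))

struct split_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Compute the ranges that may need splitting for a possibly unaligned
 * 'start': at most one 1G->2M range and one 2M->4K range. Returns the
 * number of ranges written to r[] (0, 1, or 2).
 */
static int start_split_ranges(uint64_t start, struct split_range r[2])
{
	int n = 0;

	if (start == ALIGN_DOWN_POW2(start, SZ_1G))
		return 0;	/* 1G-aligned: no split needed */

	/* Split the enclosing 1G region down to 2M mappings. */
	r[n].start = ALIGN_DOWN_POW2(start, SZ_1G);
	r[n].end   = ALIGN_UP_POW2(start, SZ_1G);
	n++;

	if (start == ALIGN_DOWN_POW2(start, SZ_2M))
		return n;	/* 2M-aligned: no further 4K split needed */

	/* Split the enclosing 2M region down to 4K mappings. */
	r[n].start = ALIGN_DOWN_POW2(start, SZ_2M);
	r[n].end   = ALIGN_UP_POW2(start, SZ_2M);
	n++;

	return n;
}
```

The 'end' side is symmetric, and, as noted above, a range computed from
'start' that overlaps one computed from 'end' only needs to be split once.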
> It can work. But I don't think the calculations are necessary if the length
> of [start, end) is less than 1G or 2MB.
>
> e.g., if both start and end are just 4KB-aligned, of a length 8KB, the current
> implementation can invoke a single tdp_mmu_split_huge_pages_root() to split
> a 1GB mapping to 4KB directly. Why bother splitting twice for start or end?
I think I get your point now.
It's a good idea if introducing only_cross_boundary is undesirable.
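Roughly, I mean something like the sketch below (hypothetical names; an empty
or single returned range just means one walk suffices, and split_one() would
stand in for a call like tdp_mmu_split_huge_pages_root()):

```c
#include <stdint.h>

#define SZ_1G (1ULL << 30)
#define ALIGN_UP_1G(x)   (((x) + SZ_1G - 1) & ~(SZ_1G - 1))
#define ALIGN_DOWN_1G(x) ((x) & ~(SZ_1G - 1))

struct walk_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Decide which sub-ranges of [start, end) the splitter needs to walk.
 * For a short range (or when the head and tail pieces would overlap),
 * one walk over the whole range is simplest. For a large hole, only the
 * boundary pieces around start and end can contain leafs that cross the
 * range, so the bulk of the hole is skipped entirely. Returns the
 * number of ranges written to r[] (1 or 2).
 */
static int boundary_walk_ranges(uint64_t start, uint64_t end,
				struct walk_range r[2])
{
	uint64_t head_end   = ALIGN_UP_1G(start);
	uint64_t tail_start = ALIGN_DOWN_1G(end);

	if (end - start <= SZ_1G || head_end >= tail_start) {
		/* Short range or overlapping pieces: walk it all once. */
		r[0] = (struct walk_range){ start, end };
		return 1;
	}

	/* Large hole: walk only the head and tail boundary pieces. */
	r[0] = (struct walk_range){ start, head_end };
	r[1] = (struct walk_range){ tail_start, end };
	return 2;
}
```

With something like this, the 8KB example above stays a single walk, while
Sean's 12TiB hole degenerates to at most two sub-1G walks.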
So, the remaining question (as I asked at the bottom of [1]) is whether we could
create a specific function for this split use case, rather than reusing
tdp_mmu_split_huge_pages_root() which allocates pages outside of mmu_lock. This
way, we don't need to introduce a spinlock to protect the page enqueuing/
dequeueing of the per-VM external cache (see prealloc_split_cache_lock in patch
20 [2]).
Then we would disallow mirror_root for tdp_mmu_split_huge_pages_root(), which is
currently called for dirty page tracking in upstream code. Would this be
acceptable for TDX migration?
[1] https://lore.kernel.org/all/aW2Iwpuwoyod8eQc@yzhao56-desk.sh.intel.com/
[2] https://lore.kernel.org/all/20260106102345.25261-1-yan.y.zhao@intel.com/
> > > (for non-TDX cases, if start is not 1GB-aligned and is just 2MB-aligned,
> > > invoking tdp_mmu_split_huge_pages_root() is still necessary because there may
> > > exist a 1GB mapping covering start - 1 and start).
> > >
> > > In my reply to [*], I didn't want to do the calculation because I didn't see
> > > much overhead from always invoking tdp_mmu_split_huge_pages_root().
> > > But the scenario Sean pointed out is different. When both start and end are not
> > > 2MB-aligned, if [start, end) covers a huge range, we can still pre-calculate to
> > > reduce the iterations in tdp_mmu_split_huge_pages_root().
> >
> > I don't see much difference. Maybe I am missing something.
> The difference is the length of the range.
> For lengths < 1GB, always invoking tdp_mmu_split_huge_pages_root() without any
> calculation is simpler and more efficient.
>
> > >
> > > Opportunistically, optimization to skip splits for 1GB-aligned start or end is
> > > possible :)
> >
> > If this makes code easier to review/maintain then sure.
> >
> > As long as the solution is easy to review (i.e., not too complicated to
> > understand/maintain) then I am fine with whatever Sean/you prefer.
> >
> > However the 'cross_boundary_only' thing was indeed a bit odd to me when I
> > first saw this :-)