Message-ID: <aR1iR4knhC52JsUO@yzhao56-desk.sh.intel.com>
Date: Wed, 19 Nov 2025 14:23:03 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Huang, Kai" <kai.huang@...el.com>
CC: "Du, Fan" <fan.du@...el.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>,
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Hansen, Dave"
	<dave.hansen@...el.com>, "david@...hat.com" <david@...hat.com>,
	"thomas.lendacky@....com" <thomas.lendacky@....com>, "tabba@...gle.com"
	<tabba@...gle.com>, "vbabka@...e.cz" <vbabka@...e.cz>, "kas@...nel.org"
	<kas@...nel.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "seanjc@...gle.com" <seanjc@...gle.com>,
	"pbonzini@...hat.com" <pbonzini@...hat.com>, "binbin.wu@...ux.intel.com"
	<binbin.wu@...ux.intel.com>, "ackerleytng@...gle.com"
	<ackerleytng@...gle.com>, "michael.roth@....com" <michael.roth@....com>,
	"Weiny, Ira" <ira.weiny@...el.com>, "Peng, Chao P" <chao.p.peng@...el.com>,
	"Yamahata, Isaku" <isaku.yamahata@...el.com>, "Annapurve, Vishal"
	<vannapurve@...gle.com>, "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
	"Miao, Jun" <jun.miao@...el.com>, "x86@...nel.org" <x86@...nel.org>,
	"pgonda@...gle.com" <pgonda@...gle.com>
Subject: Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce
 kvm_split_cross_boundary_leafs()

On Tue, Nov 18, 2025 at 06:49:31PM +0800, Huang, Kai wrote:
> > > So I am kinda confused.
> > > 
> > > Perhaps you mean for "shared memory of TDX guest", the caller will also pass
> > > 'only_cross_boundary == true' and the caller needs to perform TLB flush?
> > Sorry for the confusion. 
> > 
> > Currently 'only_cross_boundary == true' is only for TDX private memory.
> > 
> > Returning flush is because kvm_split_cross_boundary_leafs() may potentially
> > be invoked for non-TDX cases as well in the future (though currently it's
> > only invoked for TDX).  When that happens, it's better to return flush so
> > that the caller doesn't have to flush unconditionally.
> 
> Exactly what "future" cases are you referring to?
> 
> Why do we need to consider it *NOW*?
The API kvm_split_cross_boundary_leafs() does not require the root to be a
mirror root. So, it's better to return the "split" status and let the caller
decide whether to invoke kvm_flush_remote_tlbs(). This is for code completeness
and consistency, as explained in [1].

[1] https://lore.kernel.org/all/aR08f%2Fn7j0RyGlUn@yzhao56-desk.sh.intel.com/
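
For illustration, the intended calling convention would look roughly like the
below (just a sketch; the exact signature of kvm_split_cross_boundary_leafs()
and the gfn_range argument are assumed here, not copied from the patch):

	bool split = false;
	int ret;

	ret = kvm_split_cross_boundary_leafs(kvm, gfn_range, &split);
	if (ret)
		return ret;

	/* The caller, not the split helper, decides whether to flush. */
	if (split)
		kvm_flush_remote_tlbs(kvm);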
 
> > 
> > Another reason is to keep consistency with tdp_mmu_zap_leafs(), which returns
> > flush without differentiating whether the zap is for a mirror root or not. So,
> > though kvm_mmu_remote_flush() on a mirror root is not necessary, it's
> > intentionally left as a future optimization.
> 
> You mean non-blocking DEMOTE won't need to flush TLB internally when splitting
> but the caller needs to do the flush?
I mean that, as an API implemented in KVM core, it's better for
kvm_split_cross_boundary_leafs() not to make any assumption about whether TDX
has internally performed the TLB flush or whether non-blocking DEMOTE needs a
flush.

We can return the "split" status and let callers decide whether to flush.

The optimization to avoid invoking kvm_flush_remote_tlbs() for zaps/splits on a
mirror root alone can be implemented in the future when necessary.
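
For example, such a future optimization could look roughly like the below
(a sketch only; it assumes a helper like is_mirror_sp() is available to tell
whether the root is a mirror root, which is not part of this series):

	/* Hypothetical: skip the flush when only the mirror root was touched. */
	if (split && !is_mirror_sp(root))
		kvm_flush_remote_tlbs(kvm);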

> Anyway, all of the above is not mentioned in the changelog.  I think we need a
> clear explanation in the changelog to justify the change.
Will do.

> > > [...]
> > > 
> > > > > 
> > > > > Something like below:
> > > > > 
> > > > > @@ -1558,7 +1558,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
> > > > >  static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > > >                                          struct kvm_mmu_page *root,
> > > > >                                          gfn_t start, gfn_t end,
> > > > > -                                        int target_level, bool shared)
> > > > > +                                        int target_level, bool shared,
> > > > > +                                        bool only_cross_boundary,
> > > > > +                                        bool *split)
> > > > >  {
> > > > >         struct kvm_mmu_page *sp = NULL;
> > > > >         struct tdp_iter iter;
> > > > > @@ -1584,6 +1586,9 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > > >                 if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
> > > > >                         continue;
> > > > >  
> > > > > +               if (only_cross_boundary && !iter_cross_boundary(&iter, start, end))
> > > > > +                       continue;
> > > > > +
> > > > >                 if (!sp) {
> > > > >                         rcu_read_unlock();
> > > > >  
> > > > > @@ -1618,6 +1623,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > > >                         goto retry;
> > > > >  
> > > > >                 sp = NULL;
> > > > > +               *split = true;
> > > > >         }
> > > > >  
> > > > >         rcu_read_unlock();
> > > > This looks more reasonable for tdp_mmu_split_huge_pages_root().
> > > > 
> > > > Given that splitting only adds a new page to the paging structure (unlike page
> > > > merging), I can't think of any current use cases that would be broken by the
> > > > lack of a TLB flush before tdp_mmu_iter_cond_resched() releases the mmu_lock.
> > > > 
> > > > This is because:
> > > > 1) if the split is triggered in a fault path, the hardware shouldn't have cached
> > > >    the old huge translation.
> > > > 2) if the split is triggered in a zap or convert path,
> > > >    - there shouldn't be concurrent faults on the range due to the protection of
> > > >      mmu_invalidate_range*.
> > > >    - for concurrent splits on the same range, though the other vCPUs may
> > > >      temporarily see stale huge TLB entries after they believe they have
> > > >      performed a split, they will be kicked to flush the cache soon after
> > > >      tdp_mmu_split_huge_pages_root() returns in the first vCPU/host thread.
> > > >      This should be acceptable since I don't see any special guest needs that
> > > >      rely on pure splits.
> > > 
> > > Perhaps we should just go straight to the point:
> > > 
> > >   What does "hugepage split" do, and what's the consequence of not flushing TLB.
> > > 
> > > Per make_small_spte(), the new child PTEs will carry all bits of hugepage PTE
> > > except they clear the 'hugepage bit (obviously)', and set the 'X' bit for NX
> > > hugepage thing.
> > > 
> > > That means if we leave the stale hugepage TLB, the CPU is still able to find the
> > > correct PFN and AFAICT there shouldn't be any other problem here.  For any fault
> > > due to the stale hugepage TLB missing the 'X' permission, AFAICT KVM will just
> > > treat this as a spurious fault, which isn't nice but should have no harm.
> > Right, that isn't nice, though no harm.
> > 
> > Besides, I'm thinking of a scenario that doesn't currently exist, though.
> > 
> >     CPU 0                                 CPU 1
> > a1. split pages
> > a2. write protect pages
> >                                        b1. split pages
> >                                        b2. write protect pages
> >                                        b3. start dirty page tracking
> > a3. flush TLB
> > a4. start dirty page tracking
> > 
> > 
> > If CPU 1 does not flush the TLB after b2 (e.g., because it finds the pages have
> > already been split and write-protected by a1&a2), it will miss some dirty pages.
> 
> Do you have any actual concrete plan to foresee this is likely to happen in the
> future?  E.g., why CPU1 wants to skip TLB flush after b2 due to a1&a2 etc?
> 
> To be honest I don't think we should discuss those hypothetical problems.
Sorry about the confusion.

I just wanted to explain why I thought it was safer to flush before releasing
mmu_lock. However, as I mentioned in the previous replies, I can't find any
current use case impacted by skipping this flush. So I think it's ok not to
flush before releasing mmu_lock.

Will update the patch comment to explain why skipping the flush is ok.
(I think the current upstream tdp_mmu_split_huge_pages_root() lacks a comment
explaining why it's safe not to flush before releasing mmu_lock.)
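
Roughly along the lines of the below (my wording for illustration, not the
final comment text):

	/*
	 * No TLB flush is needed before yielding mmu_lock: the child SPTEs
	 * installed by a split carry the same PFNs and permissions as the
	 * huge SPTE (modulo the NX-hugepage execute bit), so a vCPU holding
	 * a stale huge translation can at worst take a spurious fault.
	 */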

> > Currently CPU 1 always flushes the TLB before b3 unconditionally, so there's no
> > problem.
> > 
> > > > So I tend to agree with your suggestion though the implementation in this patch
> > > > is safer.
> > > 
> > > I am perhaps still missing something, as I am still trying to precisely
> > > understand in what cases you want to flush TLB when splitting hugepage.
> > > 
> > > I kinda tend to think you eventually want to flush TLB because eventually you
> > > want to _ZAP_.  But needing to flush due to zap and needing to flush due to
> > > split is kinda different I think.
> > 
> > Though I currently can't find any use case that depends on split alone, e.g.
> > a feature requiring the pages to be 4KB without any additional permission
> > changes, I just wanted to make the code safer in case I missed any edge cases.
> > 
> > We surely don't want the window in which CPUs see a mix of huge and small pages
> > to last long.
> > 
> > Flushing the TLB before releasing the mmu_lock allows other threads operating on
> > the same range to see updated translations promptly.
> 
> In the upstream code most callers of tdp_mmu_iter_cond_resched() call it w/o
> flushing TLB when yield happens, so the "window of stale TLB" already exists --
> it's just not stale hugepage TLBs, but other stale TLBs.
> 
> But I agree it's not good to have stale TLBs, and looking at
> recover_huge_pages_range(), it also does TLB flush when yielding if there's
> already hugepage merge happened.
> 
> So if you want to make tdp_mmu_split_huge_pages_root() handle the TLB flush,
> perhaps we can make it like recover_huge_pages_range().  But AFAICT we also want
> to make tdp_mmu_split_huge_pages_root() return whether a flush is needed, but not
> actually perform the TLB flush for the non-yielding case, because otherwise we
> need to revisit the log-dirty code to avoid a duplicated TLB flush.
> 
> And then the 'only_cross_boundary' can be added to it.
> 
> Btw, a second thought on the 'only_cross_boundary':
> 
> My first impression of 'only_cross_boundary' was that it's a little bit odd,
> because you actually only need to split the hugepages where 'start' and 'end'
> fall in the middle of a hugepage.
> 
> So alternatively, instead of yet adding another 'only_cross_boundary' to
> tdp_mmu_split_huge_pages_root(), I think we can also make the caller check the
> range and only call tdp_mmu_split_huge_pages_root() when the range crosses the
> hugepage boundary?
I don't think that's a good idea.

- The code to check whether a range crosses a hugepage boundary is cumbersome and
  level dependent (see the sketch at the end of this reply). As you point out
  below, [1G, 1G + 2M) does not need splitting if there are only 2M mappings, but
  it does need splitting when there are 1G mappings.
 
  For an API implemented in KVM core, it's better not to assume there are no 1G
  mappings. Not to mention TDX itself will support 1G pages in the future.

- Even when a range is determined to truly need splitting, e.g. [1G-8K, 1G+8K),
  tdp_mmu_split_huge_pages_root() still needs to return the 'split' status, since
  splitting may not actually occur if there are no present mappings or no 2M
  mappings.
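
To make the level dependence concrete, a per-leaf check would have to look
something like the below (a sketch of what an iter_cross_boundary()-style
helper does; the helper name and the use of gfn_round_for_level() /
KVM_PAGES_PER_HPAGE() here are my assumptions for illustration):

	static bool leaf_crosses_boundary(gfn_t gfn, int level, gfn_t start, gfn_t end)
	{
		gfn_t base = gfn_round_for_level(gfn, level);
		gfn_t last = base + KVM_PAGES_PER_HPAGE(level);

		/* Split only if start or end cuts through this leaf mapping. */
		return (start > base && start < last) || (end > base && end < last);
	}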

> E.g., a range [1G, 2G) doesn't cross any 2M boundary, thus the caller can skip
> calling tdp_mmu_split_huge_pages_root().  If the range is [1G + 1M, 2G), then the
> caller knows only the first [1G, 1G + 2M) needs splitting.  This also saves an
> unnecessary iter walk for the rest of the range [1G + 2M, 2G).
> 
> I think if we only consider 2M hugepages but not 1G pages, then it should not be
> that complicated to check the range and only call
> tdp_mmu_split_huge_pages_root() for the range that truly needs splitting?
