Message-ID: <aRwSkc10XQqY8RfE@yzhao56-desk.sh.intel.com>
Date: Tue, 18 Nov 2025 14:30:41 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Huang, Kai" <kai.huang@...el.com>
CC: "Du, Fan" <fan.du@...el.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Hansen, Dave"
<dave.hansen@...el.com>, "david@...hat.com" <david@...hat.com>,
"thomas.lendacky@....com" <thomas.lendacky@....com>, "vbabka@...e.cz"
<vbabka@...e.cz>, "tabba@...gle.com" <tabba@...gle.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"seanjc@...gle.com" <seanjc@...gle.com>, "binbin.wu@...ux.intel.com"
<binbin.wu@...ux.intel.com>, "kas@...nel.org" <kas@...nel.org>,
"pbonzini@...hat.com" <pbonzini@...hat.com>, "ackerleytng@...gle.com"
<ackerleytng@...gle.com>, "michael.roth@....com" <michael.roth@....com>,
"Weiny, Ira" <ira.weiny@...el.com>, "Peng, Chao P" <chao.p.peng@...el.com>,
"Yamahata, Isaku" <isaku.yamahata@...el.com>, "Annapurve, Vishal"
<vannapurve@...gle.com>, "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
"Miao, Jun" <jun.miao@...el.com>, "x86@...nel.org" <x86@...nel.org>,
"pgonda@...gle.com" <pgonda@...gle.com>
Subject: Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce
kvm_split_cross_boundary_leafs()
On Tue, Nov 18, 2025 at 08:14:17AM +0800, Huang, Kai wrote:
> On Fri, 2025-11-14 at 14:09 +0800, Yan Zhao wrote:
> > On Thu, Nov 13, 2025 at 07:02:59PM +0800, Huang, Kai wrote:
> > > On Thu, 2025-11-13 at 16:54 +0800, Yan Zhao wrote:
> > > > On Tue, Nov 11, 2025 at 06:42:55PM +0800, Huang, Kai wrote:
> > > > > On Thu, 2025-08-07 at 17:43 +0800, Yan Zhao wrote:
> > > > > > static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > > > > struct kvm_mmu_page *root,
> > > > > > gfn_t start, gfn_t end,
> > > > > > - int target_level, bool shared)
> > > > > > + int target_level, bool shared,
> > > > > > + bool only_cross_bounday, bool *flush)
> > > > > > {
> > > > > > struct kvm_mmu_page *sp = NULL;
> > > > > > struct tdp_iter iter;
> > > > > > @@ -1589,6 +1596,13 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > > > > * level into one lower level. For example, if we encounter a 1GB page
> > > > > > * we split it into 512 2MB pages.
> > > > > > *
> > > > > > + * When only_cross_bounday is true, just split huge pages above the
> > > > > > + * target level into one lower level if the huge pages cross the start
> > > > > > + * or end boundary.
> > > > > > + *
> > > > > > + * No need to update @flush for !only_cross_bounday cases, which rely
> > > > > > + * on the callers to do the TLB flush in the end.
> > > > > > + *
> > > > >
> > > > > s/only_cross_bounday/only_cross_boundary
> > > > >
> > > > > From tdp_mmu_split_huge_pages_root()'s perspective, it's quite odd to only
> > > > > update 'flush' when 'only_cross_bounday' is true, because
> > > > > 'only_cross_bounday' can only result in less splitting.
> > > > I have to say it's a reasonable point.
> > > >
> > > > > I understand this is because splitting S-EPT mapping needs flush (at least
> > > > > before non-block DEMOTE is implemented?). Would it be better to also let the
> > > > Actually the flush is only required for !TDX cases.
> > > >
> > > > For TDX, either the flush has been performed internally within
> > > > tdx_sept_split_private_spt()
> > > >
> > >
> > > AFAICT tdx_sept_split_private_spt() only does tdh_mem_track(), so KVM should
> > > still kick all vCPUs out of guest mode so other vCPUs can actually flush the
> > > TLB?
> > tdx_sept_split_private_spt() actually invokes tdx_track(), which kicks all
> > vCPUs out of guest mode by invoking
> > "kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE)".
>
> Oh thanks for the reminder.
>
> Then I am kinda confused why do you need to return @flush, especially when
> 'only_cross_boundary' is true which is for TDX case?
> So stepping back to why this 'flush' needs to be returned:
>
> - For TDX ('only_cross_boundary == true'):
>
> The caller doesn't need to flush TLB because it has already been done when huge
> page is actually split.
>
> - For non-TDX case ('only_cross_boundary == false'):
>
> AFAICT the only user of tdp_mmu_split_huge_pages_root() is "eager hugepage
> splitting" during log-dirty. And per per the current implementation there are
> two callers of tdp_mmu_split_huge_pages_root():
>
> kvm_mmu_try_split_huge_pages()
> kvm_mmu_slot_try_split_huge_pages()
>
> But they are both void functions which neither return whether a TLB flush is
> needed, nor do the TLB flush internally.
Actually, the callers of those two void functions do the TLB flush unconditionally
in the end, i.e., in
kvm_mmu_slot_apply_flags(),
kvm_clear_dirty_log_protect(), and
kvm_get_dirty_log_protect().
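Something like below is that caller-side pattern (just a sketch, simplified
from kvm_mmu_slot_apply_flags(); the real function checks more dirty-logging
state first, and helper names may differ slightly across kernel versions):

static void enable_dirty_logging_sketch(struct kvm *kvm,
                                        struct kvm_memory_slot *slot)
{
        /* Void helper: gives no hint about whether anything was split. */
        kvm_mmu_slot_try_split_huge_pages(kvm, slot, PG_LEVEL_4K);

        /* Write-protect the slot so that writes fault and get logged. */
        kvm_mmu_slot_remove_write_access(kvm, slot, PG_LEVEL_4K);

        /* Flush unconditionally before dirty tracking relies on it. */
        kvm_flush_remote_tlbs_memslot(kvm, slot);
}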
> So I am kinda confused.
>
> Perhaps you mean for "shared memory of TDX guest", the caller will also pass
> 'only_cross_boundary == true' and the caller needs to perform TLB flush?
Sorry for the confusion.
Currently 'only_cross_boundary == true' is only for TDX private memory.
Returning flush is because kvm_split_cross_boundary_leafs() may be invoked for
non-TDX cases as well in the future (though currently it's only invoked for
TDX). When that happens, it's better to return flush so that the caller doesn't
have to flush unconditionally.
Another reason is to keep consistency with tdp_mmu_zap_leafs(), which returns
flush without differentiating whether the zap is for a mirror root or not. So,
although kvm_mmu_remote_flush() on a mirror root is not necessary, removing it
is intentionally left as a future optimization.
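For example, a future non-TDX caller could then do something like below (just
a sketch; the signature of kvm_split_cross_boundary_leafs() and the range
arguments are assumptions here, not the actual patch code):

        bool flush = false;
        int r;

        /* Split only the huge leafs that cross the range boundaries. */
        r = kvm_split_cross_boundary_leafs(kvm, gfn_start, gfn_end, &flush);
        if (r)
                return r;

        /* Flush only if something was actually split. */
        if (flush)
                kvm_flush_remote_tlbs(kvm);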
> [...]
>
> > >
> > > Something like below:
> > >
> > > @@ -1558,7 +1558,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct
> > > tdp_iter *iter,
> > > static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > struct kvm_mmu_page *root,
> > > gfn_t start, gfn_t end,
> > > - int target_level, bool shared)
> > > + int target_level, bool shared,
> > > + bool only_cross_boundary,
> > > + bool *split)
> > > {
> > > struct kvm_mmu_page *sp = NULL;
> > > struct tdp_iter iter;
> > > @@ -1584,6 +1586,9 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > if (!is_shadow_present_pte(iter.old_spte) ||
> > > !is_large_pte(iter.old_spte))
> > > continue;
> > >
> > > + if (only_cross_boundary && !iter_cross_boundary(&iter, start,
> > > end))
> > > + continue;
> > > +
> > > if (!sp) {
> > > rcu_read_unlock();
> > >
> > > @@ -1618,6 +1623,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > goto retry;
> > >
> > > sp = NULL;
> > > + *split = true;
> > > }
> > >
> > > rcu_read_unlock();
> > This looks more reasonable for tdp_mmu_split_huge_pages_root().
> >
> > Given that splitting only adds a new page to the paging structure (unlike page
> > merging), I currently can't think of any use cases that would be broken
> > by the lack of TLB flush before tdp_mmu_iter_cond_resched() releases the
> > mmu_lock.
> >
> > This is because:
> > 1) if the split is triggered in a fault path, the hardware shouldn't have cached
> > the old huge translation.
> > 2) if the split is triggered in a zap or convert path,
> > - there shouldn't be concurrent faults on the range due to the protection of
> > mmu_invalidate_range*.
> > - for concurrent splits on the same range, though the other vCPUs may
> > temporarily see stale huge TLB entries after they believe they have
> > performed a split, they will be kicked off to flush the cache soon after
> > tdp_mmu_split_huge_pages_root() returns in the first vCPU/host thread.
> > This should be acceptable since I don't see any special guest needs that
> > rely on pure splits.
>
> Perhaps we should just go straight to the point:
>
> What does "hugepage split" do, and what's the consequence of not flushing TLB.
>
> Per make_small_spte(), the new child PTEs will carry all bits of hugepage PTE
> except they clear the 'hugepage bit (obviously)', and set the 'X' bit for NX
> hugepage thing.
>
> That means if we leave the stale hugepage TLB, the CPU is still able to find the
> correct PFN and AFAICT there shouldn't be any other problem here. For any fault
> due to the stale hugepage TLB missing the 'X' permission, AFAICT KVM will just
> treat this as a spurious fault, which isn't nice but should have no harm.
Right, that isn't nice, though it does no harm.
Besides, I'm thinking of a scenario that doesn't exist currently, though:
CPU 0                              CPU 1
a1. split pages
a2. write protect pages
                                   b1. split pages
                                   b2. write protect pages
                                   b3. start dirty page tracking
a3. flush TLB
a4. start dirty page tracking
If CPU 1 does not flush the TLB after b2 (e.g., because it finds the pages have
already been split and write-protected by a1 & a2), it will miss some dirty
pages.
Currently CPU 1 always flushes the TLB before b3 unconditionally, so there's no
problem.
> > So I tend to agree with your suggestion though the implementation in this patch
> > is safer.
>
> I am perhaps still missing something, as I am still trying to precisely
> understand in what cases you want to flush TLB when splitting hugepage.
>
> I kinda tend to think you eventually want to flush TLB because eventually you
> want to _ZAP_. But needing to flush due to zap and needing to flush due to
> split is kinda different I think.
Though I currently can't find any use case that depends on splitting alone
(e.g., a feature requiring that pages must be 4KB without any additional
permission changes), I just wanted to make the code safer in case I missed any
edge cases.
We surely don't want the window in which CPUs see a mix of huge and small
translations to last long.
Flushing the TLB before releasing the mmu_lock allows other threads operating
on the same range to see the updated translations in a timely manner.
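I.e., something like the pattern tdp_mmu_zap_leafs() uses, where the pending
flush is passed to the yield helper so the flush happens before mmu_lock can
be dropped (just a sketch of the loop body, not the actual patch code):

        if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) {
                /* Flushed (if needed) before yielding mmu_lock. */
                flush = false;
                continue;
        }

        /* ... split the huge leaf ... */
        flush = true;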