linux-kernel - Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aRbHtnMcoqM1gmL9@yzhao56-desk.sh.intel.com>
Date: Fri, 14 Nov 2025 14:09:58 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Huang, Kai" <kai.huang@...el.com>
CC: "Du, Fan" <fan.du@...el.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>,
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Hansen, Dave"
	<dave.hansen@...el.com>, "david@...hat.com" <david@...hat.com>,
	"thomas.lendacky@....com" <thomas.lendacky@....com>, "tabba@...gle.com"
	<tabba@...gle.com>, "vbabka@...e.cz" <vbabka@...e.cz>, "michael.roth@....com"
	<michael.roth@....com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "seanjc@...gle.com" <seanjc@...gle.com>,
	"pbonzini@...hat.com" <pbonzini@...hat.com>, "binbin.wu@...ux.intel.com"
	<binbin.wu@...ux.intel.com>, "ackerleytng@...gle.com"
	<ackerleytng@...gle.com>, "kas@...nel.org" <kas@...nel.org>, "Weiny, Ira"
	<ira.weiny@...el.com>, "Peng, Chao P" <chao.p.peng@...el.com>, "Yamahata,
 Isaku" <isaku.yamahata@...el.com>, "Annapurve, Vishal"
	<vannapurve@...gle.com>, "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
	"Miao, Jun" <jun.miao@...el.com>, "x86@...nel.org" <x86@...nel.org>,
	"pgonda@...gle.com" <pgonda@...gle.com>
Subject: Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce
 kvm_split_cross_boundary_leafs()

On Thu, Nov 13, 2025 at 07:02:59PM +0800, Huang, Kai wrote:
> On Thu, 2025-11-13 at 16:54 +0800, Yan Zhao wrote:
> > On Tue, Nov 11, 2025 at 06:42:55PM +0800, Huang, Kai wrote:
> > > On Thu, 2025-08-07 at 17:43 +0800, Yan Zhao wrote:
> > > >  static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > >  					 struct kvm_mmu_page *root,
> > > >  					 gfn_t start, gfn_t end,
> > > > -					 int target_level, bool shared)
> > > > +					 int target_level, bool shared,
> > > > +					 bool only_cross_bounday, bool *flush)
> > > >  {
> > > >  	struct kvm_mmu_page *sp = NULL;
> > > >  	struct tdp_iter iter;
> > > > @@ -1589,6 +1596,13 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > >  	 * level into one lower level. For example, if we encounter a 1GB page
> > > >  	 * we split it into 512 2MB pages.
> > > >  	 *
> > > > +	 * When only_cross_bounday is true, just split huge pages above the
> > > > +	 * target level into one lower level if the huge pages cross the start
> > > > +	 * or end boundary.
> > > > +	 *
> > > > +	 * No need to update @flush for !only_cross_bounday cases, which rely
> > > > +	 * on the callers to do the TLB flush in the end.
> > > > +	 *
> > > 
> > > s/only_cross_bounday/only_cross_boundary
> > > 
> > > From tdp_mmu_split_huge_pages_root()'s perspective, it's quite odd to only
> > > update 'flush' when 'only_cross_bounday' is true, because
> > > 'only_cross_bounday' can only results in less splitting.
> > I have to say it's a reasonable point.
> > 
> > > I understand this is because splitting S-EPT mapping needs flush (at least
> > > before non-block DEMOTE is implemented?).  Would it better to also let the
> > Actually the flush is only required for !TDX cases.
> > 
> > For TDX, either the flush has been performed internally within
> > tdx_sept_split_private_spt() 
> > 
> 
> AFAICT tdx_sept_split_private_spt() only does tdh_mem_track(), so KVM should
> still kick all vCPUs out of guest mode so other vCPUs can actually flush the
> TLB?
tdx_sept_split_private_spt() actually invokes tdx_track(), which performs the
kicking off all vCPUs by invoking
"kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE)".

> > or the flush is not required for future non-block
> > DEMOTE. So, the flush in KVM core on the mirror root may be skipped as a future
> > optimization for TDX if necessary.
> > 
> > > caller to decide whether TLB flush is needed?  E.g., we can make
> > > tdp_mmu_split_huge_pages_root() return whether any split has been done or
> > > not.  I think this should also work?
> > Do you mean just skipping the changes in the below "Hunk 1"?
> > 
> > Since tdp_mmu_split_huge_pages_root() originally did not do flush by itself,
> > which relied on the end callers (i.e.,kvm_mmu_slot_apply_flags(),
> > kvm_clear_dirty_log_protect(), and kvm_get_dirty_log_protect()) to do the flush
> > unconditionally, tdp_mmu_split_huge_pages_root() previously did not return
> > whether any split has been done or not.
> 
> Right.  But making it return any split has been done doesn't harm.
> 
> > 
> > So, if we want callers of kvm_split_cross_boundary_leafs() to do flush only
> > after splitting occurs, we have to return whether flush is required.
> 
> But assuming we always return whether "split has been done", the caller can also
> effectively know whether the flush is needed.
> 
> > 
> > Then, in this patch, seems only the changes in "Hunk 1" can be dropped.
> 
> I am thinking dropping both "Hunk 1" and "Hunk 3".  This at least makes
> kvm_split_cross_boundary_leafs() more reasonable, IMHO.
> 
> Something like below:
> 
> @@ -1558,7 +1558,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct
> tdp_iter *iter,
>  static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
>                                          struct kvm_mmu_page *root,
>                                          gfn_t start, gfn_t end,
> -                                        int target_level, bool shared)
> +                                        int target_level, bool shared,
> +                                        bool only_cross_boundary,
> +                                        bool *split)
>  {
>         struct kvm_mmu_page *sp = NULL;
>         struct tdp_iter iter;
> @@ -1584,6 +1586,9 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
>                 if (!is_shadow_present_pte(iter.old_spte) ||
> !is_large_pte(iter.old_spte))
>                         continue;
>  
> +               if (only_cross_boundary && !iter_cross_boundary(&iter, start,
> end))
> +                       continue;
> +
>                 if (!sp) {
>                         rcu_read_unlock();
>  
> @@ -1618,6 +1623,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
>                         goto retry;
>  
>                 sp = NULL;
> +               *split = true;
>         }
>  
>         rcu_read_unlock();
This looks more reasonable for tdp_mmu_split_huge_pages_root();

Given that splitting only adds a new page to the paging structure (unlike page
merging), I currently can't think of any current use cases that would be broken
by the lack of TLB flush before tdp_mmu_iter_cond_resched() releases the
mmu_lock.

This is because:
1) if the split is triggered in a fault path, the hardware shouldn't have cached
   the old huge translation.
2) if the split is triggered in a zap or convert path,
   - there shouldn't be concurrent faults on the range due to the protection of
     mmu_invalidate_range*.
   - for concurrent splits on the same range, though the other vCPUs may
     temporally see stale huge TLB entries after they believe they have
     performed a split, they will be kicked off to flush the cache soon after
     tdp_mmu_split_huge_pages_root() returns in the first vCPU/host thread.
     This should be acceptable since I don't see any special guest needs that
     rely on pure splits.

So I tend to agree with your suggestion though the implementation in this patch
is safer.

> Btw, I have to follow up this next week, since tomorrow is public holiday.
NP.